Thytu / SMIT

SMIT: A Simple Modality Integration Tool
MIT License

Current default example does not converge #8

Closed · Thytu closed this issue 5 months ago

Thytu commented 5 months ago

The current default example aims to be runnable on an A100 40GB, so it uses nf4 quantization and LoRA. However, those changes seem to prevent the model from converging.

The default example should be rewritten to both fit on an A100 40GB AND converge.

Quantization + LoRA (VRAM: 18 GB) [image]

Fine-tuning (linear only, w/o Continual Learning), fp16 (VRAM: 52 GB) [image]

Fine-tuning (linear only, w/ Continual Learning), fp16 (VRAM: 52 GB) [image]

Fine-tuning all layers, fp16 (VRAM: 79 GB) [image]
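For reference, here is a sketch of the kind of nf4 + LoRA setup described above, written against Hugging Face `transformers` and `peft`. This is an illustration only: the exact hyperparameters and target modules are assumptions, and SMIT's actual config may differ.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# nf4 4-bit quantization (values are illustrative, not SMIT's exact config)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA restricted to attention projections (hypothetical target modules)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```

Both objects would then be passed to `from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively.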
Thytu commented 5 months ago

Update: even without applying continual learning while fine-tuning in bf16, the model still does not converge.

[image]

This means a bug has been introduced either in the data_handler, in the training process, or in the forward method.

This highlights two things:

  1. A test suite should be written and integrated into SMIT (#11)
  2. Issue #6 should be resolved ASAP (it would help fix this kind of issue)
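On point 1, a convergence smoke test doesn't need the full model: run a few optimization steps on a toy problem and assert the loss drops by a large factor. A framework-free sketch of that pattern (function name and thresholds are illustrative, not SMIT's):

```python
def toy_training_run(steps: int = 50, lr: float = 0.1) -> list:
    """Minimize (w - 3)^2 with hand-computed gradient descent, recording the loss."""
    w, losses = 0.0, []
    for _ in range(steps):
        losses.append((w - 3.0) ** 2)
        w -= lr * 2.0 * (w - 3.0)  # gradient of (w - 3)^2 is 2(w - 3)
    return losses

losses = toy_training_run()
# A smoke test checks the trend, not exact values:
assert losses[-1] < 0.1 * losses[0], "training failed to converge"
```

The same assertion applied to a few steps of the real training loop would have caught this regression automatically.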
Thytu commented 5 months ago

Even after rolling back to an older version of the data_handler, the model still doesn't converge, which might indicate an issue in either the forward methods or the training algorithm.

[image]

I'm currently running a training run using commit 2bf64a77d726c3c13a93138c34b1067119315a41 to see if the issue still occurs.

Thytu commented 5 months ago

While 2bf64a77d726c3c13a93138c34b1067119315a41 does seem to converge, it still takes an abnormal amount of time.

[image]

Now testing a rollback to a1df5f6d26bfc79872562e9178b35dab78b436a2
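This manual rollback hunt is exactly what `git bisect` automates. A self-contained sketch of the workflow in a throwaway repo, where the "training run" is stubbed out as a marker check and the commit contents are hypothetical (in SMIT, the run script would be a short convergence check instead):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev

# Build a toy history: commits 3 and 4 carry the "bug".
for i in 1 2 3 4; do
  if [ "$i" -ge 3 ]; then marker=BROKEN; else marker=OK; fi
  echo "step $i: $marker" > train.cfg
  git add train.cfg
  git commit -qm "commit $i"
done

# HEAD is known bad, the root commit is known good.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"

# `git bisect run` marks each commit good/bad from the script's exit code;
# here the stand-in for a training run is a grep for the BROKEN marker.
git bisect run sh -c '! grep -q BROKEN train.cfg'

bad=$(git rev-parse refs/bisect/bad)   # first bad commit found by bisect
git log -1 --format='%h %s' "$bad"
git bisect reset >/dev/null
```

With a real convergence check as the run script, bisect would pinpoint the faulty commit without manually retrying candidate SHAs.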

Thytu commented 5 months ago

https://github.com/Thytu/SMIT/commit/a1df5f6d26bfc79872562e9178b35dab78b436a2 does converge

[image]

Now investigating which part of the code is faulty

Thytu commented 5 months ago

I've identified two issues with the current setup:

  1. It appears crucial to freeze the non-linear layers of the decoder.
  2. There's an unresolved bug: Previously, training the entire model, including the non-linear layers, was possible, but it's no longer feasible in the latest version.

Regarding the first issue, a fix is forthcoming. As for the second one, I'll prioritize other tasks for now and defer addressing it.
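For the first fix, freezing everything except the linear layers can be done generically in PyTorch. A minimal sketch, assuming a torch `nn.Module` decoder (the toy structure below is hypothetical; SMIT's decoder differs):

```python
import torch.nn as nn

def freeze_non_linear(model: nn.Module) -> None:
    """Freeze every parameter except those owned by nn.Linear layers."""
    # Freeze everything first...
    for param in model.parameters():
        param.requires_grad = False
    # ...then re-enable gradients for linear layers only.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            for param in module.parameters():
                param.requires_grad = True

# Toy decoder stand-in (hypothetical structure):
decoder = nn.Sequential(nn.Embedding(10, 8), nn.Linear(8, 8), nn.LayerNorm(8))
freeze_non_linear(decoder)
trainable = [n for n, p in decoder.named_parameters() if p.requires_grad]
```

After the call, only the `nn.Linear` weights and biases keep `requires_grad=True`, so the optimizer (built from `filter(lambda p: p.requires_grad, ...)`) never touches the non-linear layers.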

Thytu commented 5 months ago

Quantizing to 4-bit also prevents the model from converging. It might also be a good idea to create a short guide on what works and what doesn't, as I'm already experimenting quite a lot with different configs.

Thytu commented 5 months ago

Splitting this issue into two separate issues:

  1. Fixing the default example so that it converges (this one)
  2. Making the default example GPU-poor friendly