Thytu / SMIT

SMIT: A Simple Modality Integration Tool
MIT License

Current default example does not converge #8

Closed · Thytu closed this issue 5 months ago

Thytu commented 5 months ago

The current default example aims to be runnable on an A100 40GB, so it uses nf4 quantization and LoRA. However, those changes seem to prevent the model from converging.

The default example should be rewritten to both fit on an A100 40GB AND converge.

Quantization + LoRA (VRAM: 18 GB) [image]

Fine-tuning (linear only, w/o Continual Learning), fp16 (VRAM: 52 GB) [image]

Fine-tuning (linear only, w/ Continual Learning), fp16 (VRAM: 52 GB) [image]

Fine-tuning all layers, fp16 (VRAM: 79 GB) [image]
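For reference, here is a sketch of the kind of nf4 + LoRA setup described above, written against Hugging Face `transformers` and `peft`. This is an illustration only: the exact hyperparameters and target modules are assumptions, and SMIT's actual config may differ.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# nf4 4-bit quantization (values are illustrative, not SMIT's exact config)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA restricted to attention projections (hypothetical target modules)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```

Both objects would then be passed to `from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively.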
Thytu commented 5 months ago

Update: even without applying continual learning while fine-tuning in bf16, the model still does not converge.

[image]

This means a bug has been introduced either in the data_handler, in the training process, or in the forward method.

This highlights two things:

  1. A test suite should be written and integrated into SMIT (#11)
  2. Issue #6 should be resolved ASAP (it would help fix this kind of issue)
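On point 1, a convergence smoke test doesn't need the full model: run a few optimization steps on a toy problem and assert the loss drops by a large factor. A framework-free sketch of that pattern (function name and thresholds are illustrative, not SMIT's):

```python
def toy_training_run(steps: int = 50, lr: float = 0.1) -> list:
    """Minimize (w - 3)^2 with hand-computed gradient descent, recording the loss."""
    w, losses = 0.0, []
    for _ in range(steps):
        losses.append((w - 3.0) ** 2)
        w -= lr * 2.0 * (w - 3.0)  # gradient of (w - 3)^2 is 2(w - 3)
    return losses

losses = toy_training_run()
# A smoke test checks the trend, not exact values:
assert losses[-1] < 0.1 * losses[0], "training failed to converge"
```

The same assertion applied to a few steps of the real training loop would have caught this regression automatically.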
Thytu commented 5 months ago

Even after rolling back to an older version of the data_handler, the model still doesn't converge, which might indicate an issue in either the forward methods or the training algorithm.

[image]

I'm currently running a training run using commit 2bf64a77d726c3c13a93138c34b1067119315a41 to see if the issue still occurs.

Thytu commented 5 months ago

While 2bf64a77d726c3c13a93138c34b1067119315a41 does seem to converge, it still takes an abnormal amount of time.

[image]

Now testing a rollback to a1df5f6d26bfc79872562e9178b35dab78b436a2
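This manual rollback hunt is exactly what `git bisect` automates. A self-contained sketch of the workflow in a throwaway repo, where the "training run" is stubbed out as a marker check and the commit contents are hypothetical (in SMIT, the run script would be a short convergence check instead):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev

# Build a toy history: commits 3 and 4 carry the "bug".
for i in 1 2 3 4; do
  if [ "$i" -ge 3 ]; then marker=BROKEN; else marker=OK; fi
  echo "step $i: $marker" > train.cfg
  git add train.cfg
  git commit -qm "commit $i"
done

# HEAD is known bad, the root commit is known good.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"

# `git bisect run` marks each commit good/bad from the script's exit code;
# here the stand-in for a training run is a grep for the BROKEN marker.
git bisect run sh -c '! grep -q BROKEN train.cfg'

bad=$(git rev-parse refs/bisect/bad)   # first bad commit found by bisect
git log -1 --format='%h %s' "$bad"
git bisect reset >/dev/null
```

With a real convergence check as the run script, bisect would pinpoint the faulty commit without manually retrying candidate SHAs.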

Thytu commented 5 months ago

https://github.com/Thytu/SMIT/commit/a1df5f6d26bfc79872562e9178b35dab78b436a2 does converge

[image]

Now investigating which part of the code is faulty

Thytu commented 5 months ago

I've identified two issues with the current setup:

  1. It appears crucial to freeze the non-linear layers of the decoder.
  2. There's an unresolved bug: Previously, training the entire model, including the non-linear layers, was possible, but it's no longer feasible in the latest version.

Regarding the first issue, a fix is forthcoming. As for the second one, I'll prioritize other tasks for now and defer addressing it.
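For the first fix, freezing everything except the linear layers can be done generically in PyTorch. A minimal sketch, assuming a torch `nn.Module` decoder (the toy structure below is hypothetical; SMIT's decoder differs):

```python
import torch.nn as nn

def freeze_non_linear(model: nn.Module) -> None:
    """Freeze every parameter except those owned by nn.Linear layers."""
    # Freeze everything first...
    for param in model.parameters():
        param.requires_grad = False
    # ...then re-enable gradients for linear layers only.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            for param in module.parameters():
                param.requires_grad = True

# Toy decoder stand-in (hypothetical structure):
decoder = nn.Sequential(nn.Embedding(10, 8), nn.Linear(8, 8), nn.LayerNorm(8))
freeze_non_linear(decoder)
trainable = [n for n, p in decoder.named_parameters() if p.requires_grad]
```

After the call, only the `nn.Linear` weights and biases keep `requires_grad=True`, so the optimizer (built from `filter(lambda p: p.requires_grad, ...)`) never touches the non-linear layers.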

Thytu commented 5 months ago

Quantizing to 4-bit also prevents the model from converging. It might also be a good idea to create a short guide on what works and what doesn't, as I'm already experimenting quite a lot with different configs.

Thytu commented 5 months ago

Splitting this issue into two separate issues:

  1. Fixing the default example so that it converges (this one)
  2. Making the default example GPU-poor friendly