
SMIT: A Simple Modality Integration Tool
MIT License

SMIT default example should be GPU-poor friendly #12

Closed Thytu closed 5 months ago

Thytu commented 5 months ago

Once #8 is merged, the default example will consume a considerable amount of VRAM (~77 GB), which prevents many potential users from testing SMIT.

SMIT has been designed to be GPU-poor friendly from the beginning and its default example should showcase it.

There is still plenty of room for improvement to reduce VRAM usage.

For more ideas, see: Methods and tools for efficient training on a single GPU
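To see why the optimizer is such a large lever, here is a back-of-envelope sketch (not SMIT code; the 7B parameter count is a hypothetical example). Standard AdamW keeps two fp32 moment tensors per parameter (~8 bytes/param), while an 8-bit optimizer like adamw_bnb_8bit keeps those states in 8 bits (~2 bytes/param):

```python
def optimizer_state_gb(n_params: int, bytes_per_param: float) -> float:
    """Rough optimizer-state footprint in GB (states only, no weights/grads)."""
    return n_params * bytes_per_param / 1024**3

n = 7_000_000_000  # hypothetical 7B-parameter model

# AdamW fp32: two moments * 4 bytes = 8 bytes/param
print(f"AdamW fp32 states: ~{optimizer_state_gb(n, 8):.1f} GB")   # ~52.2 GB

# adamw_bnb_8bit: two moments * 1 byte = 2 bytes/param
print(f"8-bit AdamW states: ~{optimizer_state_gb(n, 2):.1f} GB")  # ~13.0 GB
```

This ignores weights, gradients, and activations, but it shows how tens of GB can come from the optimizer states alone.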

Thytu commented 5 months ago

More information on the optimizer's impact on VRAM usage:

(screenshot: optimizer VRAM comparison)

Thytu commented 5 months ago

Using adamw_bnb_8bit as the optimizer and quantizing the decoder to 4 bits seems to be a good option. I just need to find the right values for batch size (BS) and gradient accumulation (GA).
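For reference, the combination described above could be expressed with standard transformers/bitsandbytes options (a sketch, not SMIT's actual configuration; `output_dir` is a placeholder):

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 quantization for the decoder (requires a bitsandbytes-capable GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 8-bit AdamW optimizer plus the BS/GA values being explored in this thread
training_args = TrainingArguments(
    output_dir="out",                  # placeholder path
    optim="adamw_bnb_8bit",            # 8-bit optimizer states
    per_device_train_batch_size=2,     # BS
    gradient_accumulation_steps=4,     # GA
)
```

The `bnb_config` would then be passed as `quantization_config` when loading the decoder.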

Thytu commented 5 months ago

A batch size of 2 with gradient accumulation of 4 works but takes ~400 min (~6.7 h) to converge, which is far too long. Trying to increase BS by 1.

Thytu commented 5 months ago

Increasing BS to 3 pushes VRAM usage above 40 GB (~46 GB). I'll probably release it with nf4 quantization, BS 2, GA 4.