Closed hanhanpp closed 3 weeks ago
Using the following command:
CUDA_VISIBLE_DEVICES=0 WANDB_DISABLED=true python -m sft.finetune --model GreenBitAI/Llama-3-8B-layer-mix-bpw-2.2 --tune-qweight-only --galore --galore-rank 64 --optimizer adamw8bit --batch-size 1 --seqlen 96
I can fine-tune on a single RTX 3090 with 24GB of GPU memory. You can adjust your configuration starting from this: use bpw-3 instead of bpw-2.2; use the DiodeMix optimizer (16-bit) instead of adamw8bit; remove the --galore* options, since GaLore slows training down when it recomputes the low-rank projection matrices; or use a larger seqlen or batch size, etc.
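For example, dropping the GaLore flags and switching the optimizer as suggested would look roughly like this (a sketch only: the exact bpw-3 model ID and the `diodemix` optimizer value are assumptions based on the comment above, not verified flag values):

```shell
# Sketch of an adjusted run based on the command above.
# Assumptions: the bpw-3 checkpoint name and the "diodemix" optimizer
# value are inferred from this thread; check the repo's sft.finetune
# --help output for the exact spellings.
CUDA_VISIBLE_DEVICES=0 WANDB_DISABLED=true python -m sft.finetune \
  --model GreenBitAI/Llama-3-8B-layer-mix-bpw-3.0 \
  --tune-qweight-only \
  --optimizer diodemix \
  --batch-size 1 --seqlen 96
```

From there you can raise `--seqlen` or `--batch-size` until you approach your card's memory limit.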
Thanks for your reply!
Hi, I tried to fine-tune Llama-3 8B bpw_3 on an A100/40GB, but it ran out of memory. How much memory does it need? Or which model can be fine-tuned on a single GPU card?