bmanikan opened this issue 1 year ago
Try turning lora_key off (lora_key=False). You can also try lowering lora_rank from 4 to 2. Your override_max_seq_length is 512, which can be reduced as well.
Note that all of this will reduce the quality of the fine-tuned model, but you will at least have a baseline and can work from there.
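For reference, in the lit-gpt version referenced in this thread those knobs appear to live as plain module-level constants near the top of finetune/lora.py; a minimal sketch of the reduced settings (names are taken from the hyperparameter dump further down this thread, so double-check them against your checkout):

```python
# finetune/lora.py -- reduced-memory settings (a sketch only; adjust the
# names/values if your lit-gpt version differs)
override_max_seq_length = 256  # truncate long samples instead of 512+
micro_batch_size = 1           # smallest possible per-step batch
lora_r = 2                     # lower rank -> fewer trainable parameters
lora_alpha = 16
lora_dropout = 0.05
lora_query = True
lora_key = False               # LoRA off for the key projection
lora_value = True
lora_projection = False
lora_mlp = False
lora_head = False
```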
Tried it; still getting the OOM error. I went all the way down to 8 for override_max_seq_length.
I am using CUDA version 12.1; will that be a problem?
I have the same problem. I think 24 GB of memory is not enough for this.
Did you try QLoRA for fine-tuning? I guess quantising the frozen base weights to 4 bits (as QLoRA does) might help.
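If your lit-gpt version exposes a quantize option for finetune/lora.py you can use that directly; otherwise, purely as an illustration of what QLoRA's 4-bit loading means (this is the Hugging Face transformers + bitsandbytes route, not lit-gpt's API):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA: keep the frozen base weights in 4-bit NF4 and train only the
# LoRA adapters on top; this is what cuts the memory footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```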
Another suggestion would be to use the SGD optimiser instead of AdamW. The Adam optimiser maintains two states per trainable parameter, requiring double the memory for optimiser state. Using SGD might help. You can change the optimiser in this line of code: https://github.com/ayulockin/lit-gpt/blob/b6829289f977e65c3588bbb28737986fe38f8ec1/finetune/lora.py#L154
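A minimal sketch of that swap (variable names are assumed from the linked file, so check your local copy):

```python
import torch

# AdamW keeps two extra tensors (exp_avg, exp_avg_sq) per trainable parameter;
# plain SGD keeps no extra state, so the optimizer memory all but disappears.
trainable_params = [p for p in model.parameters() if p.requires_grad]

# before (roughly):
# optimizer = torch.optim.AdamW(trainable_params, lr=learning_rate, weight_decay=weight_decay)
optimizer = torch.optim.SGD(trainable_params, lr=learning_rate, weight_decay=weight_decay)
```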
I have tried both QLoRA and SGD, but no luck. The 3B model runs perfectly. Does having 32 GB of RAM affect this?
I don't think RAM should be an issue here. I am fine-tuning a 7B model on an A100 right now with the Lion optimizer, micro_batch_size=2 and a batch_size of 128.
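For what it's worth, Lion keeps a single momentum buffer per parameter, so it sits between SGD and AdamW in optimizer-state memory. One way to use it (assuming the third-party lion-pytorch package, which is not bundled with lit-gpt):

```python
from lion_pytorch import Lion  # pip install lion-pytorch

# One momentum buffer per parameter (AdamW keeps two). Lion generally wants a
# smaller learning rate and a larger weight decay than AdamW.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = Lion(trainable_params, lr=2e-5, weight_decay=0.01)
```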
@ayulockin is your A100 40GB or 80GB?
@bmanikan which 3B model did you use?
Got hold of an A100 with 40 GB of memory and I'm running into the same issue. I tried everything that @ayulockin suggested.
My parameters are:
python lit-gpt/finetune/lora.py --data_dir data/dolly/ --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --precision bf16-true --out_dir out/lora/llama-2-7b
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'override_max_seq_length': 256, 'learning_rate': 0.0002, 'batch_size': 4, 'micro_batch_size': 2, 'gradient_accumulation_iters': 2, 'max_iters': 20000, 'weight_decay': 0.01, 'lora_r': 2, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'condense_ratio': 1, 'r': 2, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}
OOM error message
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 99.06 MiB is free. Including non-PyTorch memory, this process has 39.29 GiB memory in use. Of the allocated memory 37.73 GiB is allocated by PyTorch, and 1.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
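One thing the message itself suggests is the allocator hint; it costs nothing to try (the 128 MB split size below is an arbitrary starting value, not a lit-gpt recommendation):

```python
import os

# Must be set before CUDA is initialised (or export it in the shell before
# launching finetune/lora.py); it caps block splits to reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```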
I have a 40 GB A100.
Is your flash attention correctly installed?
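A quick way to check (assuming your lit-gpt build goes through PyTorch's scaled_dot_product_attention rather than a separately installed flash-attn package):

```python
import torch

# PyTorch 2.x exposes queries for its SDPA backends; if the flash and
# memory-efficient kernels are unavailable, attention falls back to the math
# backend, which materialises the full attention matrix and uses far more memory.
print(torch.__version__, torch.cuda.get_device_name(0))
print("flash sdp enabled:", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient sdp enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
```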
@ayulockin I validated the setup with the command you shared and it seemed fine.
python lit-gpt/generate/base.py --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --prompt "Tell me an interesting fun fact about earth:"
My parameters are:
Error trace is:
I went all the way down to a batch_size of 1 and reduced all the other parameters, but I am still getting this OOM error.
I have an RTX 4090 GPU with 24 GB of VRAM.
Can anyone help?