ayulockin / neurips-llm-efficiency-challenge

Starter pack for NeurIPS LLM Efficiency Challenge 2023.
https://llm-efficiency-challenge.github.io/challenge
Apache License 2.0
118 stars 44 forks

OOM error while running LoRA #5

Open bmanikan opened 1 year ago

bmanikan commented 1 year ago

My Parameters are:

python lit-gpt/finetune/lora.py --data_dir data/dolly/ --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --precision bf16-true --out_dir out/lora/llama-2-7b
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'override_max_seq_length': 512, 'learning_rate': 0.0002, 'batch_size': 1, 'micro_batch_size': 1, 'gradient_accumulation_iters': 1, 'max_iters': 20000, 'weight_decay': 0.01, 'lora_r': 4, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': True, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'condense_ratio': 1, 'r': 4, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': True, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}

Error Trace is:

Estimated TFLOPs: 154.46
Measured TFLOPs: 134.33
Traceback (most recent call last):
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/finetune/lora.py", line 390, in <module>
    CLI(setup)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/jsonargparse/_cli.py", line 96, in CLI
    return _run_component(components, cfg_init)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/jsonargparse/_cli.py", line 181, in _run_component
    return component(**cfg)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/finetune/lora.py", line 116, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir, quantize)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 834, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 920, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 925, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/finetune/lora.py", line 177, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/finetune/lora.py", line 248, in train
    logits = model(input_ids, max_seq_length=max_seq_length, lm_head_chunk_size=64)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 118, in forward
    output = self._forward_module(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/lit_gpt/lora.py", line 525, in forward
    x, *_ = block(x, (cos, sin), max_seq_length)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/lit_gpt/model.py", line 173, in forward
    x = x + self.mlp(self.norm_2(x))
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/lit_gpt/model.py", line 294, in forward
    x_fc_2 = self.fc_2(x)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/balamanikandan/Desktop/BALA/Projects/LLM/neurips-llm-efficiency-challenge/lit-gpt/lit_gpt/lora.py", line 146, in forward
    pretrained = self.linear(x)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 23.64 GiB of which 41.75 MiB is free. Including non-PyTorch memory, this process has 23.15 GiB memory in use. Of the allocated memory 22.22 GiB is allocated by PyTorch, and 481.55 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
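
For what it's worth, the allocator hint at the end of the error can be tried by setting PYTORCH_CUDA_ALLOC_CONF when launching the script; the 128 below is only an illustrative starting value, not a tuned one:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python lit-gpt/finetune/lora.py --data_dir data/dolly/ --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --precision bf16-true --out_dir out/lora/llama-2-7b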

I went all the way down to a batch_size of 1 and reduced all the other parameters, but I am still getting this OOM error.

I have an RTX 4090 GPU with 24 GB of VRAM.

Can anyone help?

ayulockin commented 1 year ago

Try turning lora_key off (lora_key=False).

You can also try lowering lora_r from 4 to 2.

Your override_max_seq_length is 512, which can be reduced as well.

Note that all of this will reduce the quality of the fine-tuned model, but you will at least have a baseline to work from.
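
A minimal sketch of those edits, assuming the lora_* and override_max_seq_length hyperparameters are module-level constants near the top of lit-gpt/finetune/lora.py (as the hparams dump in the issue suggests; exact names and locations may differ in your checkout):

```python
# Hypothetical edits to the hyperparameter constants printed in the hparams dump above:
# lower the LoRA rank, drop LoRA on the key projection, and cap the sequence length.
lora_r = 2                      # down from 4
lora_key = False                # was True
override_max_seq_length = 256   # down from 512
```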

bmanikan commented 1 year ago

Tried it, still getting the OOM error. I went down to 8 for override_max_seq_length. I am using CUDA version 12.1; could that be a problem?

nahidalam commented 1 year ago

I have the same problem. I think 24 GB of memory is not enough for this.
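
As a rough back-of-envelope estimate (not a measurement): the 7B parameters in bf16 already take about 7e9 × 2 bytes ≈ 13 GB for the frozen weights alone, before activations at a 512-token sequence length, the LoRA gradients and optimizer states, and CUDA/framework overhead, so a 24 GB card leaves very little headroom without quantization.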

ayulockin commented 1 year ago

Did you try fine-tuning with QLoRA? I guess quantising the pretrained weights to 4 bits might help.
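
The traceback above shows setup already forwarding a quantize argument, so 4-bit fine-tuning can presumably be requested from the command line. The flag value below ("bnb.nf4") is an assumption based on lit-gpt's bitsandbytes integration and may differ in your version:

python lit-gpt/finetune/lora.py --data_dir data/dolly/ --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --precision bf16-true --quantize "bnb.nf4" --out_dir out/lora/llama-2-7b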

Another suggestion would be to use the SGD optimiser instead of AdamW. The Adam optimiser maintains two states per trainable parameter, which adds extra memory on top of the parameters and gradients, so using SGD might help. You can change the optimiser in this line of code: https://github.com/ayulockin/lit-gpt/blob/b6829289f977e65c3588bbb28737986fe38f8ec1/finetune/lora.py#L154
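
A sketch of that swap at the linked line, assuming the trainable parameters are already collected into a variable at that point in the script (names may differ from your checkout):

```python
import torch

# Replace AdamW with plain SGD to avoid storing Adam's two state tensors per
# trainable parameter. `trainable_params`, `learning_rate`, and `weight_decay`
# stand in for whatever the script defines at that point.
optimizer = torch.optim.SGD(trainable_params, lr=learning_rate, weight_decay=weight_decay)
```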

bmanikan commented 1 year ago

I have tried both QLoRA and SGD, but no luck. The 3B model runs perfectly. Does having 32 GB of RAM affect this case?

ayulockin commented 1 year ago

I don't think RAM should be an issue here. I am fine-tuning a 7B model on an A100 right now with the Lion optimizer, micro_batch_size=2, and a batch_size of 128.
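
For reference, Lion is not part of torch.optim; one commonly used implementation is the lion-pytorch package. Which implementation was used for this run is not stated, so treat the snippet below as an assumption with illustrative hyperparameters:

```python
# pip install lion-pytorch
from lion_pytorch import Lion

# Lion keeps a single momentum state per parameter, so it is lighter than AdamW.
# `trainable_params` and the learning rate are placeholders, not values from this run.
optimizer = Lion(trainable_params, lr=1e-4, weight_decay=1e-2)
```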

nahidalam commented 1 year ago

@ayulockin is your A100 40GB or 80GB?

nahidalam commented 1 year ago

@bmanikan which 3B model did you use?

nahidalam commented 1 year ago

Got hold of an A100 with 40 GB of memory. I am running into the same issue. I tried everything that @ayulockin suggested.

My parameters are

python lit-gpt/finetune/lora.py --data_dir data/dolly/ --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --precision bf16-true --out_dir out/lora/llama-2-7b
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'override_max_seq_length': 256, 'learning_rate': 0.0002, 'batch_size': 4, 'micro_batch_size': 2, 'gradient_accumulation_iters': 2, 'max_iters': 20000, 'weight_decay': 0.01, 'lora_r': 2, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'condense_ratio': 1, 'r': 2, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}

OOM error message

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 99.06 MiB is free. Including non-PyTorch memory, this process has 39.29 GiB memory in use. Of the allocated memory 37.73 GiB is allocated by PyTorch, and 1.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ayulockin commented 1 year ago

I have a 40 GB A100.

Is your flash attention correctly installed?
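
One quick way to check (assuming "flash attention" here means PyTorch's scaled_dot_product_attention flash backend, which lit-gpt relies on rather than a separate flash-attn install):

```python
import torch

# True means the flash SDPA backend is enabled in this PyTorch build; it still
# only gets used when the inputs (dtype, head size, device) are supported.
print(torch.backends.cuda.flash_sdp_enabled())
print(torch.__version__, torch.version.cuda)
```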

nahidalam commented 1 year ago

@ayulockin I validated the setup with the command you shared and it seemed fine.

python lit-gpt/generate/base.py --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf --prompt "Tell me an interesting fun fact about earth:"