Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0
7.95k stars 797 forks

OOM Error on RTX 3090 24 GB with Llama-2-7B-hf #553

Closed khizarhussain19 closed 7 months ago

khizarhussain19 commented 9 months ago

Hi, I am getting an OOM error when I try to finetune Llama-2-7b-hf.

python3 finetune/lora.py --precision "bf16-true" --quantize "bnb.nf4"
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'eval_max_new_tokens': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.0003, 'batch_size': 16, 'micro_batch_size': 1, 'gradient_accumulation_iters': 16, 'max_iters': 50000, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model '/media/khizar/Data/Projects/MedAide/Llama 2/lit-gpt/checkpoints/meta-llama/Llama-2-7b-chat-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-chat-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'rope_condense_ratio': 1, 'rope_base': 10000, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False, 'head_size': 128, 'rope_n_elem': 128}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
/home/khizar/.local/lib/python3.8/site-packages/lightning/fabric/fabric.py:945: PossibleUserWarning: The model passed to `Fabric.setup()` has 66 parameters on different devices (for example 'transformer.wte.weight' on cuda:0 and 'lm_head.linear.weight' on cpu). Since `move_to_device=True`, all parameters will be moved to the new device. If this is not desired, set `Fabric.setup(..., move_to_device=False)`.
  rank_zero_warn(
Global seed set to 1337
MAX SEQUENCE LENGTH
4096
Validating ...
Recommend a movie for me to watch during the weekend and explain the reason.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:
I highly recommend "The Shawshank Redemption" for your upcoming weekend. This movie is a timeless classic that will leave you on the edge of your seat and emotionally invested in the characters. The story follows the journey of two inmates, Andy Dufresne and Red, as they navigate the harsh realities of prison life and ultimately find hope and redemption. The movie's themes of perseverance, friendship,
Estimated TFLOPs: 154.49
Measured TFLOPs: 134.35
Traceback (most recent call last):
  File "finetune/lora.py", line 331, in <module>
    CLI(setup)
  File "/home/khizar/.local/lib/python3.8/site-packages/jsonargparse/_cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/khizar/.local/lib/python3.8/site-packages/jsonargparse/_cli.py", line 147, in _run_component
    return component(**cfg)
  File "finetune/lora.py", line 90, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir, quantize)
  File "/home/khizar/.local/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 836, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/khizar/.local/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 922, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/khizar/.local/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 927, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "finetune/lora.py", line 145, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
  File "finetune/lora.py", line 207, in train
    logits = model(input_ids, lm_head_chunk_size=128)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/khizar/.local/lib/python3.8/site-packages/lightning/fabric/wrappers.py", line 121, in forward
    output = self._forward_module(*args, **kwargs)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/khizar/Data/Projects/MedAide/Llama 2/lit-gpt/lit_gpt/lora.py", line 498, in forward
    x = block(x, cos, sin, mask, input_pos)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/khizar/Data/Projects/MedAide/Llama 2/lit-gpt/lit_gpt/model.py", line 154, in forward
    h = self.attn(n_1, cos, sin, mask, input_pos)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/khizar/Data/Projects/MedAide/Llama 2/lit-gpt/lit_gpt/model.py", line 228, in forward
    return self.proj(y)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/khizar/Data/Projects/MedAide/Llama 2/lit-gpt/lit_gpt/lora.py", line 146, in forward
    pretrained = self.linear(x)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/khizar/.local/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 248, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/home/khizar/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/home/khizar/.local/lib/python3.8/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/khizar/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacty of 23.70 GiB of which 30.31 MiB is free. Including non-PyTorch memory, this process has 23.25 GiB memory in use. Of the allocated memory 22.04 GiB is allocated by PyTorch, and 909.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have already tried all the solutions mentioned in the OOM guide:

  1. I have tried all possible flag combinations, i.e.

     --precision "bf16-true", --quantize "bnb.nf4", --quantize "bnb.nf4-dq",
     --precision "bf16-true" --quantize "bnb.nf4", --precision "bf16-true" --quantize "bnb.nf4-dq".
  2. I have tried batch_size: 16 with micro_batch_size: 1.

  3. I am using a single GPU, so I can't really use sharding across multiple GPUs.

Also, I have tried the exact same configuration and code on an RTX 6000 Ada (48 GB VRAM), and even that kept running out of memory.

It would be very helpful if you could help me with this problem. Thank you.
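
For reference, a rough back-of-envelope estimate of where the memory goes with these settings (a sketch only; the activation term is a crude approximation and the exact numbers depend on the implementation):

# Rough memory estimate for LoRA finetuning of a 7B model (back-of-envelope only).
# Uses the parameter counts printed in the log above; activation memory is a guess.
frozen_params = 6_738_415_616      # non-trainable Llama-2-7B weights
lora_params = 4_194_304            # trainable LoRA weights

bytes_bf16 = 2.0                   # bf16-true: 2 bytes per weight
bytes_nf4 = 0.5 + 0.0625           # ~4-bit weights plus per-block quantization constants (approx.)

print(f"frozen weights bf16: ~{frozen_params * bytes_bf16 / 1e9:.1f} GB")  # ~13.5 GB
print(f"frozen weights nf4:  ~{frozen_params * bytes_nf4 / 1e9:.1f} GB")   # ~3.8 GB
print(f"optimizer states:    ~{lora_params * 2 * 4 / 1e9:.3f} GB")         # AdamW on LoRA params only, negligible
# Whatever remains of the 24 GB goes to activations and temporaries, which grow
# roughly linearly with micro_batch_size * max_seq_length; with a 4096-token
# maximum sequence length that is what pushes the card over the edge.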

carmocca commented 9 months ago

This would be a bug as it doesn't match what was reported in https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/finetune_lora.md#running-the-finetuning

goog commented 9 months ago

I have the same problem when running on the LIMA dataset from the blog. Test command:

python finetune/lora.py  --checkpoint_dir checkpoints/NousResearch/Llama-2-7b-hf  --data_dir data/lima --precision bf16-true 
--- a/finetune/lora.py
+++ b/finetune/lora.py
@@ -35,15 +35,15 @@ eval_max_new_tokens = 100
 log_interval = 1
 devices = 1
 # change this value to force a maximum sequence length
-override_max_seq_length = None
+override_max_seq_length = 4096

 # Hyperparameters
 learning_rate = 3e-4
-batch_size = 128
-micro_batch_size = 4
+batch_size = 4
+micro_batch_size = 1
 gradient_accumulation_iters = batch_size // micro_batch_size
 assert gradient_accumulation_iters > 0
-max_iters = 50000  # train dataset size
+max_iters = 1000  # train dataset size
 weight_decay = 0.01
 lora_r = 8
 lora_alpha = 16
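
For reference, a minimal sketch of how the hyperparameters in this diff interact (same variable names as in finetune/lora.py; the memory note is a general observation, not a measurement):

# Effective batch size comes from gradient accumulation, mirroring the
# `gradient_accumulation_iters = batch_size // micro_batch_size` line above.
batch_size = 4
micro_batch_size = 1
gradient_accumulation_iters = batch_size // micro_batch_size
assert gradient_accumulation_iters > 0

effective_batch = micro_batch_size * gradient_accumulation_iters
print(effective_batch)  # 4

# Only one micro-batch is resident on the GPU at a time, so activation memory
# scales with micro_batch_size * max_seq_length; batch_size only changes how
# many forward/backward passes run before each optimizer.step().
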
khizarhussain19 commented 9 months ago

> This would be a bug as it doesn't match what was reported in https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/finetune_lora.md#running-the-finetuning

I think so too. I have been doing a lot of finetuning on an RTX 3090 with lit-llama on the 7B model and it has been working fine. Any updates on this?

goog commented 9 months ago

Here is my 4090 log; the checkpoint is lit_model.pth (26 GB):

root@autodl-container-90e311ae3c-aec8320b:~/autodl-tmp/lit-gpt# python finetune/lora.py  --checkpoint_dir checkpoints/NousResearch/Llama-2-7b-hf  --data_dir data/lima --precision bf16-true
device  1
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'eval_max_new_tokens': 100, 'log_interval': 1, 'devices': 1, 'override_max_seq_length': 2048, 'learning_rate': 0.0003, 'batch_size': 2, 'micro_batch_size': 1, 'gradient_accumulation_iters': 2, 'max_iters': 900, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
before launch
Global seed set to 1337
Loading model 'checkpoints/NousResearch/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'rope_condense_ratio': 1, 'rope_base': 10000, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False, 'head_size': 128, 'rope_n_elem': 128}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
Global seed set to 1337
Validating ...
Recommend a movie for me to watch during the weekend and explain the reason.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:
I recommend The Little Prince because it is an animated movie based on one of the most famous novels in history. The movie is very sweet and speaks about how to be adult and live life to the fullest. The Little Prince movie is a great way to bond with someone and become closer to them.
 use crate::{
    api_utils::ApiUtils,
    constants::{
        BASE_URL,
        CHANNEL_NAME
max seq length 2048
Estimated TFLOPs: 154.49
Measured TFLOPs: 60.58
Traceback (most recent call last):
  File "finetune/lora.py", line 331, in <module>
    CLI(setup)
  File "/root/miniconda3/lib/python3.8/site-packages/jsonargparse/_cli.py", line 96, in CLI
    return _run_component(components, cfg_init)
  File "/root/miniconda3/lib/python3.8/site-packages/jsonargparse/_cli.py", line 181, in _run_component
    return component(**cfg)
  File "finetune/lora.py", line 92, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir, quantize)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 834, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 920, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 925, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "finetune/lora.py", line 147, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
  File "finetune/lora.py", line 207, in train
    logits = model(input_ids, lm_head_chunk_size=128)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/wrappers.py", line 121, in forward
    output = self._forward_module(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/autodl-tmp/lit-gpt/lit_gpt/lora.py", line 498, in forward
    x = block(x, cos, sin, mask, input_pos)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/autodl-tmp/lit-gpt/lit_gpt/model.py", line 154, in forward
    h = self.attn(n_1, cos, sin, mask, input_pos)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/autodl-tmp/lit-gpt/lit_gpt/model.py", line 213, in forward
    q_roped = apply_rope(q[..., : self.config.rope_n_elem], cos, sin)
  File "/root/autodl-tmp/lit-gpt/lit_gpt/model.py", line 338, in apply_rope
    roped = (x * cos) + (rotated * sin)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacty of 23.65 GiB of which 2.56 MiB is free. Process 374222 has 23.64 GiB memory in use. Of the allocated memory 22.95 GiB is allocated by PyTorch, and 237.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
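
The allocator hint at the end of both tracebacks ("If reserved but unallocated memory is large try setting max_split_size_mb") can be acted on through the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch (the value 128 is illustrative, and this only helps with fragmentation, not when the model genuinely does not fit):

# Must be set before the first CUDA allocation; the shell equivalent is:
#   PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python finetune/lora.py ...
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the env var so the caching allocator picks it up
print(torch.cuda.is_available())
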
goog commented 9 months ago

How do I enable Fabric debugging? Is there anything I can help with? On an A100 40 GB it works OK.

khizarhussain19 commented 9 months ago

Hi, any updates on this?

carmocca commented 8 months ago

I tried running finetune/lora.py after applying these changes

--- a/finetune/lora.py
+++ b/finetune/lora.py
@@ -28,9 +28,9 @@ from lit_gpt.utils import (
 )
 from scripts.prepare_alpaca import generate_prompt

-eval_interval = 100
-save_interval = 100
-eval_iters = 100
+eval_interval = 10000
+save_interval = 10000
+eval_iters = 1
 eval_max_new_tokens = 100
 log_interval = 1
 devices = 1
@@ -39,11 +39,11 @@ override_max_seq_length = None

 # Hyperparameters
 learning_rate = 3e-4
-batch_size = 128
-micro_batch_size = 4
+batch_size = 16
+micro_batch_size = 1
 gradient_accumulation_iters = batch_size // micro_batch_size
 assert gradient_accumulation_iters > 0
-max_iters = 50000  # train dataset size
+max_iters = 200  # train dataset size
 weight_decay = 0.01
 lora_r = 8
 lora_alpha = 16
@@ -170,6 +164,10 @@ def train(
 ) -> None:
     tokenizer = Tokenizer(checkpoint_dir)
     max_seq_length, longest_seq_length, longest_seq_ix = get_max_seq_length(train_data)
+    fabric.print(
+        f"The longest sequence length in the train data is {longest_seq_length}, the model's maximum sequence length is"
+        f" {max_seq_length}"
+    )
     model.max_seq_length = max_seq_length

     validate(fabric, model, val_data, tokenizer, longest_seq_length)  # sanity check

And got (on an A100 40GB)

$ python3 finetune/lora.py --precision "bf16-true" --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf
{'eval_interval': 10000, 'save_interval': 10000, 'eval_iters': 1, 'eval_max_new_tokens': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.0003, 'batch_size': 16, 'micro_batch_size': 1, 'gradient_accumulation_iters': 16, 'max_iters': 200, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 11008, 'rope_condense_ratio': 1, 'rope_base': 10000, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False, 'head_size': 128, 'rope_n_elem': 128}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
Global seed set to 1337
The longest sequence length in the train data is 1304, the model's maximum sequence length is 1304
...
iter 0 step 0: loss 1.2245, iter time: 1360.27ms
iter 1 step 0: loss 2.6754, iter time: 158.58ms
iter 2 step 0: loss 1.9239, iter time: 127.87ms
iter 3 step 0: loss 2.6522, iter time: 125.67ms
iter 4 step 0: loss 2.5902, iter time: 131.11ms
iter 5 step 0: loss 2.4788, iter time: 128.62ms
iter 6 step 0: loss 1.6953, iter time: 125.60ms
iter 7 step 0: loss 2.1224, iter time: 149.03ms
iter 8 step 0: loss 2.3261, iter time: 148.92ms
iter 9 step 0: loss 2.3034, iter time: 125.98ms
iter 10 step 0: loss 1.6996, iter time: 125.11ms
iter 11 step 0: loss 1.9687, iter time: 130.14ms
iter 12 step 0: loss 1.8300, iter time: 126.29ms
iter 13 step 0: loss 1.7750, iter time: 148.74ms
iter 14 step 0: loss 2.5138, iter time: 125.65ms
iter 15 step 1: loss 2.0908, iter time: 259.94ms (optimizer.step)
iter 16 step 1: loss 1.7922, iter time: 130.80ms
iter 17 step 1: loss 1.8289, iter time: 130.33ms
iter 18 step 1: loss 1.8397, iter time: 1107.58ms
iter 19 step 1: loss 2.4835, iter time: 126.96ms
iter 20 step 1: loss 1.7060, iter time: 126.78ms
iter 21 step 1: loss 2.1171, iter time: 126.93ms
iter 22 step 1: loss 2.3113, iter time: 149.04ms
iter 23 step 1: loss 2.0532, iter time: 127.43ms
iter 24 step 1: loss 1.4194, iter time: 125.51ms
iter 25 step 1: loss 1.9506, iter time: 128.39ms
iter 26 step 1: loss 0.9942, iter time: 119.91ms
iter 27 step 1: loss 0.9594, iter time: 148.44ms
iter 28 step 1: loss 2.0504, iter time: 125.90ms
iter 29 step 1: loss 2.1218, iter time: 119.85ms
iter 30 step 1: loss 1.3757, iter time: 126.49ms
iter 31 step 2: loss 1.5776, iter time: 129.43ms (optimizer.step)
...
iter 175 step 11: loss 2.2485, iter time: 147.18ms (optimizer.step)
iter 176 step 11: loss 1.7425, iter time: 120.94ms
iter 177 step 11: loss 2.2430, iter time: 121.42ms
iter 178 step 11: loss 1.9056, iter time: 119.95ms
iter 179 step 11: loss 1.8311, iter time: 151.10ms
iter 180 step 11: loss 2.6305, iter time: 121.09ms
iter 181 step 11: loss 2.5403, iter time: 124.34ms
iter 182 step 11: loss 0.4891, iter time: 129.59ms
iter 183 step 11: loss 1.8725, iter time: 122.12ms
iter 184 step 11: loss 2.0439, iter time: 144.15ms
iter 185 step 11: loss 2.0594, iter time: 120.21ms
iter 186 step 11: loss 2.4870, iter time: 120.96ms
iter 187 step 11: loss 2.8875, iter time: 121.34ms
iter 188 step 11: loss 1.1589, iter time: 145.00ms
iter 189 step 11: loss 2.0229, iter time: 120.92ms
iter 190 step 11: loss 1.4625, iter time: 127.12ms
iter 191 step 12: loss 2.3915, iter time: 130.17ms (optimizer.step)
iter 192 step 12: loss 1.5822, iter time: 127.88ms
iter 193 step 12: loss 2.0108, iter time: 144.54ms
iter 194 step 12: loss 2.2593, iter time: 125.53ms
iter 195 step 12: loss 1.7429, iter time: 121.84ms
iter 196 step 12: loss 2.2730, iter time: 122.11ms
iter 197 step 12: loss 2.1806, iter time: 144.96ms
iter 198 step 12: loss 2.5569, iter time: 126.54ms
iter 199 step 12: loss 1.6543, iter time: 126.85ms
Training time: 37.78s
Memory used: 21.30 GB
Saving LoRA weights to 'out/lora/alpaca/lit_model_lora_finetuned.pth'
$ python3 finetune/lora.py --precision "bf16-true" --quantize "bnb.nf4" --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf
{'eval_interval': 10000, 'save_interval': 10000, 'eval_iters': 1, 'eval_max_new_tokens': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.0003, 'batch_size': 16, 'micro_batch_size': 1, 'gradient_accumulation_iters': 16, 'max_iters': 200, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 11008, 'rope_condense_ratio': 1, 'rope_base': 10000, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False, 'head_size': 128, 'rope_n_elem': 128}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
/home/carlos/lightning/src/lightning/fabric/fabric.py:943: PossibleUserWarning: The model passed to `Fabric.setup()` has 66 parameters on different devices (for example 'transformer.wte.weight' on cuda:0 and 'lm_head.linear.weight' on cpu). Since `move_to_device=True`, all parameters will be moved to the new device. If this is not desired, set `Fabric.setup(..., move_to_device=False)`.
  rank_zero_warn(
Global seed set to 1337
The longest sequence length in the train data is 1304, the model's maximum sequence length is 1304
...
iter 0 step 0: loss 1.2525, iter time: 1669.85ms
iter 1 step 0: loss 2.8233, iter time: 193.69ms
iter 2 step 0: loss 1.9801, iter time: 176.94ms
iter 3 step 0: loss 2.7705, iter time: 165.77ms
iter 4 step 0: loss 2.7712, iter time: 187.49ms
iter 5 step 0: loss 2.5577, iter time: 168.91ms
iter 6 step 0: loss 1.7638, iter time: 174.00ms
iter 7 step 0: loss 2.1898, iter time: 180.07ms
iter 8 step 0: loss 2.3975, iter time: 185.23ms
iter 9 step 0: loss 2.3578, iter time: 1912.99ms
iter 10 step 0: loss 1.7651, iter time: 192.20ms
iter 11 step 0: loss 2.0168, iter time: 170.73ms
iter 12 step 0: loss 1.9042, iter time: 170.47ms
iter 13 step 0: loss 1.8338, iter time: 174.77ms
iter 14 step 0: loss 2.5964, iter time: 169.38ms
iter 15 step 1: loss 2.1718, iter time: 207.24ms (optimizer.step)
iter 16 step 1: loss 1.8338, iter time: 172.63ms
iter 17 step 1: loss 1.8683, iter time: 174.90ms
iter 18 step 1: loss 1.9124, iter time: 170.26ms
iter 19 step 1: loss 2.5362, iter time: 169.91ms
iter 20 step 1: loss 1.7660, iter time: 176.63ms
iter 21 step 1: loss 2.2030, iter time: 170.23ms
iter 22 step 1: loss 2.3833, iter time: 169.41ms
iter 23 step 1: loss 2.1399, iter time: 173.93ms
iter 24 step 1: loss 1.4729, iter time: 175.55ms
iter 25 step 1: loss 2.0184, iter time: 170.25ms
iter 26 step 1: loss 1.0581, iter time: 170.35ms
iter 27 step 1: loss 0.9963, iter time: 216.12ms
iter 28 step 1: loss 2.1273, iter time: 171.12ms
iter 29 step 1: loss 2.2008, iter time: 170.71ms
iter 30 step 1: loss 1.4045, iter time: 175.52ms
iter 31 step 2: loss 1.6839, iter time: 204.31ms (optimizer.step)
....
iter 175 step 11: loss 2.3226, iter time: 196.31ms (optimizer.step)
iter 176 step 11: loss 1.7935, iter time: 170.30ms
iter 177 step 11: loss 2.3546, iter time: 164.15ms
iter 178 step 11: loss 1.9425, iter time: 172.43ms
iter 179 step 11: loss 1.8658, iter time: 176.41ms
iter 180 step 11: loss 2.6817, iter time: 164.22ms
iter 181 step 11: loss 2.6702, iter time: 163.33ms
iter 182 step 11: loss 0.5291, iter time: 214.21ms
iter 183 step 11: loss 1.9405, iter time: 165.76ms
iter 184 step 11: loss 2.0919, iter time: 170.25ms
iter 185 step 11: loss 2.0953, iter time: 170.95ms
iter 186 step 11: loss 2.5554, iter time: 164.11ms
iter 187 step 11: loss 3.0476, iter time: 164.04ms
iter 188 step 11: loss 1.2179, iter time: 173.06ms
iter 189 step 11: loss 2.1183, iter time: 163.83ms
iter 190 step 11: loss 1.5120, iter time: 174.28ms
iter 191 step 12: loss 2.4990, iter time: 200.10ms (optimizer.step)
iter 192 step 12: loss 1.6157, iter time: 174.68ms
iter 193 step 12: loss 2.0580, iter time: 171.17ms
iter 194 step 12: loss 2.3686, iter time: 165.38ms
iter 195 step 12: loss 1.8248, iter time: 169.93ms
iter 196 step 12: loss 2.3266, iter time: 165.40ms
iter 197 step 12: loss 2.2480, iter time: 165.07ms
iter 198 step 12: loss 2.7058, iter time: 164.83ms
iter 199 step 12: loss 1.6987, iter time: 176.20ms
Training time: 53.85s
Memory used: 14.14 GB
Saving LoRA weights to 'out/lora/alpaca/lit_model_lora_finetuned.pth'
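
Since max_seq_length ends up tracking the longest sequence in the training data (1304 tokens for Alpaca here, typically much longer for LIMA), it is worth checking what the prepared dataset actually contains. A minimal sketch, assuming the data was prepared by a scripts/prepare_*.py script that saves a list of dicts holding an input_ids tensor (the path is illustrative; point it at your --data_dir):

import torch

train_data = torch.load("data/alpaca/train.pt")  # illustrative path

lengths = [len(sample["input_ids"]) for sample in train_data]
print(f"samples: {len(lengths)}")
print(f"longest sequence: {max(lengths)} tokens")
print(f"over 2048 tokens: {sum(length > 2048 for length in lengths)}")
# If a few very long samples dominate, capping the sequence length or filtering
# them out brings activation memory down accordingly.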
khizarhussain19 commented 8 months ago

> I tried running finetune/lora.py after applying these changes [...] And got (on an A100 40GB) [...] Memory used: 21.30 GB [...] Memory used: 14.14 GB

Okay, but what about the RTX 3090 24 GB? It's not working there even with these changes.

carmocca commented 7 months ago

For memory issues, please refer to the OOM guide: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/oom.md