Closed khizarhussain19 closed 7 months ago
This would be a bug as it doesn't match what was reported in https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/finetune_lora.md#running-the-finetuning
I have the same problem when running on the LIMA dataset from the blog.
test command
python finetune/lora.py --checkpoint_dir checkpoints/NousResearch/Llama-2-7b-hf --data_dir data/lima --precision bf16-true
--- a/finetune/lora.py
+++ b/finetune/lora.py
@@ -35,15 +35,15 @@ eval_max_new_tokens = 100
log_interval = 1
devices = 1
# change this value to force a maximum sequence length
-override_max_seq_length = None
+override_max_seq_length = 4096
# Hyperparameters
learning_rate = 3e-4
-batch_size = 128
-micro_batch_size = 4
+batch_size = 4
+micro_batch_size = 1
gradient_accumulation_iters = batch_size // micro_batch_size
assert gradient_accumulation_iters > 0
-max_iters = 50000 # train dataset size
+max_iters = 1000 # train dataset size
weight_decay = 0.01
lora_r = 8
lora_alpha = 16
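To make the effect of the batch-size change concrete, here is a minimal sketch of the accumulation arithmetic using the values from the diff above. The helper `optimizer_step_count` is hypothetical (it is not part of lit-gpt) and assumes the training loop calls the optimizer once every `gradient_accumulation_iters` micro-batches.

```python
# Values copied from the diff above.
batch_size = 4
micro_batch_size = 1
gradient_accumulation_iters = batch_size // micro_batch_size
assert gradient_accumulation_iters > 0


def optimizer_step_count(iter_num: int, accumulation_iters: int) -> int:
    """Hypothetical helper: optimizer steps completed once the 0-based
    iteration `iter_num` has finished, assuming one step per
    `accumulation_iters` micro-batches."""
    return (iter_num + 1) // accumulation_iters


if __name__ == "__main__":
    print("micro-batches per optimizer step:", gradient_accumulation_iters)
    for it in range(8):
        steps = optimizer_step_count(it, gradient_accumulation_iters)
        print(f"iter {it} -> step {steps}")
```

Lowering `micro_batch_size` reduces how many samples are held on the GPU at once, while `batch_size` only controls how many micro-batches are accumulated before each weight update.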
This would be a bug as it doesn't match what was reported in https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/finetune_lora.md#running-the-finetuning
I think so too. I have been running a lot of finetuning on RTX 3090 with lit-llama on 7b and it has been working fine. Any updates on this?
Here is my 4090 log. The checkpoint is 26G (Sep 17 15:52, lit_model.pth).
root@autodl-container-90e311ae3c-aec8320b:~/autodl-tmp/lit-gpt# python finetune/lora.py --checkpoint_dir checkpoints/NousResearch/Llama-2-7b-hf --data_dir data/lima --precision bf16-true
device 1
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'eval_max_new_tokens': 100, 'log_interval': 1, 'devices': 1, 'override_max_seq_length': 2048, 'learning_rate': 0.0003, 'batch_size': 2, 'micro_batch_size': 1, 'gradient_accumulation_iters': 2, 'max_iters': 900, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
before launch
Global seed set to 1337
Loading model 'checkpoints/NousResearch/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'rope_condense_ratio': 1, 'rope_base': 10000, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False, 'head_size': 128, 'rope_n_elem': 128}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
Global seed set to 1337
Validating ...
Recommend a movie for me to watch during the weekend and explain the reason.
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.
### Response:
I recommend The Little Prince because it is an animated movie based on one of the most famous novels in history. The movie is very sweet and speaks about how to be adult and live life to the fullest. The Little Prince movie is a great way to bond with someone and become closer to them.
use crate::{
api_utils::ApiUtils,
constants::{
BASE_URL,
CHANNEL_NAME
max seq length 2048
Estimated TFLOPs: 154.49
Measured TFLOPs: 60.58
Traceback (most recent call last):
File "finetune/lora.py", line 331, in <module>
CLI(setup)
File "/root/miniconda3/lib/python3.8/site-packages/jsonargparse/_cli.py", line 96, in CLI
return _run_component(components, cfg_init)
File "/root/miniconda3/lib/python3.8/site-packages/jsonargparse/_cli.py", line 181, in _run_component
return component(**cfg)
File "finetune/lora.py", line 92, in setup
fabric.launch(main, data_dir, checkpoint_dir, out_dir, quantize)
File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 834, in launch
return self._wrap_and_launch(function, self, *args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 920, in _wrap_and_launch
return to_run(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 925, in _wrap_with_setup
return to_run(*args, **kwargs)
File "finetune/lora.py", line 147, in main
train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
File "finetune/lora.py", line 207, in train
logits = model(input_ids, lm_head_chunk_size=128)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/wrappers.py", line 121, in forward
output = self._forward_module(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/root/autodl-tmp/lit-gpt/lit_gpt/lora.py", line 498, in forward
x = block(x, cos, sin, mask, input_pos)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/root/autodl-tmp/lit-gpt/lit_gpt/model.py", line 154, in forward
h = self.attn(n_1, cos, sin, mask, input_pos)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, **kwargs)
File "/root/autodl-tmp/lit-gpt/lit_gpt/model.py", line 213, in forward
q_roped = apply_rope(q[..., : self.config.rope_n_elem], cos, sin)
File "/root/autodl-tmp/lit-gpt/lit_gpt/model.py", line 338, in apply_rope
roped = (x * cos) + (rotated * sin)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacty of 23.65 GiB of which 2.56 MiB is free. Process 374222 has 23.64 GiB memory in use. Of the allocated memory 22.95 GiB is allocated by PyTorch, and 237.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
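As a rough sanity check on the numbers in this log, the sketch below estimates how much of the 23.65 GiB card the weights alone would occupy. The parameter counts and GPU capacity are copied from the output above; the 2 bytes per parameter is an assumption that every weight is held in bfloat16 under `--precision bf16-true`. Activations, gradients, optimizer state for the LoRA parameters, and the CUDA context are all ignored, so this is only a lower bound.

```python
# Figures copied from the log above.
TRAINABLE_PARAMS = 4_194_304
NON_TRAINABLE_PARAMS = 6_738_415_616
GPU_TOTAL_GIB = 23.65

BYTES_PER_BF16 = 2  # assumption: all weights stored as bfloat16
GIB = 1024 ** 3


def weights_gib(n_params: int, bytes_per_param: int) -> float:
    """Memory taken by the raw weights, in GiB."""
    return n_params * bytes_per_param / GIB


bf16_weights_gib = weights_gib(TRAINABLE_PARAMS + NON_TRAINABLE_PARAMS, BYTES_PER_BF16)
headroom_gib = GPU_TOTAL_GIB - bf16_weights_gib

if __name__ == "__main__":
    print(f"bf16 weights: {bf16_weights_gib:.2f} GiB")
    print(f"left for activations and everything else: {headroom_gib:.2f} GiB")
```

Under these assumptions roughly half of the card is taken by the weights before any sequence is processed, which is why the sequence length and micro-batch size matter so much on a 24 GB GPU.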
How do I enable Fabric debugging? Is there something I can help with? It runs OK on an A100 40GB.
Hi, any updates on this?
I tried running finetune/lora.py after applying these changes:
--- a/finetune/lora.py
+++ b/finetune/lora.py
@@ -28,9 +28,9 @@ from lit_gpt.utils import (
)
from scripts.prepare_alpaca import generate_prompt
-eval_interval = 100
-save_interval = 100
-eval_iters = 100
+eval_interval = 10000
+save_interval = 10000
+eval_iters = 1
eval_max_new_tokens = 100
log_interval = 1
devices = 1
@@ -39,11 +39,11 @@ override_max_seq_length = None
# Hyperparameters
learning_rate = 3e-4
-batch_size = 128
-micro_batch_size = 4
+batch_size = 16
+micro_batch_size = 1
gradient_accumulation_iters = batch_size // micro_batch_size
assert gradient_accumulation_iters > 0
-max_iters = 50000 # train dataset size
+max_iters = 200 # train dataset size
weight_decay = 0.01
lora_r = 8
lora_alpha = 16
@@ -170,6 +164,10 @@ def train(
) -> None:
tokenizer = Tokenizer(checkpoint_dir)
max_seq_length, longest_seq_length, longest_seq_ix = get_max_seq_length(train_data)
+ fabric.print(
+ f"The longest sequence length in the train data is {longest_seq_length}, the model's maximum sequence length is"
+ f" {max_seq_length}"
+ )
model.max_seq_length = max_seq_length
validate(fabric, model, val_data, tokenizer, longest_seq_length) # sanity check
And got (on an A100 40GB)
$ python3 finetune/lora.py --precision "bf16-true" --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf
{'eval_interval': 10000, 'save_interval': 10000, 'eval_iters': 1, 'eval_max_new_tokens': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.0003, 'batch_size': 16, 'micro_batch_size': 1, 'gradient_accumulation_iters': 16, 'max_iters': 200, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 11008, 'rope_condense_ratio': 1, 'rope_base': 10000, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False, 'head_size': 128, 'rope_n_elem': 128}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
Global seed set to 1337
The longest sequence length in the train data is 1304, the model's maximum sequence length is 1304
...
iter 0 step 0: loss 1.2245, iter time: 1360.27ms
iter 1 step 0: loss 2.6754, iter time: 158.58ms
iter 2 step 0: loss 1.9239, iter time: 127.87ms
iter 3 step 0: loss 2.6522, iter time: 125.67ms
iter 4 step 0: loss 2.5902, iter time: 131.11ms
iter 5 step 0: loss 2.4788, iter time: 128.62ms
iter 6 step 0: loss 1.6953, iter time: 125.60ms
iter 7 step 0: loss 2.1224, iter time: 149.03ms
iter 8 step 0: loss 2.3261, iter time: 148.92ms
iter 9 step 0: loss 2.3034, iter time: 125.98ms
iter 10 step 0: loss 1.6996, iter time: 125.11ms
iter 11 step 0: loss 1.9687, iter time: 130.14ms
iter 12 step 0: loss 1.8300, iter time: 126.29ms
iter 13 step 0: loss 1.7750, iter time: 148.74ms
iter 14 step 0: loss 2.5138, iter time: 125.65ms
iter 15 step 1: loss 2.0908, iter time: 259.94ms (optimizer.step)
iter 16 step 1: loss 1.7922, iter time: 130.80ms
iter 17 step 1: loss 1.8289, iter time: 130.33ms
iter 18 step 1: loss 1.8397, iter time: 1107.58ms
iter 19 step 1: loss 2.4835, iter time: 126.96ms
iter 20 step 1: loss 1.7060, iter time: 126.78ms
iter 21 step 1: loss 2.1171, iter time: 126.93ms
iter 22 step 1: loss 2.3113, iter time: 149.04ms
iter 23 step 1: loss 2.0532, iter time: 127.43ms
iter 24 step 1: loss 1.4194, iter time: 125.51ms
iter 25 step 1: loss 1.9506, iter time: 128.39ms
iter 26 step 1: loss 0.9942, iter time: 119.91ms
iter 27 step 1: loss 0.9594, iter time: 148.44ms
iter 28 step 1: loss 2.0504, iter time: 125.90ms
iter 29 step 1: loss 2.1218, iter time: 119.85ms
iter 30 step 1: loss 1.3757, iter time: 126.49ms
iter 31 step 2: loss 1.5776, iter time: 129.43ms (optimizer.step)
...
iter 175 step 11: loss 2.2485, iter time: 147.18ms (optimizer.step)
iter 176 step 11: loss 1.7425, iter time: 120.94ms
iter 177 step 11: loss 2.2430, iter time: 121.42ms
iter 178 step 11: loss 1.9056, iter time: 119.95ms
iter 179 step 11: loss 1.8311, iter time: 151.10ms
iter 180 step 11: loss 2.6305, iter time: 121.09ms
iter 181 step 11: loss 2.5403, iter time: 124.34ms
iter 182 step 11: loss 0.4891, iter time: 129.59ms
iter 183 step 11: loss 1.8725, iter time: 122.12ms
iter 184 step 11: loss 2.0439, iter time: 144.15ms
iter 185 step 11: loss 2.0594, iter time: 120.21ms
iter 186 step 11: loss 2.4870, iter time: 120.96ms
iter 187 step 11: loss 2.8875, iter time: 121.34ms
iter 188 step 11: loss 1.1589, iter time: 145.00ms
iter 189 step 11: loss 2.0229, iter time: 120.92ms
iter 190 step 11: loss 1.4625, iter time: 127.12ms
iter 191 step 12: loss 2.3915, iter time: 130.17ms (optimizer.step)
iter 192 step 12: loss 1.5822, iter time: 127.88ms
iter 193 step 12: loss 2.0108, iter time: 144.54ms
iter 194 step 12: loss 2.2593, iter time: 125.53ms
iter 195 step 12: loss 1.7429, iter time: 121.84ms
iter 196 step 12: loss 2.2730, iter time: 122.11ms
iter 197 step 12: loss 2.1806, iter time: 144.96ms
iter 198 step 12: loss 2.5569, iter time: 126.54ms
iter 199 step 12: loss 1.6543, iter time: 126.85ms
Training time: 37.78s
Memory used: 21.30 GB
Saving LoRA weights to 'out/lora/alpaca/lit_model_lora_finetuned.pth'
$ python3 finetune/lora.py --precision "bf16-true" --quantize "bnb.nf4" --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf
{'eval_interval': 10000, 'save_interval': 10000, 'eval_iters': 1, 'eval_max_new_tokens': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.0003, 'batch_size': 16, 'micro_batch_size': 1, 'gradient_accumulation_iters': 16, 'max_iters': 200, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 11008, 'rope_condense_ratio': 1, 'rope_base': 10000, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False, 'head_size': 128, 'rope_n_elem': 128}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
/home/carlos/lightning/src/lightning/fabric/fabric.py:943: PossibleUserWarning: The model passed to `Fabric.setup()` has 66 parameters on different devices (for example 'transformer.wte.weight' on cuda:0 and 'lm_head.linear.weight' on cpu). Since `move_to_device=True`, all parameters will be moved to the new device. If this is not desired, set `Fabric.setup(..., move_to_device=False)`.
rank_zero_warn(
Global seed set to 1337
The longest sequence length in the train data is 1304, the model's maximum sequence length is 1304
...
iter 0 step 0: loss 1.2525, iter time: 1669.85ms
iter 1 step 0: loss 2.8233, iter time: 193.69ms
iter 2 step 0: loss 1.9801, iter time: 176.94ms
iter 3 step 0: loss 2.7705, iter time: 165.77ms
iter 4 step 0: loss 2.7712, iter time: 187.49ms
iter 5 step 0: loss 2.5577, iter time: 168.91ms
iter 6 step 0: loss 1.7638, iter time: 174.00ms
iter 7 step 0: loss 2.1898, iter time: 180.07ms
iter 8 step 0: loss 2.3975, iter time: 185.23ms
iter 9 step 0: loss 2.3578, iter time: 1912.99ms
iter 10 step 0: loss 1.7651, iter time: 192.20ms
iter 11 step 0: loss 2.0168, iter time: 170.73ms
iter 12 step 0: loss 1.9042, iter time: 170.47ms
iter 13 step 0: loss 1.8338, iter time: 174.77ms
iter 14 step 0: loss 2.5964, iter time: 169.38ms
iter 15 step 1: loss 2.1718, iter time: 207.24ms (optimizer.step)
iter 16 step 1: loss 1.8338, iter time: 172.63ms
iter 17 step 1: loss 1.8683, iter time: 174.90ms
iter 18 step 1: loss 1.9124, iter time: 170.26ms
iter 19 step 1: loss 2.5362, iter time: 169.91ms
iter 20 step 1: loss 1.7660, iter time: 176.63ms
iter 21 step 1: loss 2.2030, iter time: 170.23ms
iter 22 step 1: loss 2.3833, iter time: 169.41ms
iter 23 step 1: loss 2.1399, iter time: 173.93ms
iter 24 step 1: loss 1.4729, iter time: 175.55ms
iter 25 step 1: loss 2.0184, iter time: 170.25ms
iter 26 step 1: loss 1.0581, iter time: 170.35ms
iter 27 step 1: loss 0.9963, iter time: 216.12ms
iter 28 step 1: loss 2.1273, iter time: 171.12ms
iter 29 step 1: loss 2.2008, iter time: 170.71ms
iter 30 step 1: loss 1.4045, iter time: 175.52ms
iter 31 step 2: loss 1.6839, iter time: 204.31ms (optimizer.step)
....
iter 175 step 11: loss 2.3226, iter time: 196.31ms (optimizer.step)
iter 176 step 11: loss 1.7935, iter time: 170.30ms
iter 177 step 11: loss 2.3546, iter time: 164.15ms
iter 178 step 11: loss 1.9425, iter time: 172.43ms
iter 179 step 11: loss 1.8658, iter time: 176.41ms
iter 180 step 11: loss 2.6817, iter time: 164.22ms
iter 181 step 11: loss 2.6702, iter time: 163.33ms
iter 182 step 11: loss 0.5291, iter time: 214.21ms
iter 183 step 11: loss 1.9405, iter time: 165.76ms
iter 184 step 11: loss 2.0919, iter time: 170.25ms
iter 185 step 11: loss 2.0953, iter time: 170.95ms
iter 186 step 11: loss 2.5554, iter time: 164.11ms
iter 187 step 11: loss 3.0476, iter time: 164.04ms
iter 188 step 11: loss 1.2179, iter time: 173.06ms
iter 189 step 11: loss 2.1183, iter time: 163.83ms
iter 190 step 11: loss 1.5120, iter time: 174.28ms
iter 191 step 12: loss 2.4990, iter time: 200.10ms (optimizer.step)
iter 192 step 12: loss 1.6157, iter time: 174.68ms
iter 193 step 12: loss 2.0580, iter time: 171.17ms
iter 194 step 12: loss 2.3686, iter time: 165.38ms
iter 195 step 12: loss 1.8248, iter time: 169.93ms
iter 196 step 12: loss 2.3266, iter time: 165.40ms
iter 197 step 12: loss 2.2480, iter time: 165.07ms
iter 198 step 12: loss 2.7058, iter time: 164.83ms
iter 199 step 12: loss 1.6987, iter time: 176.20ms
Training time: 53.85s
Memory used: 14.14 GB
Saving LoRA weights to 'out/lora/alpaca/lit_model_lora_finetuned.pth'
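To summarize the two runs above, here is a small sketch that compares the reported training time and memory. The four numbers are copied from the logs; the dictionary layout and variable names are only illustrative.

```python
# Figures copied from the two logs above (A100 40GB, 200 iterations each).
runs = {
    "bf16-true": {"time_s": 37.78, "memory_gb": 21.30},
    "bf16-true + bnb.nf4": {"time_s": 53.85, "memory_gb": 14.14},
}

base = runs["bf16-true"]
quantized = runs["bf16-true + bnb.nf4"]

# Absolute and relative memory saved by quantizing, and the time cost.
memory_saved_gb = base["memory_gb"] - quantized["memory_gb"]
memory_saved_pct = 100 * memory_saved_gb / base["memory_gb"]
slowdown = quantized["time_s"] / base["time_s"]

if __name__ == "__main__":
    print(f"memory saved: {memory_saved_gb:.2f} GB ({memory_saved_pct:.1f}%)")
    print(f"training time ratio (quantized / bf16): {slowdown:.2f}x")
```

In these two runs, quantization trades about a third of the memory for a longer training time; results on other GPUs or datasets may differ.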
Okay, but what about the RTX 3090 24 GB? It's not working there even with these changes.
For memory issues, please refer to the OOM guide: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/oom.md
Hi, I am getting OOM errors when I try to finetune Llama-2-7b-hf.
I have already tried all the solutions mentioned in the OOM guide:
I have tried all possible flags, i.e. -
I have tried with
batch_size: 16 and micro_batch_size: 1
I am using a single GPU, so I can't really use "sharding across multiple GPUs".
I have also tried the exact same configuration and code on an RTX 6000 Ada (48 GB VRAM), and even that kept running out of memory.
It would be very helpful if you could help me with this problem. Thank you.