Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

OOM with bf16-true + quantization for long context length #477

Open KOVVURISATYANARAYANAREDDY opened 9 months ago

KOVVURISATYANARAYANAREDDY commented 9 months ago

Hello.

I installed everything as described, from cloning the repo to installing all requirements, except flash-attention.

I have an A100 40GB machine with CUDA 11.0.

I downloaded the Llama-2-7B checkpoints and prepared them.

I also prepared the Alpaca dataset for Llama-2-7B using:

python scripts/prepare_alpaca.py 

Then I changed finetune/lora.py inside get_batch:

max_len = 4000 # max(len(s) for s in input_ids) if fabric.device.type != "xla" else longest_seq_length
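
Roughly, the effect of this change is to pad every sample in a micro-batch to a fixed 4000-token length instead of padding only to the longest sequence in the batch. A minimal, self-contained sketch of the idea (an illustration only, not the exact lit-gpt get_batch code):

import torch
from typing import List, Tuple

def pad_batch(input_ids: List[torch.Tensor], labels: List[torch.Tensor],
              max_len: int = 4000, ignore_index: int = -1) -> Tuple[torch.Tensor, torch.Tensor]:
    # Pad every sample on the right up to max_len so each batch has shape (B, 4000),
    # rather than padding only to the longest sequence in the batch.
    def pad_right(x: torch.Tensor, pad_id: int) -> torch.Tensor:
        return torch.cat((x, torch.full((max_len - len(x),), pad_id, dtype=x.dtype)))

    x = torch.stack([pad_right(ids, pad_id=0) for ids in input_ids])
    y = torch.stack([pad_right(lab, pad_id=ignore_index) for lab in labels])
    return x, y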

The command below runs successfully on a single GPU with a 4000-token context length:

python finetune/lora.py --precision 'bf16-true'

The command below, with quantization, gives an OOM error:

python finetune/lora.py --precision 'bf16-true' --quantize 'bnb.nf4-dq'

This is suspicious. If plain "bf16-true" runs fine, then "bf16-true" together with quantization should also work, and with less memory.
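
As a rough back-of-the-envelope check of the weight memory alone (ignoring activations, which likely dominate at a 4000-token context):

params = 6_738_415_616               # frozen Llama-2-7B parameters, as printed in the logs below
bf16_weights_gb = params * 2 / 1e9   # ~13.5 GB at 2 bytes per parameter
nf4_weights_gb = params * 0.5 / 1e9  # ~3.4 GB at 4 bits, plus small quantization constants
print(f"bf16: {bf16_weights_gb:.1f} GB, nf4: {nf4_weights_gb:.1f} GB")

So the quantized run should start from a much smaller weight footprint.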

Can someone suggest what went wrong? What am I missing?

carmocca commented 9 months ago

I agree that this shouldn't happen. Have you double-checked that you called the script as expected and that no other program was running on the GPU? Do you have logs that you can share?

KOVVURISATYANARAYANAREDDY commented 9 months ago

Hello @carmocca , Thanks for your response.

I checked that there is no other process running on the GPU, using watch nvidia-smi.

I set the model and dataset paths in the finetune/lora.py script itself:

def setup(
    data_dir: Path = Path("data/alpaca"),
    checkpoint_dir: Path = Path("checkpoints/meta-llama/Llama-2-7b-hf/"),
    out_dir: Path = Path("out/lora/alpaca"),
    precision: Optional[str] = None,
    tpu: bool = False,
    quantize: Optional[Literal["bnb.nf4", "bnb.nf4-dq", "bnb.fp4", "bnb.fp4-dq"]] = None,
):

Here are the logs for python finetune/lora.py --precision "bf16-true" --quantize "bnb.nf4-dq":

/home/jupyter/Satya/lit-gpt# python finetune/lora.py --precision "bf16-true" --quantize "bnb.nf4-dq"
/opt/conda/envs/litgpt/lib/python3.9/site-packages/pydantic/_migration.py:283: UserWarning: `pydantic.utils:Representation` has been removed. We are importing from `pydantic.v1.utils:Representation` instead.See the migration guide for more details: https://docs.pydantic.dev/latest/migration/
  warnings.warn(
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.0003, 'batch_size': 128, 'micro_batch_size': 1, 'gradient_accumulation_iters': 128, 'max_iters': 50000, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'condense_ratio': 1, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
/opt/conda/envs/litgpt/lib/python3.9/site-packages/lightning/fabric/fabric.py:943: PossibleUserWarning: The model passed to `Fabric.setup()` has 66 parameters on different devices (for example 'transformer.wte.weight' on cuda:0 and 'lm_head.linear.weight' on cpu). Since `move_to_device=True`, all parameters will be moved to the new device. If this is not desired, set `Fabric.setup(..., move_to_device=False)`.
  rank_zero_warn(
Global seed set to 1337
Validating ...
Recommend a movie for me to watch during the weekend and explain the reason.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:
Since you like to watch fantasy movies, I'm recommending the movie "The Lord Of The Rings". This movie is full of fantasy and adventure, and it has been a blockbuster in the fantasy genre. I'm sure you will enjoy the movie.

### Instruction:
Write a post on your social media account that shares the latest company news.

### Response:
A couple of days ago, Twitter has just
Estimated TFLOPs: 154.49
Measured TFLOPs: 37.01
inputs:  torch.Size([1, 4000])
iter 0 step 0: loss 0.4062, iter time: 1415.67ms
inputs:  torch.Size([1, 4000])
Traceback (most recent call last):
  File "/home/jupyter/Satya/lit-gpt/finetune/lora.py", line 341, in <module>
    CLI(setup)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/jsonargparse/_cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/jsonargparse/_cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/jupyter/Satya/lit-gpt/finetune/lora.py", line 95, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir, quantize)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 834, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 920, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 925, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/home/jupyter/Satya/lit-gpt/finetune/lora.py", line 150, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
  File "/home/jupyter/Satya/lit-gpt/finetune/lora.py", line 213, in train
    logits = model(input_ids, max_seq_length=max_seq_length, lm_head_chunk_size=128)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/lightning/fabric/wrappers.py", line 117, in forward
    output = self._forward_module(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/Satya/lit-gpt/lit_gpt/lora.py", line 525, in forward
    x, *_ = block(x, (cos, sin), max_seq_length)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/Satya/lit-gpt/lit_gpt/model.py", line 172, in forward
    x = x + self.mlp(self.norm_2(x))
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/Satya/lit-gpt/lit_gpt/model.py", line 293, in forward
    x_fc_2 = self.fc_2(x)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/Satya/lit-gpt/lit_gpt/lora.py", line 146, in forward
    pretrained = self.linear(x)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 248, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.59 GiB of which 23.19 MiB is free. Process 69409 has 39.56 GiB memory in use. Of the allocated memory 35.93 GiB is allocated by PyTorch, and 2.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
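
As an aside, the allocator hint at the end of that traceback can be tried by setting the environment variable when launching (untested here, and it may only help with fragmentation rather than a true out-of-memory condition):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python finetune/lora.py --precision "bf16-true" --quantize "bnb.nf4-dq"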

Now using only bf16-true: python finetune/lora.py --precision 'bf16-true'

It uses a peak memory of 38.8 GB on a single GPU.

/home/jupyter/Satya/lit-gpt# python finetune/lora.py --precision "bf16-true"
/opt/conda/envs/litgpt/lib/python3.9/site-packages/pydantic/_migration.py:283: UserWarning: `pydantic.utils:Representation` has been removed. We are importing from `pydantic.v1.utils:Representation` instead.See the migration guide for more details: https://docs.pydantic.dev/latest/migration/
  warnings.warn(
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.0003, 'batch_size': 128, 'micro_batch_size': 1, 'gradient_accumulation_iters': 128, 'max_iters': 50000, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'condense_ratio': 1, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
Global seed set to 1337
Validating ...
Recommend a movie for me to watch during the weekend and explain the reason.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:
I recommend The Little Prince because it is an animated movie based on one of the most famous novels in history. The movie is very sweet and speaks about how to be adult and live life to the fullest. The Little Prince movie is a great way to bond with someone and become closer to them.
 use Test::More;

use_ok( 'JSON::XS' );

ok( my $obj = JSON::XS->new->utf
Estimated TFLOPs: 154.49
Measured TFLOPs: 37.01
inputs:  torch.Size([1, 4000])
iter 0 step 0: loss 0.3955, iter time: 1097.59ms
inputs:  torch.Size([1, 4000])
iter 1 step 0: loss 0.0520, iter time: 763.49ms
inputs:  torch.Size([1, 4000])
iter 2 step 0: loss 0.0407, iter time: 762.60ms
inputs:  torch.Size([1, 4000])
iter 3 step 0: loss 0.0652, iter time: 762.58ms
inputs:  torch.Size([1, 4000])
iter 4 step 0: loss 0.0390, iter time: 762.72ms
inputs:  torch.Size([1, 4000])
iter 5 step 0: loss 0.0788, iter time: 762.45ms
inputs:  torch.Size([1, 4000])
iter 6 step 0: loss 0.0484, iter time: 762.30ms
inputs:  torch.Size([1, 4000])
iter 7 step 0: loss 0.0604, iter time: 762.83ms
inputs:  torch.Size([1, 4000])
iter 8 step 0: loss 0.0650, iter time: 762.41ms
inputs:  torch.Size([1, 4000])
iter 9 step 0: loss 0.0688, iter time: 762.82ms
inputs:  torch.Size([1, 4000])
iter 10 step 0: loss 0.1087, iter time: 762.61ms
inputs:  torch.Size([1, 4000])
iter 11 step 0: loss 0.1185, iter time: 762.47ms
inputs:  torch.Size([1, 4000])
iter 12 step 0: loss 0.0862, iter time: 762.33ms
inputs:  torch.Size([1, 4000])
iter 13 step 0: loss 0.0768, iter time: 762.39ms
inputs:  torch.Size([1, 4000])
iter 14 step 0: loss 0.0974, iter time: 762.38ms
inputs:  torch.Size([1, 4000])
iter 15 step 0: loss 0.0483, iter time: 762.35ms
inputs:  torch.Size([1, 4000])
iter 16 step 0: loss 0.0474, iter time: 762.50ms
inputs:  torch.Size([1, 4000])
iter 17 step 0: loss 0.0506, iter time: 763.22ms
inputs:  torch.Size([1, 4000])
iter 18 step 0: loss 0.0549, iter time: 762.22ms
inputs:  torch.Size([1, 4000])
iter 19 step 0: loss 0.1087, iter time: 762.53ms
inputs:  torch.Size([1, 4000])
iter 20 step 0: loss 0.0626, iter time: 762.96ms
inputs:  torch.Size([1, 4000])
iter 21 step 0: loss 0.0433, iter time: 762.52ms
inputs:  torch.Size([1, 4000])
iter 22 step 0: loss 0.0832, iter time: 762.49ms
inputs:  torch.Size([1, 4000])
iter 23 step 0: loss 0.0340, iter time: 762.33ms
inputs:  torch.Size([1, 4000])
iter 24 step 0: loss 0.0552, iter time: 762.40ms
inputs:  torch.Size([1, 4000])
iter 25 step 0: loss 0.0934, iter time: 762.39ms
inputs:  torch.Size([1, 4000])
iter 26 step 0: loss 0.0917, iter time: 762.40ms
inputs:  torch.Size([1, 4000])
iter 27 step 0: loss 0.0538, iter time: 762.46ms
inputs:  torch.Size([1, 4000])
iter 28 step 0: loss 0.0438, iter time: 762.31ms
inputs:  torch.Size([1, 4000])
iter 29 step 0: loss 0.0848, iter time: 762.58ms
inputs:  torch.Size([1, 4000])
iter 30 step 0: loss 0.0695, iter time: 762.44ms
inputs:  torch.Size([1, 4000])

All of the above was done on CUDA 11.0, without flash-attention.

Please suggest what I might be missing. Thank you.

carmocca commented 9 months ago

I don't have a solution for you right now. I would need to reproduce the issue. Have you seen this before @rasbt?

rasbt commented 9 months ago

Hm, I haven't had any issues with that recently. But there have been a couple of changes in the last few days.

I assume it's the same issue with "bnb.nf4" (instead of "bnb.nf4-dq")?

Another thing that comes to mind: maybe it's an older version of bitsandbytes? I remember there were some issues there that were fixed at some point.
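
A quick way to check which version is installed, in case it helps:

pip show bitsandbytes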

KOVVURISATYANARAYANAREDDY commented 9 months ago

Hello, @carmocca, @rasbt.

I have the latest version of bitsandbytes (0.41.1).

I am also getting OOM with:

python finetune/lora.py --precision "bf16-true" --quantize "bnb.nf4"

python finetune/lora.py --quantize "bnb.nf4-dq"

Thank you.

rasbt commented 9 months ago

@KOVVURISATYANARAYANAREDDY I just tried it and it works fine for me with Llama 2 7B:

(qlora) sebastian@hyperplane1:~/Developer/prs/debug/lit-gpt$ python finetune/lora.py --precision "bf16-true" --quantize "bnb.nf4" --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf/
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.0003, 'batch_size': 128, 'micro_batch_size': 1, 'gradient_accumulation_iters': 128, 'max_iters': 5, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'condense_ratio': 1, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
Global seed set to 1337
Validating ...
Recommend a movie for me to watch during the weekend and explain the reason.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:
Since you like to watch fantasy movies, I recommend you to watch the movie "The Lord Of The Rings". This movie is full of fantasy and adventure. It is based on the book written by the English author J. R. R. Tolkien and he wrote the book from his imagination. The movie includes many fantasy elements that will entertain you. The movie will take you to magical world and you will enjoy the movie sitting in the comfortable chair.

Estimated TFLOPs: 154.49
Measured TFLOPs: 37.01
iter 0 step 0: loss 1.2486, iter time: 1735.77ms
iter 1 step 0: loss 2.4488, iter time: 211.41ms
iter 2 step 0: loss 2.9167, iter time: 167.97ms
iter 3 step 0: loss 2.0262, iter time: 175.39ms
iter 4 step 0: loss 2.9811, iter time: 164.14ms
Training time: 21.19s
Memory used: 14.15 GB
Saving LoRA weights to 'out/lora/alpaca/lit_model_lora_finetuned.pth'

Have you double-checked that you lowered the micro_batch_size to 1 or 2?
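
For reference, the knob in question is a module-level setting near the top of finetune/lora.py; a sketch of the assumed relevant lines, with the values your logs print:

batch_size = 128
micro_batch_size = 1  # lowering this reduces peak activation memory per step
gradient_accumulation_iters = batch_size // micro_batch_size  # 128 with the values above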

KOVVURISATYANARAYANAREDDY commented 9 months ago

Hey @rasbt.

  1. I set micro_batch_size to 1.
  2. I don't have flash-attn installed.
  3. I set max_len = 4000 by changing finetune/lora.py inside get_batch:
    max_len = 4000 # max(len(s) for s in input_ids) if fabric.device.type != "xla" else longest_seq_length

With these settings I am getting the OOM.

Thank you.

rasbt commented 9 months ago

Which dataset are you using? I was using Alpaca, which has relatively short contexts.

KOVVURISATYANARAYANAREDDY commented 9 months ago

I am using the same Alpaca dataset, but I also have a custom dataset. I just wanted to test the Llama-2-7B model at its full context length, so in get_batch I pad the content until it reaches 4000 tokens. Thank you.

rasbt commented 9 months ago

Just for debugging purposes: what is your memory usage if you use the default lora.py script with micro_batch_size 1 on Alpaca, so that we can compare with my results above and see if something is off?

KOVVURISATYANARAYANAREDDY commented 9 months ago
(litgpt) root@b313af4d8107:/home/jupyter/Satya/lit-gpt# python finetune/lora.py  --precision "bf16-true" --quantize "bnb.nf4"
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.0003, 'batch_size': 128, 'micro_batch_size': 1, 'gradient_accumulation_iters': 128, 'max_iters': 10, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'condense_ratio': 1, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
/opt/conda/envs/litgpt/lib/python3.9/site-packages/lightning/fabric/fabric.py:945: PossibleUserWarning: The model passed to `Fabric.setup()` has 66 parameters on different devices (for example 'transformer.wte.weight' on cuda:0 and 'lm_head.linear.weight' on cpu). Since `move_to_device=True`, all parameters will be moved to the new device. If this is not desired, set `Fabric.setup(..., move_to_device=False)`.
  rank_zero_warn(
Global seed set to 1337
Validating ...
Recommend a movie for me to watch during the weekend and explain the reason.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:
Since you like to watch fantasy movies, I recommend you to watch the movie "The Lord Of The Rings". This movie is full of fantasy and adventure. It is based on the book written by the English author J. R. R. Tolkien and he wrote the book from his imagination. The movie includes many fantasy elements that will entertain you. The movie will take you to magical world and you will enjoy the movie sitting in the comfortable chair.

Estimated TFLOPs: 154.49
Measured TFLOPs: 37.01
iter 0 step 0: loss 1.2479, iter time: 550.33ms
iter 1 step 0: loss 2.4513, iter time: 285.48ms
iter 2 step 0: loss 2.9095, iter time: 176.75ms
iter 3 step 0: loss 2.0259, iter time: 181.92ms
iter 4 step 0: loss 2.9700, iter time: 173.13ms
iter 5 step 0: loss 2.2739, iter time: 181.60ms
iter 6 step 0: loss 2.5510, iter time: 173.85ms
iter 7 step 0: loss 2.2994, iter time: 416.43ms
iter 8 step 0: loss 2.6874, iter time: 174.99ms
iter 9 step 0: loss 2.4119, iter time: 176.81ms
Training time: 23.69s
Memory used: 14.15 GB
Saving LoRA weights to 'out/lora/alpaca/lit_model_lora_finetuned.pth'
(litgpt) root@b313af4d8107:/home/jupyter/Satya/lit-gpt# 

Even though it says Memory used: 14.15 GB, I see 16040 MiB of usage on the first GPU using watch nvidia-smi.
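
One way to compare the two numbers from inside the script (nvidia-smi also counts the CUDA context and memory cached by PyTorch's allocator, so it typically reads higher; a minimal sketch assuming a CUDA device):

import torch

peak_alloc = torch.cuda.max_memory_allocated() / 1e9     # peak memory handed out to tensors
peak_reserved = torch.cuda.max_memory_reserved() / 1e9   # peak memory held by the caching allocator
print(f"allocated: {peak_alloc:.2f} GB, reserved: {peak_reserved:.2f} GB")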

rasbt commented 9 months ago

Interesting, so the problem is the longer contexts then?

KOVVURISATYANARAYANAREDDY commented 9 months ago

I am not sure about the reason.

Could you change the line below and check?

In finetune/lora.py, in the function get_batch:

max_len = 4000 # max(len(s) for s in input_ids) if fabric.device.type != "xla" else longest_seq_length

and then run the same command:

python finetune/lora.py  --precision "bf16-true" --quantize "bnb.nf4"

(litgpt) root@b313af4d8107:/home/jupyter/Satya/lit-gpt# python finetune/lora.py  --precision "bf16-true" --quantize "bnb.nf4"
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.0003, 'batch_size': 128, 'micro_batch_size': 1, 'gradient_accumulation_iters': 128, 'max_iters': 10, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'condense_ratio': 1, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
/opt/conda/envs/litgpt/lib/python3.9/site-packages/lightning/fabric/fabric.py:945: PossibleUserWarning: The model passed to `Fabric.setup()` has 66 parameters on different devices (for example 'transformer.wte.weight' on cuda:0 and 'lm_head.linear.weight' on cpu). Since `move_to_device=True`, all parameters will be moved to the new device. If this is not desired, set `Fabric.setup(..., move_to_device=False)`.
  rank_zero_warn(
Global seed set to 1337
Validating ...
Recommend a movie for me to watch during the weekend and explain the reason.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:
Since you like to watch fantasy movies, I recommend you to watch the movie "The Lord Of The Rings". This movie is full of fantasy and adventure. It is based on the book written by the English author J. R. R. Tolkien and he wrote the book from his imagination. The movie includes many fantasy elements that will entertain you. The movie will take you to magical world and you will enjoy the movie sitting in the comfortable chair.

Estimated TFLOPs: 154.49
Measured TFLOPs: 37.01
iter 0 step 0: loss 0.4064, iter time: 1139.68ms
Traceback (most recent call last):
  File "/home/jupyter/Satya/lit-gpt/finetune/lora.py", line 341, in <module>
    CLI(setup)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/jsonargparse/_cli.py", line 96, in CLI
    return _run_component(components, cfg_init)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/jsonargparse/_cli.py", line 181, in _run_component
    return component(**cfg)
  File "/home/jupyter/Satya/lit-gpt/finetune/lora.py", line 95, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir, quantize)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 836, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 922, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 927, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/home/jupyter/Satya/lit-gpt/finetune/lora.py", line 150, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
  File "/home/jupyter/Satya/lit-gpt/finetune/lora.py", line 213, in train
    logits = model(input_ids, max_seq_length=max_seq_length, lm_head_chunk_size=128)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/lightning/fabric/wrappers.py", line 118, in forward
    output = self._forward_module(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/Satya/lit-gpt/lit_gpt/lora.py", line 525, in forward
    x, *_ = block(x, (cos, sin), max_seq_length)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/Satya/lit-gpt/lit_gpt/model.py", line 161, in forward
    h, new_kv_cache = self.attn(n_1, rope, max_seq_length, mask, input_pos, kv_cache)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/Satya/lit-gpt/lit_gpt/model.py", line 198, in forward
    qkv = self.attn(x)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/litgpt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/Satya/lit-gpt/lit_gpt/lora.py", line 385, in forward
    lora = self.zero_pad(after_B) * self.scaling  # (64, 64, 256) after zero_pad (64, 64, 384)
  File "/home/jupyter/Satya/lit-gpt/lit_gpt/lora.py", line 291, in zero_pad
    result = x.new_zeros((*x.shape[:-1], self.linear.out_features))  # (64, 64, 384)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 94.00 MiB. GPU 0 has a total capacty of 39.59 GiB of which 47.19 MiB is free. Process 24302 has 39.54 GiB memory in use. Of the allocated memory 35.81 GiB is allocated by PyTorch, and 2.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This gives OOM.

And here is the run with just:

python finetune/lora.py  --precision "bf16-true"

(litgpt) root@b313af4d8107:/home/jupyter/Satya/lit-gpt# python finetune/lora.py  --precision "bf16-true"
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 1, 'learning_rate': 0.0003, 'batch_size': 128, 'micro_batch_size': 1, 'gradient_accumulation_iters': 128, 'max_iters': 10, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Global seed set to 1337
Loading model 'checkpoints/meta-llama/Llama-2-7b-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-7b-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 11008, 'condense_ratio': 1, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}
Number of trainable parameters: 4,194,304
Number of non trainable parameters: 6,738,415,616
Global seed set to 1337
Validating ...
Recommend a movie for me to watch during the weekend and explain the reason.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:
I recommend The Little Prince because it is an animated movie based on one of the most famous novels in history. The movie is very sweet and speaks about how to be adult and live life to the fullest. The Little Prince movie is a great way to bond with someone and become closer to them.
 use Test::More;

use_ok( 'JSON::XS' );

ok( my $obj = JSON::XS->new->utf
Estimated TFLOPs: 154.49
Measured TFLOPs: 37.01
iter 0 step 0: loss 0.3955, iter time: 971.22ms
iter 1 step 0: loss 0.0520, iter time: 765.30ms
iter 2 step 0: loss 0.0407, iter time: 759.22ms
iter 3 step 0: loss 0.0652, iter time: 759.22ms
iter 4 step 0: loss 0.0390, iter time: 759.33ms
iter 5 step 0: loss 0.0788, iter time: 759.34ms
iter 6 step 0: loss 0.0484, iter time: 759.31ms
iter 7 step 0: loss 0.0604, iter time: 759.57ms
iter 8 step 0: loss 0.0650, iter time: 759.64ms
iter 9 step 0: loss 0.0688, iter time: 759.61ms
Training time: 58.42s
Memory used: 37.47 GB
Saving LoRA weights to 'out/lora/alpaca/lit_model_lora_finetuned.pth'

This runs fine.

That is what I find.

Maybe I am missing something; I am not sure.

rasbt commented 9 months ago

I see. Hm, that's interesting. So on small contexts, bnb.nf4 performs better, but on longer contexts bnb.nf4 performs worse?

KOVVURISATYANARAYANAREDDY commented 9 months ago

> I see. Hm, that's interesting. So on small contexts, bnb.nf4 performs better, but on longer contexts bnb.nf4 performs worse?

I guess so.

carmocca commented 9 months ago

Could you run an experiment with both short and long contexts, using precision alone and precision + quantization, for the same model, and report the results?

KOVVURISATYANARAYANAREDDY commented 9 months ago

Hello @carmocca, Here are my observations.

Every experiment was run with micro_batch_size = 1 and the Llama-2-7b-hf model. The long context length is 4000 tokens. "Memory utilization observed" was measured using watch nvidia-smi.

| Context length | Precision / quantization | Memory displayed at end of run (GB) | Memory utilization observed (GB) |
| --- | --- | --- | --- |
| Short | --precision "bf16-true" | 21.32 | 22.6 |
| Short | --quantize "bnb.nf4" | 19.62 | 21.5 |
| Short | --quantize "bnb.nf4-dq" | 19.32 | 21.2 |
| Short | --precision "bf16-true" --quantize "bnb.nf4" | 14.15 | 16.1 |
| Short | --precision "bf16-true" --quantize "bnb.nf4-dq" | 13.84 | 15.6 |
| Long | --precision "bf16-true" | 37.47 | 38.8 |
| Long | --quantize "bnb.nf4" | OOM | OOM |
| Long | --quantize "bnb.nf4-dq" | OOM | OOM |
| Long | --precision "bf16-true" --quantize "bnb.nf4" | OOM | OOM |
| Long | --precision "bf16-true" --quantize "bnb.nf4-dq" | OOM | OOM |

Hope this helps. Thank you.

rasbt commented 8 months ago

Referencing #501 because there seems to be a similar issue when using larger microbatch sizes.