Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Floating point exception (core dumped) Llama-2-13b #443

Closed: KOVVURISATYANARAYANAREDDY closed this issue 1 year ago

KOVVURISATYANARAYANAREDDY commented 1 year ago

Hello,

First of all, thank you for such great code; I really appreciate the work.

I am trying to run the Llama-2-13b-chat-hf / llama-2-13b-hf models. I followed the procedure described here to download the model from Hugging Face and convert it to the lit-gpt format.

Then I prepared the Alpaca data using scripts/prepare_alpaca.py.

I made a small change to the get_batch function in finetune/lora.py:

    max_len = 1500   #max(len(s) for s in input_ids) if fabric.device.type != "xla" else longest_seq_length
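
For context, a rough sketch of how that override sits inside get_batch is shown below (the padding values and variable names only approximate the lit-gpt code at the time, so treat it as an illustration rather than the exact source):

    import torch

    def get_batch(data, micro_batch_size=4):
        # Sample a micro-batch from the dataset produced by scripts/prepare_alpaca.py.
        ix = torch.randint(len(data), (micro_batch_size,))
        input_ids = [data[i]["input_ids"].type(torch.int64) for i in ix]
        labels = [data[i]["labels"].type(torch.int64) for i in ix]

        # Original behaviour: pad to the longest sequence in this micro-batch.
        # max_len = max(len(s) for s in input_ids)
        # Change described in this issue: pad every batch to a fixed length instead.
        max_len = 1500

        def pad_right(x, pad_id):
            # Right-pad a 1D tensor up to max_len.
            n = max_len - len(x)
            return torch.cat((x, torch.full((n,), pad_id, dtype=x.dtype)))

        x = torch.stack([pad_right(ids, pad_id=0) for ids in input_ids])
        y = torch.stack([pad_right(lab, pad_id=-1) for lab in labels])
        return x, y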

Then I ran finetune/lora.py with the prepared Alpaca dataset and the checkpoints/meta-llama/Llama-2-13b-hf/ checkpoint.

I have 8 A100 40GB GPUs and am trying to run on multiple GPUs (devices=4).

I am hitting the following error, which does not occur with the llama-7b model:

      (litgpt) /lit-gpt$ python finetune/lora.py
              /home/ubuntu/anaconda3/envs/litgpt/lib/python3.9/site-packages/pydantic/_migration.py:282: UserWarning: `pydantic.utils:Representation` has been removed. We are importing from `pydantic.v1.utils:Representation` instead.See the migration guide for more details: https://docs.pydantic.dev/latest/migration/
                warnings.warn(
              {'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'log_interval': 1, 'devices': 2, 'learning_rate': 0.0003, 'batch_size': 128, 'micro_batch_size': 4, 'gradient_accumulation_iters': 32, 'max_iters': 50000, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
              Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
              /home/ubuntu/anaconda3/envs/litgpt/lib/python3.9/site-packages/pydantic/_migration.py:282: UserWarning: `pydantic.utils:Representation` has been removed. We are importing from `pydantic.v1.utils:Representation` instead.See the migration guide for more details: https://docs.pydantic.dev/latest/migration/
                warnings.warn(
              Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
              [rank: 1] Global seed set to 1337
              ----------------------------------------------------------------------------------------------------
              distributed_backend=nccl
              All distributed processes registered. Starting with 2 processes
              ----------------------------------------------------------------------------------------------------

              [rank: 0] Global seed set to 1337
              Loading model '../../lit-gpt/checkpoints/meta-llama/Llama-2-13b-chat-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-13b-chat-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 40, 'n_head': 40, 'n_embd': 5120, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 40, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 13824, 'condense_ratio': 1, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False}
              Number of trainable parameters: 6,553,600
              Number of non trainable parameters: 13,015,864,320
              [rank: 1] Global seed set to 1338
              [rank: 0] Global seed set to 1337
              Validating ...
              Floating point exception (core dumped)

Also, I have a question: how can we apply this lit-gpt approach to other models like StarCoder? Please suggest the steps for doing so.

Please suggest changes. Thank you in advance.

KOVVURISATYANARAYANAREDDY commented 1 year ago
              F.conv1d(input, weight, groups=sum(self.enable_lora))

This call, in the conv1d method of the LoRAQKVLinear class in litgpt/lora.py, is what caused the issue.
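
For anyone debugging the same crash: that call is just the LoRA B projection applied once per enabled group (query and value here). Below is a small standalone sketch with hypothetical shapes based on the config printed above (n_embd=5120, r=8, two enabled groups); it shows what the grouped conv1d computes and can be run on its own to test a given PyTorch build:

    import torch
    import torch.nn.functional as F

    # Hypothetical shapes loosely modeled on the Llama-2-13b LoRA config above:
    # n_embd = 5120, r = 8, LoRA enabled for query and value only -> 2 groups.
    # (seq_len is kept small here; the run above used max_len = 1500.)
    batch, seq_len, n_embd, r, n_groups = 4, 32, 5120, 8, 2

    # Activations after the LoRA A projection: one r-sized channel chunk per enabled group.
    after_A = torch.randn(batch, r * n_groups, seq_len)

    # LoRA B weights laid out as 1x1 conv filters: each group maps r channels to n_embd outputs.
    lora_B = torch.randn(n_embd * n_groups, r, 1)

    # groups=n_groups restricts each group's filters to its own r input channels,
    # i.e. two independent low-rank matmuls fused into a single grouped convolution.
    after_B = F.conv1d(after_A, lora_B, groups=n_groups)
    print(after_B.shape)  # torch.Size([4, 10240, 32])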

I suspect something was wrong with my environment. Creating a separate NVIDIA PyTorch Docker container with a nightly PyTorch build resolved the issue, but the FLOPs are now lower.

carmocca commented 1 year ago

This must be an issue with PyTorch; there is nothing we can do about it here. If you save the inputs passed to the function and can reproduce the crash with them afterwards, the PyTorch team may be able to help once you open an issue in their repo.
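
A minimal sketch of one way to do that is below; the dump file name and the exact save point are assumptions, not anything lit-gpt provides:

    import torch
    import torch.nn.functional as F

    # Sketch: inside LoRAQKVLinear.conv1d, just before the failing call, one could add
    #     torch.save({"input": input.detach().cpu(),
    #                 "weight": weight.detach().cpu(),
    #                 "groups": sum(self.enable_lora)}, "conv1d_repro.pt")
    # (the file name is arbitrary). Then replay the exact same tensors in isolation,
    # e.g. under a different PyTorch build, with a standalone script:
    blob = torch.load("conv1d_repro.pt")
    out = F.conv1d(blob["input"].cuda(), blob["weight"].cuda(), groups=blob["groups"])
    print(out.shape)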

KOVVURISATYANARAYANAREDDY commented 1 year ago

Thank you @carmocca for the suggestion.

yang-xy20 commented 10 months ago

I ran into the same problem when loading codegen25-7b-instruct. Did you solve it? @KOVVURISATYANARAYANAREDDY

KOVVURISATYANARAYANAREDDY commented 10 months ago

Try a different PyTorch version, or use the NVIDIA PyTorch Docker image; I think that is what worked, but I can't recall exactly.

yang-xy20 commented 10 months ago

> Try a different PyTorch version, or use the NVIDIA PyTorch Docker image; I think that is what worked, but I can't recall exactly.

Thanks a lot. It does work.

KOVVURISATYANARAYANAREDDY commented 10 months ago

Please post the procedure you followed so that others can benefit. I forgot to post mine when I resolved this, and I don't remember exactly what I did. Thank you.