Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

Enabling Multi-GPU Inferencing #469

Closed: babytdream closed this issue 4 months ago

babytdream commented 9 months ago

This error appears in another project when I use 16 A10 GPUs (16× 23 GB) to run inference with Llama2-70B (screenshot attached).

I have asked many people about this problem, but without success. I know it works with 8 GPUs! But I need to increase the prompt length for Llama 2, and 8 GPUs are not enough. Do you have any ideas? Can this project solve it? Thanks!

carmocca commented 9 months ago

What command did you run? Did you make any modifications to the scripts?

Do you have 16 GPUs in one machine or 2 machines with 8 GPUs each?

babytdream commented 9 months ago

@carmocca I have 16 GPUs in one machine. Here is the GPU info (screenshot attached).

The command is: first, I downloaded llama2-70b-chat and converted it to Hugging Face format using this script. Then I ran the command below, as mentioned in download_llama_2.md: python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/meta-llama/Llama-2-70b-chat-hf

Then I ran this command: python chat/base.py --checkpoint_dir checkpoints/meta-llama/Llama-2-70b-chat-hf

The error is attached (log1.txt and a screenshot).

I think the model file lit_model.pth should be sharded, and the script should use torch.nn.DataParallel to support multiple GPUs. Do you have any ideas? Thanks!
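For context, torch.nn.DataParallel would not actually help here: it replicates the full model on every GPU and only splits the input batch, so per-GPU memory does not shrink. A minimal, hypothetical sketch of that behavior, using a small stand-in model in place of the 70B checkpoint:

```python
# Hypothetical illustration of the nn.DataParallel suggestion above.
# DataParallel copies the whole model to each GPU and splits the batch,
# so a 70B model would still need to fit on a single device.
# Assumes at least one CUDA device is available.
import torch
import torch.nn as nn

model = nn.Linear(8192, 8192)  # stand-in for the real model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates weights; shards only the batch
model = model.cuda()

out = model(torch.randn(16, 8192, device="cuda"))
print(out.shape)  # torch.Size([16, 8192])
```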

carmocca commented 9 months ago

Oh yes, the chat script doesn't support multi-GPU at the moment: https://github.com/Lightning-AI/lit-gpt/blob/e83c068afc13dd84fd628a8da235cfcfa49a1193/chat/base.py#L149

However, you can use the generate/base.py script, which does support it: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/inference.md#run-a-large-model-on-multiple-smaller-devices
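For reference, here is a heavily simplified sketch of that multi-device path (not the actual generate/base.py code): Lightning Fabric's FSDP strategy shards the parameters across GPUs, so each device holds only a fraction of the model. The toy nn.Sequential stands in for lit-gpt's GPT model:

```python
# Simplified, hypothetical sketch of FSDP-based multi-GPU inference with
# Lightning Fabric. Weights are sharded across devices, not replicated.
import torch
import torch.nn as nn
from lightning.fabric import Fabric

fabric = Fabric(devices=2, strategy="fsdp", precision="bf16-true")
fabric.launch()

with fabric.init_module(empty_init=True):
    # stand-in for lit-gpt's GPT(config); weights are left uninitialized
    # here because the real script loads a checkpoint afterwards
    model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))

model = fabric.setup_module(model)  # wraps the module in FSDP and shards it
model.eval()

with torch.no_grad():
    out = model(torch.randn(1, 4096, device=fabric.device))
fabric.print(out.shape)
```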

babytdream commented 9 months ago

@carmocca Hi, I followed your suggestion and ran: python generate/base.py --checkpoint_dir /data/model/Llama-2-70b-chat-hf/ --strategy fsdp --devices 16

But it fails with an error (screenshot and log1.txt attached). Is there a bug?

carmocca commented 9 months ago

Can you also apply the fix from https://github.com/Lightning-AI/lit-gpt/issues/432#issuecomment-1682259981? This is a known issue introduced by a recent Fabric update.

babytdream commented 9 months ago

@carmocca Hi, I followed your suggestion and ran: python generate/base.py --checkpoint_dir /data/model/Llama-2-70b-chat-hf/ --strategy fsdp --devices 12. But it has been running for more than 1 hour, which doesn't seem normal. Here are the logs:

Loading model '/data/model/Llama-2-70b-chat-hf/lit_model.pth' with {'org': 'meta-llama', 'name': 'Llama-2-70b-chat-hf', 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 80, 'n_head': 64, 'n_embd': 8192, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'n_query_groups': 8, 'shared_attention_norm': False, '_norm_class': 'RMSNorm', 'norm_eps': 1e-05, '_mlp_class': 'LLaMAMLP', 'intermediate_size': 28672, 'condense_ratio': 1}
Time to instantiate model: 0.05 seconds.
Time to load the model weights: 546.27 seconds.
[rank: 11] Global seed set to 1234
[rank: 1] Global seed set to 1234
[rank: 0] Global seed set to 1234
[rank: 5] Global seed set to 1234
[rank: 3] Global seed set to 123

(screenshot attached)

rasbt commented 9 months ago

In my opinion, the "546.27 seconds" loading time also seems very long. I think it usually shouldn't take more than a minute. Do you happen to have the weights on an S3 bucket or something like that?

rasbt commented 9 months ago

Oh sorry, I just compared it to the 7B model, which took ~1 min to load. I once observed loading taking longer (10 min) when I had the .pth file on an S3 bucket.

I don't know the 70B loading times off the top of my head, sorry!
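One quick, hypothetical check for whether storage is the bottleneck: time the raw checkpoint read on its own, using the path from the logs above.

```python
# Time just the torch.load of the converted checkpoint. If this alone takes
# minutes, slow storage (e.g. a network/S3-backed mount) is the likely cause.
import time
import torch

t0 = time.perf_counter()
state_dict = torch.load(
    "/data/model/Llama-2-70b-chat-hf/lit_model.pth", map_location="cpu"
)
print(f"torch.load took {time.perf_counter() - t0:.1f} s")
```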

babytdream commented 9 months ago

Same question as #456.

carmocca commented 4 months ago

Multi-GPU inference is now supported: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/inference.md#run-a-large-model-on-multiple-smaller-devices
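For example, following that tutorial (the device count here is illustrative): python generate/base.py --checkpoint_dir checkpoints/meta-llama/Llama-2-70b-chat-hf --strategy fsdp --devices 4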