Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

Why is FSDPStrategy so slow when I use multiple machines? #1369

Open Graduo opened 2 weeks ago

Graduo commented 2 weeks ago

Hello,
I was struggling to train a 1.5B LLaMA, but I observed an unexpected slowdown when using the FSDP strategy across two machines.

```
FLOPs not found for 'NVIDIA H800'
Measured TFLOPs: 2539.13
Epoch 1 | iter 16 step 1 | loss train: 8.515, val: n/a | iter time: 26133.73 ms (step) remaining time: 909 days, 3:20:37
Epoch 1 | iter 32 step 2 | loss train: 8.509, val: n/a | iter time: 26446.05 ms (step) remaining time: 635 days, 12:08:15
Epoch 1 | iter 48 step 3 | loss train: 8.491, val: n/a | iter time: 26204.95 ms (step) remaining time: 543 days, 2:07:38
Epoch 1 | iter 64 step 4 | loss train: 8.472, val: n/a | iter time: 26227.60 ms (step) remaining time: 496 days, 22:22:41
Epoch 1 | iter 80 step 5 | loss train: 8.492, val: n/a | iter time: 26297.45 ms (step) remaining time: 469 days, 9:35:18
Epoch 1 | iter 96 step 6 | loss train: 8.395, val: n/a | iter time: 25975.68 ms (step) remaining time: 450 days, 10:46:30
Epoch 1 | iter 112 step 7 | loss train: 8.383, val: n/a | iter time: 26152.08 ms (step) remaining time: 437 days, 4:40:59
Epoch 1 | iter 128 step 8 | loss train: 8.314, val: n/a | iter time: 26192.78 ms (step) remaining time: 427 days, 22:27:04
Epoch 1 | iter 144 step 9 | loss train: 8.411, val: n/a | iter time: 26267.13 ms (step) remaining time: 420 days, 6:28:56
```

When I train on a single machine, the iter time is around 700 ms. Could I get any idea about the reason, and how can I fix it? Thank you!

lantiga commented 2 weeks ago

Hi, can you post the CLI args or code you are using? Also, is this with two machines and 8 GPUs per machine?

lantiga commented 2 weeks ago

Just to confirm: are you running the pretraining command?

Maybe try to comment this line out: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py#L174

We have bumped into issues with PyTorch 2.2 and torch.compile recently; let's take this variable out of the equation.
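
For reference, a hedged sketch of what disabling that line amounts to. The `Config` values and surrounding code below are illustrative stand-ins, not the exact contents of `pretrain.py`; the point is simply to skip the `torch.compile` call so the model runs eagerly.

```python
import torch
from litgpt.config import Config
from litgpt.model import GPT

# Illustrative sketch only: the code around litgpt/pretrain.py line 174 may
# differ, and this toy Config is not the user's 1.5B model.
config = Config(block_size=2048, n_layer=2, n_head=4, n_embd=128)
model = GPT(config)

# model = torch.compile(model)  # <-- the kind of call to comment out, so the
#                               #     model runs eagerly and torch.compile is
#                               #     ruled out as a cause of the slowdown.
```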

Graduo commented 2 weeks ago

> Hi, can you post the CLI args or code you are using? Also, is this with two machines and 8 GPUs per machine?

Hi, thanks for your prompt reply! Yes, I use two machines and 8 GPUs per machine. I just use args like:

```bash
fabric run --node-rank=0 --main-address=ip1 --accelerator=cuda --devices=8 --num-nodes=2 litgpt/pretrain_multinode_myllama.py --config config_hub/pretrain/myllama.yaml

fabric run --node-rank=1 --main-address=ip1 --accelerator=cuda --devices=8 --num-nodes=2 litgpt/pretrain_multinode_myllama.py --config config_hub/pretrain/myllama.yaml
```

I once tried using `litgpt run`, but it did not work successfully. Based on the suggestions, I changed it to `fabric run`. And this is the code for the strategy:

```python
strategy = FSDPStrategy(auto_wrap_policy={Block}, state_dict_type="full", sharding_strategy="HYBRID_SHARD")
```
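
For context, a minimal sketch of how a strategy like this is typically wired into Fabric. The Fabric arguments below just mirror the CLI flags above and are illustrative, not the exact pretrain script.

```python
import lightning as L
from lightning.fabric.strategies import FSDPStrategy
from litgpt.model import Block

# HYBRID_SHARD shards parameters and gradients within each node but keeps a
# full replica per node, so every optimizer step still synchronizes gradients
# over the inter-node network link.
strategy = FSDPStrategy(
    auto_wrap_policy={Block},
    state_dict_type="full",
    sharding_strategy="HYBRID_SHARD",
)

# Illustrative setup; the real pretrain script builds Fabric with its own
# precision and logger arguments.
fabric = L.Fabric(accelerator="cuda", devices=8, num_nodes=2, strategy=strategy)
fabric.launch()
```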

Graduo commented 2 weeks ago

> Just to confirm: are you running the pretraining command?
>
> Maybe try to comment this line out: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py#L174
>
> We have bumped into issues with PyTorch 2.2 and torch.compile recently; let's take this variable out of the equation.

Yeah, I am running the pretraining command.