Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

FSDP with HYBRID_SHARD loss doesn't improve with more nodes #20385

Open zaptrem opened 1 month ago

zaptrem commented 1 month ago

Bug description

When using the FSDP strategy with the sharding strategy set to HYBRID_SHARD, the loss behaves as if only a single node is training: adding nodes while keeping the batch size constant does not improve it. When the strategy is set to FULL_SHARD (or the other sharding strategies), the loss drops as expected as more nodes are added. I have verified that the NCCL connections are working correctly, since everything behaves as expected with FULL_SHARD.
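For reference, since no repro is attached yet, here is a minimal sketch of the kind of setup being described. The model, device counts, and node count are placeholder assumptions, not the reporter's actual configuration:

```python
# Hypothetical minimal setup: same script run with "HYBRID_SHARD" vs. "FULL_SHARD"
# while the per-device batch size is held constant and num_nodes is increased.
import lightning as L
from lightning.pytorch.strategies import FSDPStrategy
from lightning.pytorch.demos.boring_classes import BoringModel  # stand-in model

strategy = FSDPStrategy(sharding_strategy="HYBRID_SHARD")  # switch to "FULL_SHARD" to compare

trainer = L.Trainer(
    accelerator="gpu",
    devices=8,      # GPUs per node (assumed)
    num_nodes=2,    # loss is expected to improve vs. a single node
    strategy=strategy,
)
trainer.fit(BoringModel())
```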

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

No response

Environment

Current environment

```
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
```

More info

No response

lantiga commented 2 weeks ago

Can you please provide a minimal repro script with the expected results?