Bug description

When using the FSDP strategy with HYBRID_SHARD set, the loss behaves as if only one node is training. When the strategy is set to FULL_SHARD (or the other sharding strategies), the loss drops as expected when more nodes are added and the batch size is left constant. I have verified that the NCCL connections are working correctly, since everything behaves as expected when using FULL_SHARD.
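For reference, a minimal sketch of how the two sharding strategies would typically be selected with Lightning's FSDPStrategy; the device/node counts, model, and datamodule below are placeholders, not the reporter's actual setup:

```python
# Minimal sketch (assumed setup, not the reporter's code): toggling the FSDP
# sharding strategy in PyTorch Lightning.
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import FSDPStrategy

# "HYBRID_SHARD" shards within each node and replicates across nodes;
# switch to "FULL_SHARD" to compare the loss curves described above.
strategy = FSDPStrategy(sharding_strategy="HYBRID_SHARD")

trainer = Trainer(
    accelerator="gpu",
    devices=8,        # GPUs per node (placeholder)
    num_nodes=2,      # with HYBRID_SHARD the loss looks like num_nodes=1
    strategy=strategy,
)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule are placeholders
```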
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
```
# Error messages and logs here please
```
Environment
Current environment
```
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
```
More info
No response