Bug description

When using the FSDP strategy with HYBRID_SHARD set, the loss behaves as if only one node is training. When the strategy is set to FULL_SHARD (or the other sharding strategies), the loss drops as expected when more nodes are added and the batch size is left constant. I have verified that the NCCL connections are working correctly, since everything behaves as expected when using FULL_SHARD.
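For reference, a minimal sketch of how the two sharding strategies would typically be selected with Lightning's FSDPStrategy; the device/node counts, model, and datamodule below are placeholders, not the reporter's actual setup:

```python
# Minimal sketch (assumed setup, not the reporter's code): toggling the FSDP
# sharding strategy in PyTorch Lightning.
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import FSDPStrategy

# "HYBRID_SHARD" shards within each node and replicates across nodes;
# switch to "FULL_SHARD" to compare the loss curves described above.
strategy = FSDPStrategy(sharding_strategy="HYBRID_SHARD")

trainer = Trainer(
    accelerator="gpu",
    devices=8,        # GPUs per node (placeholder)
    num_nodes=2,      # with HYBRID_SHARD the loss looks like num_nodes=1
    strategy=strategy,
)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule are placeholders
```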
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
```
# Error messages and logs here please
```
Environment
Current environment
```
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
```
More info
No response