Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

FSDP full state dict mangles fsspec path #20406

Open oceanusxiv opened 2 weeks ago

oceanusxiv commented 2 weeks ago

Bug description

In `FSDPStrategy.save_checkpoint`, the `filepath` variable is transformed at https://github.com/Lightning-AI/pytorch-lightning/blob/3627c5bfac704d44c0d055a2cdf6f3f9e3f9e8c1/src/lightning/pytorch/strategies/fsdp.py#L562. This transformation only makes sense when doing sharded checkpointing; it mangles any legitimate fsspec path that is passed in.
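For reference, a minimal sketch of the suspected mangling, assuming the linked line effectively casts the filepath to a `pathlib.Path` (the example URL is a placeholder):

```python
from pathlib import Path

# Sketch only: assumes the transformation at the linked line is effectively a
# cast to pathlib.Path. On POSIX, pathlib collapses the double slash after the
# URL scheme, so the fsspec path loses its "s3://" prefix.
filepath = "s3://example/path/checkpoints/last.ckpt"
mangled = str(Path(filepath))
print(filepath)  # s3://example/path/checkpoints/last.ckpt
print(mangled)   # s3:/example/path/checkpoints/last.ckpt
```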

When `self._state_dict_type == "full"`,

```python
super().save_checkpoint(checkpoint=checkpoint, filepath=path)
```

is called, which goes through the normal CheckpointIO workflow, but with the mangled path.

The expected behavior is that when the user chooses the full state-dict type, CheckpointIO and remote paths work as usual; currently, however, full state-dict checkpoints cannot be saved to remote paths.
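To make the impact concrete, here is a hedged sketch of why the mangled path can no longer reach a remote store, assuming path resolution goes through fsspec's protocol inference (the URL is a placeholder):

```python
from pathlib import Path

from fsspec.utils import infer_storage_options

# Sketch under the assumption that remote checkpoint paths are resolved via
# fsspec protocol inference. With the intact URL the protocol is "s3"; after
# the pathlib round-trip the scheme is gone and fsspec falls back to the local
# filesystem, so the checkpoint is never written to the remote store.
url = "s3://example/path/last.ckpt"
print(infer_storage_options(url)["protocol"])             # "s3"
print(infer_storage_options(str(Path(url)))["protocol"])  # "file"
```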

What version are you seeing the problem on?

v2.4

How to reproduce the bug

```python
import lightning as L

trainer = L.Trainer(
    strategy="fsdp",
    default_root_dir="s3://example/path",
)

trainer.fit(model=...)
```
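
For completeness, a fuller hedged sketch of a reproduction, assuming the built-in `BoringModel` demo and a placeholder S3 bucket (none of these specifics are from the original report):

```python
import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

# Hypothetical end-to-end sketch (model, device count, and bucket are
# placeholders). FSDP uses the full state-dict type by default, so saving the
# checkpoint at the end of fit goes through the regular CheckpointIO path
# with the mangled "s3:/..." filepath.
if __name__ == "__main__":
    model = BoringModel()
    trainer = L.Trainer(
        strategy="fsdp",
        accelerator="gpu",
        devices=2,
        max_steps=1,
        default_root_dir="s3://example/path",  # placeholder bucket
    )
    trainer.fit(model)
```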

Error messages and logs

No response

Environment

Current environment

```
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
```

More info

No response

lantiga commented 1 week ago

Thank you @oceanusxiv. If you could send a complete repro, it would speed up the fix.