Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0
28.15k stars 3.37k forks

Support saving and loading remote paths with FSDP #18786

Open schmidt-ai opened 1 year ago

schmidt-ai commented 1 year ago

Bug description

FSDPStrategy.load_checkpoint casts checkpoint_path to a pathlib.Path here. This mangles URIs such as cloud checkpoint paths, e.g. s3://....

Example:

from pathlib import Path

checkpoint_path = "s3://asd/asd"
# Fails: Path collapses the double slash, yielding "s3:/asd/asd"
assert Path(checkpoint_path).as_posix() == checkpoint_path

NOTE: I am reporting this merely by looking at the source code; I have yet to confirm this with a test.

What version are you seeing the problem on?

master

How to reproduce the bug

from lightning.pytorch.strategies import FSDPStrategy

FSDPStrategy(...).load_checkpoint("s3://my/checkpoint")

Error messages and logs

I believe this exception will be raised.

Environment

Current environment

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

cc @borda @awaelchli @carmocca

awaelchli commented 1 year ago

@schmidt-ai It isn't documented, so it's considered unsupported rather than a bug. I'm happy to convert it to a feature request though.

Our goal here was to build minimally and get the checkpoint loading and saving done well for FSDP. Limiting the scope to the local filesystem is not totally unreasonable in my opinion. It should be possible to extend the logic to make use of fsspec to cover the other cases.
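To illustrate, a minimal sketch of the kind of guard fsspec-based handling would need (the helper name and logic here are hypothetical, not Lightning's actual API): only paths without a remote URI scheme are safe to cast to `pathlib.Path`; anything else should be routed through fsspec so the URI stays intact.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_local_path(checkpoint_path) -> bool:
    """Hypothetical check: True when the path has no remote URI scheme.

    Paths with a scheme like "s3" or "gs" should be handled via fsspec
    rather than cast to pathlib.Path. (Note: on Windows, drive letters
    like "C:\\..." would need extra handling, omitted here.)
    """
    return urlparse(str(checkpoint_path)).scheme in ("", "file")


# Local paths can safely become pathlib.Path objects ...
assert is_local_path("/tmp/model.ckpt")
# ... while remote URIs should go through fsspec instead.
assert not is_local_path("s3://my-bucket/model.ckpt")

# The underlying problem: casting a URI to Path collapses the "//".
assert Path("s3://my-bucket/model.ckpt").as_posix() == "s3:/my-bucket/model.ckpt"
```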

schmidt-ai commented 1 year ago

@awaelchli totally makes sense! I was debating whether to file this as a feature request or bug, my apologies!

awaelchli commented 1 year ago

No need to apologize. The request is very good. We know we have some gaps in Lightning in general regarding remote filesystem support and it's good to document them. This reminds me that I need to get back to this one https://github.com/Lightning-AI/lightning/pull/18252 where I got hung up on working on test cases and then other priorities came up. So in fact, it's me who needs to apologize 😄

carmocca commented 7 months ago

PyTorch is also considering supporting fsspec paths automatically: https://github.com/pytorch/pytorch/issues/118036

xin-w8023 commented 3 months ago

> PyTorch is also considering supporting fsspec paths automatically: pytorch/pytorch#118036

@awaelchli Do you think we should extend checkpoint loading & saving for remote filesystems within PyTorch Lightning (if so, I'm happy to open a PR)? Or should we just wait for PyTorch to support it?