Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0
28.39k stars 3.38k forks source link

Support relative paths in ModelCheckpoint state #14443

Open samgelman opened 2 years ago

samgelman commented 2 years ago

🚀 Feature

It would be great if ModelCheckpoint internal state supported relative paths.

Motivation

Currently, if you specify a relative path for dirpath, ModelCheckpoint converts it to an absolute path under the hood. This makes it hard to resume training if the log directory is moved, or if resuming training from a different server with a different directory structure.

For example, I specify the relative path a/b/c, and ModelCheckpoint converts it to cwd()/a/b/c. The model trains correctly for a while. Then, HTCondor reschedules my job on a different server. Now, the cwd() is different, even though the relative path a/b/c is still the same. The job is unable to resume from checkpoint and I get:

UserWarning: The dirpath has changed from <server 1 dirpath> to <server 2 dirpath>, therefore best_model_score, kth_best_model_path, kth_value, last_model_path and best_k_models won’t be reloaded.

Pitch

Create an argument relative_paths=True that would allow ModelCheckpoint to use relative paths in its internal state.

Alternatives

User can create their own checkpoint callback that supports relative paths. But, it would be much nicer if Lightning supported it :-)

Additional context


If you enjoy Lightning, check out our other projects! âš¡

cc @borda @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7 @akihironitta

rohitgr7 commented 2 years ago

wondering what'd be the issues of only using relative paths i.e not changing paths at all.

carmocca commented 2 years ago

I think we could change it. It started to use realpath in https://github.com/Lightning-AI/lightning/pull/2153 with the intent to fix a logic bug when filepath=a_local_file was passed. But filepath does not exist anymore.

It would still be a breaking change as it's possible somebody relies on it being a realpath.