Option to save last checkpoint as copy instead of symlinking

Lightning-AI / pytorch-lightning



Option to save last checkpoint as copy instead of symlinking #18995

Closed: ad12 closed this issue 11 months ago

ad12 commented 1 year ago

Description & Motivation

Saving last.ckpt as a symlink on local file systems makes a lot of sense for most workflows. However, in several cases, users back up their checkpoints to cloud storage (AWS, GCP, etc.). In these scenarios, it is difficult to manage symlinks because they are often an all-or-nothing upload -- i.e. we cannot choose which symlinks to upload without being highly prescriptive on upload.

Checkpoints, especially last.ckpt, are critical for resuming runs, fine-tuning, etc. So we often want to back these up. However, when last.ckpt is a symlink, the backup process to cloud becomes much more involved.

Pitch

Add an option `save_last='copy'`, where we save a copy of the last checkpoint.

Alternatives

No response

Additional context

No response

cc @borda @carmocca @awaelchli

carmocca commented 1 year ago

I didn't fully understand the motivation. What if you added a step at the end of your script that moves/copies the symlink target to the symlink location? This should give you the behaviour that you want without having to copy every in-between last checkpoint along the way. Do you see any problems with this alternative?
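The end-of-script step suggested above could look roughly like the following. This is a minimal standard-library sketch, not part of Lightning; the helper name `materialize_symlink` and the `.tmp` suffix are hypothetical and chosen only for illustration.

```python
import os
import shutil
from pathlib import Path


def materialize_symlink(link: Path) -> None:
    """Replace a symlink with a regular-file copy of the file it points to.

    Hypothetical helper for illustration; it is not a Lightning API.
    """
    if not link.is_symlink():
        return  # already a regular file (or missing), so nothing to do
    # strict=True raises FileNotFoundError if the link target is missing
    target = link.resolve(strict=True)
    # Copy next to the link first, then swap the copy into the link's place.
    tmp = link.with_name(link.name + ".tmp")
    shutil.copy2(target, tmp)
    os.replace(tmp, link)  # replaces the link itself, not its target
```

Calling `materialize_symlink(Path("checkpoints/last.ckpt"))` once after training would leave a real `last.ckpt` file in place of the link, assuming the checkpoints live in a folder named `checkpoints`.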

awaelchli commented 1 year ago

I'm also wondering what the challenge is here. A symbolic link is a file too, so it can be backed up. And it (normally) is a relative path, so if you download the checkpoint folder from your backup to a different location, the link will continue to just work.

ad12 commented 1 year ago

There are several symlink files that can be output to an experiment folder (e.g. W&B run symlinks). Managing symlinks independently of one another is additional overhead with certain tools (e.g. aws cli, rclone, etc.).

Having an option save_last=copy would allow defaulting to the original behavior of checkpointing last.ckpt with minimal implementation overhead.

ad12 commented 1 year ago

What if you added a step at the end of your script that moves/copies the symlink target to the symlink location? This should give you the behaviour that you want without having to copy every in-between last checkpoint along the way.

It would be helpful to be able to do this during training. A workaround is to either:

- have a different process that watches the checkpoints folder and runs a sync when things change, or
- have a callback that does the syncing every time a checkpoint is made.

Both of these seem more involved than adding a save_last='copy' option, which may be useful to other users who want to default to the old checkpointing functionality.
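For reference, the sync step that either workaround would run can be sketched with the standard library alone. The function name `backup_checkpoints` and the idea of a local staging folder that is uploaded afterwards are assumptions for illustration; nothing here is Lightning's API, and how the function gets triggered (watcher process or callback) is left open.

```python
import shutil
from pathlib import Path


def backup_checkpoints(src_dir: Path, dst_dir: Path) -> None:
    """Copy a checkpoint folder, storing symlinked files as regular files.

    Hypothetical helper for illustration; dst_dir stands in for a staging
    folder that a separate tool would upload to cloud storage.
    """
    # symlinks=False (the default) copies the contents each link points to.
    # dirs_exist_ok=True (Python 3.8+) allows repeated calls during training.
    shutil.copytree(src_dir, dst_dir, symlinks=False, dirs_exist_ok=True)
```

Note that this naive version copies every file on each call, so it illustrates the extra work involved more than it recommends a design.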

dljjqy commented 12 months ago

I am facing the same problem. The old version just saves the ckpt file. And now, I get a fking stupid symbolic link. It is so fking useless, I do not want a fking SYMBOLIC LINK. I WANT MY WEIGHTS BACKING!!!!!

bgswaroop commented 10 months ago

I'm also wondering what the challenge is here. A symbolic link is a file too, so it can be backed up. And it (normally) is a relative path, so if you download the checkpoint folder from your backup to a different location, the link will continue to just work.

Not all commands are designed to handle symlinks. For instance, I just faced an issue with pathlib:

    from pathlib import Path
    assert Path("last.ckpt").exists(), "last.ckpt does not exist!"

This raises an AssertionError!

Therefore, in a workflow, when we want to check if a checkpoint path exists before loading from it, this will not work!
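One possible explanation, which the thread does not confirm, is that the link's target could not be found: `Path.exists()` follows symlinks, so it returns `False` for a dangling link even though the link file itself is still there. The sketch below reproduces that in a temporary folder; the checkpoint file name is made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_dir = Path(tmp_dir)
    target = ckpt_dir / "epoch=3-step=400.ckpt"  # made-up checkpoint name
    target.write_bytes(b"weights")

    last = ckpt_dir / "last.ckpt"
    last.symlink_to(target.name)  # relative link within the same folder

    exists_with_target = last.exists()     # follows the link to the target
    target.unlink()                        # simulate a missing target
    exists_without_target = last.exists()  # the link is now dangling
    still_a_link = last.is_symlink()       # checks the link file itself

print(exists_with_target, exists_without_target, still_a_link)
```

A check such as `path.is_symlink() and not path.exists()` can distinguish a dangling link from a path that is missing altogether.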

bgswaroop commented 10 months ago

To add to my previous comment, it is possible to get the actual path using the resolve() method, however...

To get the actual path from the symlink, we can use resolved_path = Path("last.ckpt").resolve()

Let's say that the user has renamed one of the parent directories in the path of last.ckpt. This means resolved_path will always point to the wrong target, which effectively makes last.ckpt inaccessible.
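The renaming problem described here applies when the link stores an absolute path; a relative link inside the same folder keeps working. A small sketch of the difference, using made-up folder and file names in a temporary directory:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    run_dir = Path(tmp_dir) / "run_a"  # made-up experiment folder
    run_dir.mkdir()
    target = run_dir / "epoch=3-step=400.ckpt"
    target.write_bytes(b"weights")

    # One link stores the full path, the other only the file name.
    (run_dir / "last_absolute.ckpt").symlink_to(target.resolve())
    (run_dir / "last_relative.ckpt").symlink_to(target.name)

    # Rename the parent directory, as in the scenario above.
    renamed_dir = Path(tmp_dir) / "run_b"
    run_dir.rename(renamed_dir)

    absolute_ok = (renamed_dir / "last_absolute.ckpt").exists()
    relative_ok = (renamed_dir / "last_relative.ckpt").exists()

print(absolute_ok, relative_ok)
```

The absolute link still points into the old `run_a` path and becomes dangling, while the relative link resolves against its new parent folder.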

I think it would be great to consider this feature request. If required, I am willing to contribute to this issue, although I need to understand the entire workflow!

Looking forward to your suggestions!

awaelchli commented 10 months ago

@bgswaroop This feature request is completed. And https://github.com/Lightning-AI/pytorch-lightning/pull/19303 made the link consistently relative.