Closed ad12 closed 11 months ago
I didn't fully understand the motivation. What if you added a step at the end of your script that moves/copies the symlink target to the symlink location? This should give you the behaviour that you want without having to copy every in-between last checkpoint along the way. Do you see any problems with this alternative?
I'm also wondering what the challenge is here. A symbolic link is a file too, so it can be backed up. And it (normally) is a relative path, so if you download the checkpoint folder from your backup to a different location, the link will continue to just work.
there are several symlink files that can be output to an experiment folder (e.g. W&B run symlinks). Managing symlinks independently of one another is an additional overhead for certain libraries (e.g. aws cli
, rclone
, etc.)
Having an option save_last=copy
would allow defaulting to original behavior of checkpointing last.ckpt
with minimal implementation overhead.
What` if you added a step at the end of your script that moves/copies the symlink target to the symlink location? This should give you the behaviour that you want without having to copy every in-between last checkpoint along the way.
it would be helpful to be able to do this during training. A workaround is to either
Both of these seem more involved than adding a save_last='copy'
option, which may be useful to other users who may want to default to the old checkpointing functionality
I am facing the same problem.The old version just saves the ckpt file.And now, i get a fking stupid symbolic link.It is so fking useless,I do not want a fking SYMBOLIC LINK.I WANT MY WEIGHTS BACKING!!!!!
I'm also wondering what the challenge is here. A symbolic link is a file too, so it can be backed up. And it (normally) is a relative path, so if you download the checkpoint folder from your backup to a different location, the link will continue to just work.
Not all commands are designed to handle symlinks. For instance, I just faced an issue with pathlib
:
from pathlib import Path
assert Path("last.ckpt").exists(), f'{last.ckpt} does not exists!'
This raises an assertion!
Therefore, in a workflow, when we want to check if a checkpoint path exists before loading from it, this will not work!
To add to my previous comment, it is possible to get the actual path using the resolve()
method however...
To get the actual path from the symlink, we can use
resolved_path = Path("last.ckpt").resolve()
Let's say, that the user has renamed one of the parent directories in the path of last.ckpt. This means, resolved_path
will always point to the wrong target. This effectively makes last.ckpt
unaccessible.
I think it is a great option to consider this feature request. If required, I am willing to contribute to this issue. Although, I need to understard the entire workflow!
Looking forward to your sugestions!
@bgswaroop This feature request is completed. And https://github.com/Lightning-AI/pytorch-lightning/pull/19303 made the link consistently relative.
Description & Motivation
Saving the
last.ckpt
as a symlink on local file systems makes a lot of sense for most workflows. However, in a several cases, users often back up their checkpoints to cloud storage (AWS, GCP, etc.). In these scenarios, it is difficult to manage symlinks because they are often an all-or-nothing upload -- i.e. we cannot choose which symlinks to upload without being highly prescriptive on upload.Checkpoints, especially
last.ckpt
, are critical for resuming runs, fine-tuning, etc. So we often want to back these up. However, whenlast.ckpt
is a symlink, the backup process to cloud becomes much more involved.Pitch
Add option `save_last=copy', where we save a copy of the last checkpoint
Alternatives
No response
Additional context
No response
cc @borda @carmocca @awaelchli