Stack-Attack opened this issue 2 years ago
I wonder if it would be beneficial to allow the 'ckpt_path' argument of Trainer.fit() to accept a dict loaded from a .ckpt file using torch.load. Then you could manually remove problematic state entries if required.
e.g.
weights = torch.load(checkpoint, map_location=model.device)
del weights['callbacks']
trainer.fit(model, ckpt_path=weights)
@Stack-Attack Hello! I'm facing the same issue, but on the same machine.
I do the following:
1) Train an experiment with the Neptune logger and the ModelCheckpoint callback.
2) Start fine-tuning from the full .ckpt saved in step 1, under a different experiment name, on the same machine.
3) Hit this error, even though all of the paths exist and have not changed.
Is using torch.load and removing the callback state the best way to handle this issue?
@ArtemSivtsov Yes. For now, if I make any large changes to a model or experiment, I create a new run, load the weights manually, and train without passing a checkpoint to the Trainer. Roughly the following logic:
if cfg.trainer_cfg.new_run and checkpoint is not None:
    # Restore only the model weights; drop the trainer/callback state.
    weights = torch.load(checkpoint, map_location=model.device)
    model.load_state_dict(weights["state_dict"])
    checkpoint = None
trainer.fit(model, datamodule=datamodule, ckpt_path=checkpoint)
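If you also want to keep the optimizer and loop state, a rough, untested variant is to strip only the 'callbacks' entry and re-save the checkpoint before resuming (the 'stripped.ckpt' path is just illustrative, and this assumes the Trainer tolerates a checkpoint without a 'callbacks' key):

import torch

# Keep optimizer/scheduler/loop state, but drop the callback state
# that points at directories which no longer exist.
ckpt = torch.load(checkpoint, map_location="cpu")
ckpt.pop("callbacks", None)
stripped_path = "stripped.ckpt"  # illustrative location for the re-saved file
torch.save(ckpt, stripped_path)

trainer.fit(model, datamodule=datamodule, ckpt_path=stripped_path)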
@Stack-Attack Thank you so much for the quick reply! I hope the Lightning team will fix this behavior later :)
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
Hey @Stack-Attack, @ArtemSivtsov, Aleksander here, an engineer at Neptune.ai. I came across this issue a couple of days ago. Would it be possible for you to create/share a minimal code snippet that would help reproduce the bug?
Bug description
Loading a checkpoint with the ModelCheckpoint callback on a different machine (or with a missing/moved "best_model_path" dir) results in an error and crash.
A common use case for me is to train a model (with the .ckpt stored elsewhere, e.g. in Neptune) and then pull the checkpoint down to another machine to continue training later. This used to work in older versions, but now breaks. Currently, the code handles the case where the directory structure has changed, but not larger changes to the absolute file structure.
How to reproduce the bug
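A minimal sketch of the pattern (module names, paths, and the checkpoint filename are placeholders, not the exact original setup):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Machine A: train and let ModelCheckpoint write to a local directory.
ckpt_cb = ModelCheckpoint(dirpath="/data/run_a/checkpoints", save_top_k=1)
trainer = pl.Trainer(max_epochs=1, callbacks=[ckpt_cb])
trainer.fit(MyLightningModule(), datamodule=MyDataModule())  # placeholder module/datamodule

# Machine B: copy the .ckpt over; the original dirpath/best_model_path
# no longer exists here. Resuming restores the ModelCheckpoint state that
# references the missing directory and the run crashes.
trainer = pl.Trainer(max_epochs=2, callbacks=[ModelCheckpoint()])
trainer.fit(MyLightningModule(), datamodule=MyDataModule(),
            ckpt_path="/other/place/epoch=0-step=100.ckpt")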
Error messages and logs
Environment
More info
It seems like the simplest and most straightforward solution is to not restore the ModelCheckpoint state at all if the directory has changed. There are more complex solutions (such as checking each field individually), but given that this callback's state is so tightly coupled to the file structure, that seems ill-advised.
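As a user-side stopgap, something along these lines could approximate that behavior; this is only a sketch and assumes a Lightning version where the saved ModelCheckpoint state includes a 'dirpath' entry:

from pytorch_lightning.callbacks import ModelCheckpoint

class PortableModelCheckpoint(ModelCheckpoint):
    # Sketch: if the checkpoint was written with a different dirpath
    # (e.g. on another machine), keep fresh callback state instead of
    # restoring best_model_path bookkeeping that points nowhere.
    def load_state_dict(self, state_dict):
        if state_dict.get("dirpath") != self.dirpath:
            return
        super().load_state_dict(state_dict)

Passing this subclass to the Trainer in place of the stock ModelCheckpoint would let resumes from moved checkpoints proceed with a clean callback state.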