Liujingxiu23 closed this issue 1 year ago.
Problem solved; it was an issue on my end.
How did you solve this problem?
@Liujingxiu23, I'm having a similar issue. I want to continue training from an initial test run of 20 epochs of fine-tuning that I ran from musicgen-large. I don't want to resume from the same xp because I want to reset the LR schedule. How do I continue from my own checkpoint or saved model, rather than from Meta's base model (or from the current xp folder)?
I tried something like:
continue_from: //path/to/my/saved/model
in the config, but that gives me the same "Worker 1 died, killing all workers" error that you were seeing.
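One thing that might help narrow it down (only a guess about the cause): the worker error hides the underlying exception, so it is worth checking by hand what the file you point continue_from at actually contains, e.g. whether it is a full training checkpoint or just an exported/flat state dict. A minimal sketch, with the path as a placeholder:

    import torch

    # Placeholder path: point this at the file you passed to continue_from.
    ckpt_path = "/path/to/my/saved/model"

    obj = torch.load(ckpt_path, map_location="cpu")

    # A resumable training checkpoint is normally a dict of sub-states
    # (model weights, optimizer, scheduler, ...); an exported model is
    # usually just a flat state dict of tensors. Printing the top-level
    # keys shows which one you have.
    if isinstance(obj, dict):
        print(list(obj.keys()))
    else:
        print(type(obj))

If the top-level keys turn out to be parameter names rather than sections like model/optimizer, the file is probably an exported model rather than a resumable training checkpoint, which could explain the crash when the solver tries to restore its full state from it.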
I started training from a pretrained model and everything went well. But when I stop the training and then restart the training process, the correct xp sig is found, the checkpoint file path is found, and torch.load executes successfully.
However, when running solver.musicgen.load_state_dict, which calls super().load_state_dict(state), the super().load_state_dict(state) call fails.
The only logger output is: Executor: Worker 1 died, killing all workers
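Since the executor only logs "Worker 1 died, killing all workers", the real exception from load_state_dict is swallowed. One way to surface what is actually mismatched is to load the checkpoint outside the training loop and try the weights against the pretrained model with strict=False. This is only a debugging sketch: the checkpoint path and the "model" / "best_state" key names are assumptions, so adjust them to whatever your checkpoint actually contains.

    import torch
    from audiocraft.models import MusicGen

    ckpt = torch.load("/path/to/xp/checkpoint.th", map_location="cpu")

    # Assumption: the LM weights sit under a "model" or "best_state" entry
    # of the checkpoint dict; print ckpt.keys() first if this does not
    # match your file.
    if isinstance(ckpt, dict):
        state = ckpt.get("model", ckpt.get("best_state", ckpt))
    else:
        state = ckpt

    mg = MusicGen.get_pretrained("facebook/musicgen-large")

    # strict=False reports key mismatches instead of raising; a tensor
    # shape mismatch will still raise here, which also tells you what is wrong.
    result = mg.lm.load_state_dict(state, strict=False)
    print("missing keys:", result.missing_keys[:10])
    print("unexpected keys:", result.unexpected_keys[:10])

If this prints missing or unexpected keys (or raises a shape-mismatch error), that is likely the same error the worker is dying on, just with the message visible this time.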