Liujingxiu23 closed this issue 1 year ago.
Problem solved; it was an issue on my end.
How did you solve this problem?
@Liujingxiu23, I'm having a similar issue. I want to continue training from an initial test run of 20 epochs of fine-tuning that I ran from musicgen-large. I don't want to resume from the same xp because I want to reset the LR schedule. How do I continue from my own checkpoint or saved model, rather than from Meta's base model (or from the current xp folder)?
I tried something like:
continue_from: //path/to/my/saved/model
in the config, but that gives me the same "Worker 1 died, killing all workers" error that you were seeing.
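One thing that might help narrow it down (only a guess about the cause): the worker error hides the underlying exception, so it is worth checking by hand what the file you point continue_from at actually contains, e.g. whether it is a full training checkpoint or just an exported/flat state dict. A minimal sketch, with the path as a placeholder:

    import torch

    # Placeholder path: point this at the file you passed to continue_from.
    ckpt_path = "/path/to/my/saved/model"

    obj = torch.load(ckpt_path, map_location="cpu")

    # A resumable training checkpoint is normally a dict of sub-states
    # (model weights, optimizer, scheduler, ...); an exported model is
    # usually just a flat state dict of tensors. Printing the top-level
    # keys shows which one you have.
    if isinstance(obj, dict):
        print(list(obj.keys()))
    else:
        print(type(obj))

If the top-level keys turn out to be parameter names rather than sections like model/optimizer, the file is probably an exported model rather than a resumable training checkpoint, which could explain the crash when the solver tries to restore its full state from it.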
I started training from a pretrained model and everything went well. But when I stop the training and then restart the training process, the correct xp sig is found, the checkpoint file path is found, and torch.load executes successfully.
However, when running solver.musicgen.load_state_dict, which calls super().load_state_dict(state), the super().load_state_dict(state) call fails.
The only logger output is: Executor: Worker 1 died, killing all workers
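Since the executor only logs "Worker 1 died, killing all workers", the real exception from load_state_dict is swallowed. One way to surface what is actually mismatched is to load the checkpoint outside the training loop and try the weights against the pretrained model with strict=False. This is only a debugging sketch: the checkpoint path and the "model" / "best_state" key names are assumptions, so adjust them to whatever your checkpoint actually contains.

    import torch
    from audiocraft.models import MusicGen

    ckpt = torch.load("/path/to/xp/checkpoint.th", map_location="cpu")

    # Assumption: the LM weights sit under a "model" or "best_state" entry
    # of the checkpoint dict; print ckpt.keys() first if this does not
    # match your file.
    if isinstance(ckpt, dict):
        state = ckpt.get("model", ckpt.get("best_state", ckpt))
    else:
        state = ckpt

    mg = MusicGen.get_pretrained("facebook/musicgen-large")

    # strict=False reports key mismatches instead of raising; a tensor
    # shape mismatch will still raise here, which also tells you what is wrong.
    result = mg.lm.load_state_dict(state, strict=False)
    print("missing keys:", result.missing_keys[:10])
    print("unexpected keys:", result.unexpected_keys[:10])

If this prints missing or unexpected keys (or raises a shape-mismatch error), that is likely the same error the worker is dying on, just with the message visible this time.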