CaptainEven / MCMOT

Real time one-stage multi-class & multi-object tracking based on anchor-free detection and ReID

can not load optimizer when --resume #92

Open indigopyj opened 2 years ago

indigopyj commented 2 years ago

I fine-tuned the model on my custom dataset. Then I wanted to resume training with python ./src/train.py --resume

but I got this error: ValueError: loaded state dict has a different number of parameter groups

I didn't change the model structure, the number of classes, or anything else about the training setup.

How can I resume training? Please help.

AiueoABC commented 2 years ago

same here

supperted825 commented 2 years ago

Hi @indigopyj , @AiueoABC ,

Did either of you find a solution to this issue?

supperted825 commented 2 years ago

I discovered that the newly created Adam optimizer has a single parameter group of size 239, while the loaded optimizer state dict has two parameter groups of sizes 239 and 18.

The additional 18 parameters are added in BaseTrainer for training the ReID classification layers:

https://github.com/CaptainEven/MCMOT/blob/24f2efb943ecafe297a68deb7d10b45c2750894e/src/lib/trains/base_trainer.py#L38
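For reference, the extra group comes from something along these lines at the linked spot in BaseTrainer (a simplified sketch; exact attribute names may differ in the repo):

```python
# Sketch: the loss module holds the ReID classifier layers, and its
# parameters are appended to the optimizer as a *second* parameter group,
# which is why a checkpoint saved during training has two groups.
self.optimizer.add_param_group({'params': self.loss.parameters()})
```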

I guess the simple way to get it to work without shifting around a bunch of code would be to just drop these additional params during loading.

I achieved this by adding the following line before loading the optimizer in /src/lib/models/model.py

del checkpoint['optimizer']['param_groups'][1]
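In context, that workaround looks roughly like this inside load_model() in /src/lib/models/model.py (a sketch only; the surrounding structure and the length-check guard are my own additions, not the file's exact code):

```python
# Workaround sketch: strip the extra (ReID classifier) parameter group from
# the checkpoint so it matches the freshly created optimizer, then load it.
# Note the EDIT below: this throws away the saved ReID optimizer state.
if optimizer is not None and 'optimizer' in checkpoint:
    if len(checkpoint['optimizer']['param_groups']) > len(optimizer.param_groups):
        del checkpoint['optimizer']['param_groups'][1]
    optimizer.load_state_dict(checkpoint['optimizer'])
```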

EDIT:

Sorry, no, we should not delete the additional parameters. Instead, we can simply move the load_model call to after the trainer is instantiated in train.py. This lets the trainer add the params to the optimizer before the saved state dict is loaded.
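Roughly, the relevant part of train.py then looks like this (a sketch based on the FairMOT-style training script this repo follows; names such as create_model, train_factory, and the load_model signature are assumptions and may differ slightly):

```python
# Sketch: build the trainer first so BaseTrainer can register the ReID
# classifier parameters as the optimizer's second param group, and only
# then restore the checkpoint (which was saved with both groups).
model = create_model(opt.arch, opt.heads, opt.head_conv)
optimizer = torch.optim.Adam(model.parameters(), opt.lr)

Trainer = train_factory[opt.task]
trainer = Trainer(opt, model, optimizer)  # adds the extra param group

start_epoch = 0
if opt.load_model != '':  # moved: previously this ran before Trainer(...)
    model, optimizer, start_epoch = load_model(
        model, opt.load_model, optimizer, opt.resume, opt.lr, opt.lr_step)
```

Since load_model updates the model and optimizer objects in place, the trainer created above keeps working with the restored weights and optimizer state.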