The-Learning-And-Vision-Atelier-LAVA / DASR

[CVPR 2021] Unsupervised Degradation Representation Learning for Blind Super-Resolution
MIT License

issue about resume #34

Open sujyQ opened 2 years ago

sujyQ commented 2 years ago

Hi.

There's a problem when resuming training.

I tried to restart DASR training with this command:

python main.py --dir_data='my/path' \
               --model='blindsr' \
               --scale='4' \
               --blur_type='aniso_gaussian' \
               --noise=25.0 \
               --lambda_min=0.2 \
               --lambda_max=4.0 \
               --start_epoch=157 \
               --resume=157

The problem is that the contrastive loss gets larger after resuming. I think the parameters of the degradation representation encoder are not being loaded.

[Epoch 158] Learning rate: 1.00e-4
Epoch: [0158][6400/31050]   Loss [SR loss: 9.753 | contrastive loss: 0.892 ]    Time [ 145.0 s]
Epoch: [0158][12800/31050]  Loss [SR loss: 9.747 | contrastive loss: 0.920 ]    Time [ 143.7 s]
Epoch: [0158][19200/31050]  Loss [SR loss: 9.722 | contrastive loss: 0.918 ]    Time [ 144.1 s]
[Epoch 158] Learning rate: 1.00e-4
Epoch: [0158][6400/31050]   Loss [SR loss: 9.598 | contrastive loss: 7.457 ]    Time [ 145.2 s]
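A quick way to confirm that the degradation encoder really is missing from the checkpoint is to look at what load_state_dict skips when strict=False. A minimal sketch, assuming model is the BlindSR module built by model/__init__.py and the checkpoint path matches your experiment directory (both names are placeholders, not the exact DASR code):

import torch

# Placeholder path; point this at the checkpoint you resume from.
ckpt = torch.load('experiment/blindsr/model/model_157.pt', map_location='cpu')

# With strict=False, load_state_dict returns the keys it could not match
# instead of raising, so the silently-skipped parameters become visible.
missing, unexpected = model.load_state_dict(ckpt, strict=False)
print('missing keys:', missing)        # expected to list E.encoder_k.*, E.queue, ...
print('unexpected keys:', unexpected)

If the missing keys include the momentum encoder and the queue, the resumed run starts the contrastive branch from a random state, which would explain the jump from 0.918 to 7.457 above.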
LongguangWang commented 2 years ago

Hi @sujyQ, we will fix this bug in an upcoming update.

sujyQ commented 2 years ago

Hi @LongguangWang, I think this is the problem.

When strict=True is set, the following error occurs:

Traceback (most recent call last):
  File "test.py", line 19, in <module>
    model = model.Model(args, checkpoint)
  File "/home/hsj/d_drive/hsj/hsj/DASR_DDF/model/__init__.py", line 35, in __init__
    cpu=args.cpu
  File "/home/hsj/d_drive/hsj/hsj/DASR_DDF/model/__init__.py", line 104, in load
    strict=True
  File "/home/hsj/anaconda3/envs/pytorch36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BlindSR:
    Missing key(s) in state_dict: "E.queue", "E.queue_ptr", "E.encoder_k.E.0.weight", "E.encoder_k.E.0.bias", "E.encoder_k.E.1.weight", "E.encoder_k.E.1.bias", "E.encoder_k.E.1.running_mean", "E.encoder_k.E.1.running_var", "E.encoder_k.E.1.num_batches_tracked", "E.encoder_k.E.3.weight", "E.encoder_k.E.3.bias", "E.encoder_k.E.4.weight", "E.encoder_k.E.4.bias", "E.encoder_k.E.4.running_mean", "E.encoder_k.E.4.running_var", "E.encoder_k.E.4.num_batches_tracked", "E.encoder_k.E.6.weight", "E.encoder_k.E.6.bias", "E.encoder_k.E.7.weight", "E.encoder_k.E.7.bias", "E.encoder_k.E.7.running_mean", "E.encoder_k.E.7.running_var", "E.encoder_k.E.7.num_batches_tracked", "E.encoder_k.E.9.weight", "E.encoder_k.E.9.bias", "E.encoder_k.E.10.weight", "E.encoder_k.E.10.bias", "E.encoder_k.E.10.running_mean", "E.encoder_k.E.10.running_var", "E.encoder_k.E.10.num_batches_tracked", "E.encoder_k.E.12.weight", "E.encoder_k.E.12.bias", "E.encoder_k.E.13.weight", "E.encoder_k.E.13.bias", "E.encoder_k.E.13.running_mean", "E.encoder_k.E.13.running_var", "E.encoder_k.E.13.num_batches_tracked", "E.encoder_k.E.15.weight", "E.encoder_k.E.15.bias", "E.encoder_k.E.16.weight", "E.encoder_k.E.16.bias", "E.encoder_k.E.16.running_mean", "E.encoder_k.E.16.running_var", "E.encoder_k.E.16.num_batches_tracked", "E.encoder_k.mlp.0.weight", "E.encoder_k.mlp.0.bias", "E.encoder_k.mlp.2.weight", "E.encoder_k.mlp.2.bias".
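Not an official fix, just a sketch of a possible workaround for resuming training: since the checkpoint only contains the query branch, one can load it with strict=False and then re-initialize the momentum (key) encoder and the queue by hand, following the standard MoCo initialization. The names E, encoder_q, encoder_k, queue and queue_ptr below are inferred from the error message above; they should be double-checked against the actual DASR code.

import torch
import torch.nn.functional as F

def resync_momentum_branch(blindsr):
    """Re-initialize the MoCo key encoder and queue after a partial checkpoint load."""
    E = blindsr.E                                  # degradation encoder wrapper (assumed name)
    # Copy query-encoder weights into the key encoder (standard MoCo init).
    for p_q, p_k in zip(E.encoder_q.parameters(), E.encoder_k.parameters()):
        p_k.data.copy_(p_q.data)
        p_k.requires_grad = False                  # key encoder is updated by momentum only
    # Re-fill the negative-sample queue with normalized random vectors
    # and reset the pointer, as in a fresh MoCo run.
    E.queue.copy_(F.normalize(torch.randn_like(E.queue), dim=0))
    E.queue_ptr.zero_()

# usage after loading the checkpoint:
# model.load_state_dict(ckpt, strict=False)
# resync_momentum_branch(model)

This only restores a sane starting point for the contrastive branch; the queue still has to warm up again, so the contrastive loss may stay elevated for a short while after resuming.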

tongchangD commented 2 years ago

How did you solve this problem? https://github.com/LongguangWang/DASR/issues/34#issuecomment-923644014