
NVlabs / DG-Net

:couple: Joint Discriminative and Generative Learning for Person Re-identification. CVPR'19 (Oral) :couple:

https://www.zdzheng.xyz/publication/Joint-di2019


Failed to converge #60

Open iyu-Fang opened 4 years ago

iyu-Fang commented 4 years ago

Hi, Thank you for your work.

When I trained the model with the parameters you provided in configs.yaml, I could not reproduce your results on the Market dataset. When I use the visual_tools to show the rainbow image, the generated images have the wrong colors (they even differ from the colors of the input images). I then checked the losses in TensorBoard and found that the total loss, as well as the id loss, surged to very high values at 30k iterations and never converged afterwards. However, when I downloaded the best model you provided and tested it in the same way, it worked well.

Please give me some advice.

layumi commented 4 years ago

Hi @iyu-Fang At 30k iterations we start to apply the teacher loss (gradually). https://github.com/NVlabs/DG-Net/blob/c0ee2dff34662b10e904eb08249c14661f2306b1/trainer.py#L492

Did you change any parameters in the configs, such as the batch size or lrRate?
You may try tuning down the lrRate.
Check the teacher model. Does it work well?
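
To illustrate what "gradually" applying a loss weight can look like, here is a minimal sketch. It is not the code at the linked line: the function name, the parameter names, and all default values are assumptions chosen for illustration only.

```python
def ramped_weight(iteration, ramp_start=30_000, step=0.0005, max_weight=2.0):
    """Hypothetical linear ramp for a loss weight.

    The weight is 0 before `ramp_start`, then grows by `step` per
    iteration until it is capped at `max_weight`.
    All defaults are illustrative, not taken from the DG-Net configs.
    """
    if iteration < ramp_start:
        return 0.0
    # Number of iterations since the ramp began, counting the first one.
    steps_taken = iteration - ramp_start + 1
    return min(max_weight, steps_taken * step)
```

With a schedule like this, the total loss is expected to jump when the extra term switches on, so a rise at the ramp start is not by itself a sign of divergence. A loss that keeps growing afterwards points to a learning rate or a cap that is too high, or to a teacher model that is not working.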

iyu-Fang commented 4 years ago

Hi @layumi Thank you for your advice.

I did not modify the batch size, so I checked the performance of the teacher model. Testing the (best) model you provide, I got 0.81 Rank@1 and 0.54 mAP. However, even when I retrain a new teacher model that works well, DG-Net still cannot converge. BTW, I've set max_teacher_w to 0.2, and it still works badly so far.

I would appreciate it if you could give me some further suggestions.

layumi commented 4 years ago

Hi @iyu-Fang The teacher model performance is not right. Please check the version of your numpy.

https://github.com/layumi/Person_reID_baseline_pytorch#prerequisites Some reports found that updating numpy can restore the right accuracy. If you only get 50~80 Top1 Accuracy, just try it. We have successfully run the code based on numpy 1.12.1 and 1.13.1.

iyu-Fang commented 4 years ago

I don't think that's the problem. My numpy version is 1.19.1. Could you tell me the exact versions in your environment (numpy, pytorch, etc.) when you ran your experiments?
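
For comparing environments, a small helper like the following can collect the installed versions in one place. This is only a sketch: the function name `report_versions` and the default package list are assumptions, not part of DG-Net or the baseline repository.

```python
import importlib
import platform


def report_versions(packages=("numpy", "torch", "torchvision")):
    """Return a dict mapping package names to their installed versions.

    Packages that cannot be imported are reported as "not installed".
    The default package list is an assumption; adjust it as needed.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

Calling `report_versions()` on both machines and comparing the two dictionaries makes version mismatches easy to spot.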

layumi commented 4 years ago

Hi @iyu-Fang Could you try to run https://github.com/layumi/Person_reID_baseline_pytorch and check the result?

iyu-Fang commented 4 years ago

@layumi Actually, that's exactly how I tested. The best model got 0.810 Rank1 and 0.543 mAP, while the re-trained model (ResNet-50 (all tricks)) got 0.914 Rank1 and 0.778 mAP. But even when I use the re-trained model as my teacher model, DG-Net still cannot converge.

layumi commented 4 years ago

@iyu-Fang Did you run the model on Market-1501 or on another dataset? Did you load the model config correctly?

iyu-Fang commented 4 years ago

@layumi Thank you for your quick response. Yes, I ran the model on the Market-1501 dataset. As for the config, the one shipped with the best model you provide does not contain the use_NAS parameter, so I added it to the config and set it to false. Nothing else was changed.

layumi commented 4 years ago

@iyu-Fang The teacher model should achieve about 89.6% Rank@1 and 74.5% mAP. I am not sure whether there are any other differences.