KovenYu / MAR

Pytorch code for our CVPR'19 (oral) work: Unsupervised person re-identification by soft multilabel learning
https://kovenyu.com/publication/2019-cvpr-mar/

question about loss #5

Closed · ShenXianwen closed this 5 years ago

ShenXianwen commented 5 years ago

Hello, thanks for sharing your work.

When I run the code, the loss becomes NaN. I only changed the batch size; all other parameters used the default values.

I don't know how to solve it. Could you give me some suggestions? Thanks.
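
For anyone hitting the same issue, here is a generic PyTorch sketch for pinpointing when the loss first goes NaN (not code from this repo; `model`, `loader`, and `optimizer` are placeholders):

```python
import torch

def train_one_epoch(model, loader, optimizer):
    """Generic NaN watchdog for a training loop (placeholder API, not MAR's)."""
    for step, (images, labels) in enumerate(loader):
        optimizer.zero_grad()
        loss = model(images, labels)   # placeholder: model returns the total loss
        if torch.isnan(loss):          # stop as soon as the loss goes NaN
            raise RuntimeError(f"loss became NaN at step {step}")
        loss.backward()
        optimizer.step()
```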

KovenYu commented 5 years ago

Hi, thanks for your attention. @ShenXianwen

Did you set a small batch size? What was the number, exactly? And when did the NaN appear: after several epochs of training, or right in the first epoch?

Sorry, I cannot try it myself since I have no access to the servers until next week.

ShenXianwen commented 5 years ago

Hello, thanks for your reply. I set the batch size to 64, and the NaN appeared after 2 epochs of training. I didn't use your prepared data (MSMT17.mat and Market.mat); instead, I followed your steps and ran construct_dataset_Market.m and construct_dataset_MSMT17.m in MATLAB. But I did use the prepared_weight.pth.

KovenYu commented 5 years ago

OK, let me try it next week when I have access to the servers.

KovenYu commented 5 years ago

Hi @ShenXianwen, it turns out that the NaN occurs because the default learning rate is too large for a small batch size like 64. A small batch size yields a stronger, noisier gradient (a large batch size averages over more samples and thus smooths the gradient), so we need to turn down the learning rate. I did not experiment much, but dividing the learning rate by 10 should get rid of this problem; see the sketch below.
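
A minimal sketch of the workaround (the model and the numeric values are placeholders, not the repo's actual defaults):

```python
import torch
import torchvision

# When shrinking the batch size (e.g. to 64), divide the learning rate by ~10.
model = torchvision.models.resnet50()   # stand-in for the MAR network
default_lr = 2e-4                        # placeholder default learning rate
optimizer = torch.optim.SGD(model.parameters(),
                            lr=default_lr / 10,  # turned-down lr for small batches
                            momentum=0.9)
```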

However, note that performance will probably drop, since the distribution estimation is less precise with a small batch size.