KovenYu / MAR

PyTorch code for our CVPR'19 (oral) work: Unsupervised person re-identification by soft multilabel learning
https://kovenyu.com/publication/2019-cvpr-mar/

nan error #35

Closed: shuxjweb closed this issue 4 years ago

shuxjweb commented 4 years ago

I used the default parameters, except that I changed the batch size to 64 because of limited GPU memory. However, the losses become NaN by the end of the first epoch and stay NaN afterwards:

python version : 3.5.2 (default, Oct 8 2019, 13:06:37) [GCC 5.4.0 20160609]
torch version : 1.0.1

------------------------------------------------------- options --------------------------------------------------------
batch_size: 64 beta: 0.2 crop_size: (384, 128)
epochs: 20 gpu: 0 img_size: (384, 128)
lamb_1: 0.0002 lamb_2: 50.0 lr: 0.0002
margin: 1.0 mining_ratio: 0.005 ml_path: data/ml_Market.dat
padding: 7 pretrain_path: data/pretrained_weight.pth print_freq: 100
resume: save_path: debug scala_ce: 30.0
source: MSMT17 target: Market wd: 0.025

loaded pre-trained model from data/pretrained_weight.pth

==>>[2020-04-17 18:03:46] [Epoch=000/020] Stage 1, [Need: 00:00:00]
initializing centers/threshold ...
loaded ml from data/ml_Market.dat
initializing centers done.
initializing threshold done.
Iter: [000/3877] Freq 10.6 loss_target 0.000 loss_source 0.070 loss_ml 13879.977 loss_st 0.451 loss_total 10.812 [2020-04-17 18:04:08]
Iter: [100/3877] Freq 129.1 loss_target 0.000 loss_source 0.223 loss_ml 12678.907 loss_st 0.578 loss_total 19.478 [2020-04-17 18:04:52]
Iter: [200/3877] Freq 136.8 loss_target 0.000 loss_source 0.718 loss_ml 11486.483 loss_st 0.706 loss_total 45.257 [2020-04-17 18:05:36]
Iter: [300/3877] Freq 139.5 loss_target 0.000 loss_source 1.211 loss_ml 10696.569 loss_st 0.762 loss_total 70.312 [2020-04-17 18:06:20]
Iter: [400/3877] Freq 141.0 loss_target 0.000 loss_source 1.447 loss_ml 10321.419 loss_st 0.782 loss_total 82.236 [2020-04-17 18:07:04]
Iter: [500/3877] Freq 141.1 loss_target 0.000 loss_source 1.512 loss_ml 10035.379 loss_st 0.787 loss_total 85.473 [2020-04-17 18:07:49]
Iter: [600/3877] Freq 141.7 loss_target 0.000 loss_source 1.521 loss_ml 9804.266 loss_st 0.784 loss_total 85.846 [2020-04-17 18:08:33]
Iter: [700/3877] Freq 142.2 loss_target 0.000 loss_source 1.504 loss_ml 9656.519 loss_st 0.777 loss_total 84.899 [2020-04-17 18:09:18]
Iter: [800/3877] Freq 142.5 loss_target 0.000 loss_source 1.480 loss_ml 9529.720 loss_st 0.770 loss_total 83.625 [2020-04-17 18:10:02]
Iter: [900/3877] Freq 142.3 loss_target 0.000 loss_source 1.448 loss_ml 9396.765 loss_st 0.765 loss_total 81.939 [2020-04-17 18:10:47]
Iter: [1000/3877] Freq 142.5 loss_target 0.000 loss_source 1.417 loss_ml 9326.110 loss_st 0.761 loss_total 80.334 [2020-04-17 18:11:32]
Iter: [1100/3877] Freq 142.7 loss_target 0.000 loss_source 1.386 loss_ml 9234.825 loss_st 0.757 loss_total 78.692 [2020-04-17 18:12:16]
Iter: [1200/3877] Freq 142.8 loss_target 0.000 loss_source 1.354 loss_ml 9180.113 loss_st 0.752 loss_total 77.062 [2020-04-17 18:13:00]
Iter: [1300/3877] Freq 142.7 loss_target 0.000 loss_source 1.325 loss_ml 9123.445 loss_st 0.746 loss_total 75.557 [2020-04-17 18:13:46]
Iter: [1400/3877] Freq 142.8 loss_target 0.000 loss_source 1.297 loss_ml 9052.444 loss_st 0.742 loss_total 74.055 [2020-04-17 18:14:30]
Iter: [1500/3877] Freq 142.9 loss_target 0.000 loss_source 1.268 loss_ml 8993.854 loss_st 0.737 loss_total 72.588 [2020-04-17 18:15:14]
Iter: [1600/3877] Freq 143.0 loss_target 0.000 loss_source 1.240 loss_ml 8949.674 loss_st 0.733 loss_total 71.113 [2020-04-17 18:15:58]
Iter: [1700/3877] Freq 142.9 loss_target 0.000 loss_source 1.216 loss_ml 8908.284 loss_st 0.730 loss_total 69.876 [2020-04-17 18:16:44]
Iter: [1800/3877] Freq 143.0 loss_target 0.000 loss_source 1.191 loss_ml 8866.926 loss_st 0.726 loss_total 68.567 [2020-04-17 18:17:28]
Iter: [1900/3877] Freq 143.1 loss_target 0.000 loss_source 1.167 loss_ml 8835.746 loss_st 0.722 loss_total 67.353 [2020-04-17 18:18:12]
Iter: [2000/3877] Freq 143.2 loss_target 0.000 loss_source 1.142 loss_ml 8806.737 loss_st 0.718 loss_total 66.061 [2020-04-17 18:18:56]
Iter: [2100/3877] Freq 143.1 loss_target 0.000 loss_source 1.121 loss_ml 8780.041 loss_st 0.715 loss_total 64.979 [2020-04-17 18:19:42]
Iter: [2200/3877] Freq 143.2 loss_target 0.000 loss_source 1.102 loss_ml 8744.079 loss_st 0.712 loss_total 63.964 [2020-04-17 18:20:26]
Iter: [2300/3877] Freq 143.3 loss_target 0.000 loss_source 1.086 loss_ml 8710.513 loss_st 0.710 loss_total 63.124 [2020-04-17 18:21:10]
Iter: [2400/3877] Freq 143.3 loss_target 0.000 loss_source 1.068 loss_ml 8682.339 loss_st 0.707 loss_total 62.225 [2020-04-17 18:21:54]
Iter: [2500/3877] Freq 143.2 loss_target 0.000 loss_source 1.054 loss_ml 8654.118 loss_st 0.705 loss_total 61.497 [2020-04-17 18:22:40]
Iter: [2600/3877] Freq 143.3 loss_target 0.000 loss_source 1.039 loss_ml 8635.352 loss_st 0.703 loss_total 60.705 [2020-04-17 18:23:24]
Iter: [2700/3877] Freq 143.3 loss_target 0.000 loss_source 1.026 loss_ml 8602.657 loss_st 0.701 loss_total 60.008 [2020-04-17 18:24:08]
Iter: [2800/3877] Freq 143.4 loss_target 0.000 loss_source 1.011 loss_ml 8580.846 loss_st 0.698 loss_total 59.240 [2020-04-17 18:24:52]
Iter: [2900/3877] Freq 143.3 loss_target 0.000 loss_source 0.997 loss_ml 8564.657 loss_st 0.696 loss_total 58.499 [2020-04-17 18:25:38]
Iter: [3000/3877] Freq 143.3 loss_target 0.000 loss_source 0.983 loss_ml 8544.973 loss_st 0.694 loss_total 57.802 [2020-04-17 18:26:22]
Iter: [3100/3877] Freq 143.4 loss_target 0.000 loss_source 0.971 loss_ml 8523.918 loss_st 0.692 loss_total 57.159 [2020-04-17 18:27:06]
Iter: [3200/3877] Freq 143.4 loss_target 0.000 loss_source 0.959 loss_ml 8506.227 loss_st 0.691 loss_total 56.549 [2020-04-17 18:27:51]
Iter: [3300/3877] Freq 143.3 loss_target 0.000 loss_source 0.948 loss_ml 8495.211 loss_st 0.689 loss_total 56.004 [2020-04-17 18:28:36]
Iter: [3400/3877] Freq 143.4 loss_target 0.000 loss_source 0.936 loss_ml 8476.330 loss_st 0.687 loss_total 55.355 [2020-04-17 18:29:20]
Iter: [3500/3877] Freq 143.4 loss_target 0.000 loss_source 0.925 loss_ml 8460.062 loss_st 0.685 loss_total 54.781 [2020-04-17 18:30:04]
Iter: [3600/3877] Freq 143.5 loss_target 0.000 loss_source 0.913 loss_ml 8443.244 loss_st 0.684 loss_total 54.175 [2020-04-17 18:30:48]
Iter: [3700/3877] Freq 143.4 loss_target 0.000 loss_source 0.901 loss_ml 8419.879 loss_st 0.682 loss_total 53.564 [2020-04-17 18:31:34]
Iter: [3800/3877] Freq 143.4 loss_target 0.000 loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:32:18]
Train loss_target 0.000 loss_source nan loss_ml nan loss_st nan loss_total nan

==>>[2020-04-17 18:32:53] [Epoch=001/020] Stage 1, [Need: 09:13:00]
Iter: [000/3877] Freq 43.2 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:32:54]
Iter: [100/3877] Freq 137.4 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:33:40]
Iter: [200/3877] Freq 137.9 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:34:26]
Iter: [300/3877] Freq 138.1 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:35:12]
Iter: [400/3877] Freq 138.2 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:35:58]
Iter: [500/3877] Freq 137.1 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:36:46]
Iter: [600/3877] Freq 136.9 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:37:34]
Iter: [700/3877] Freq 136.6 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:38:21]
Iter: [800/3877] Freq 136.7 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:39:08]
Iter: [900/3877] Freq 136.0 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:39:57]
Iter: [1000/3877] Freq 136.2 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:40:43]
Iter: [1100/3877] Freq 136.4 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:41:29]
Iter: [1200/3877] Freq 136.6 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:42:15]
Iter: [1300/3877] Freq 136.6 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:43:02]
Iter: [1400/3877] Freq 136.8 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:43:48]
Iter: [1500/3877] Freq 136.9 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:44:34]
Iter: [1600/3877] Freq 137.0 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:45:20]
Iter: [1700/3877] Freq 137.0 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:46:07]
Iter: [1800/3877] Freq 137.1 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:46:53]
Iter: [1900/3877] Freq 137.1 loss_target nan loss_source nan loss_ml nan loss_st nan loss_total nan [2020-04-17 18:47:40]

xiekun2019 commented 4 years ago

I have encountered this problem too. Try a lower learning rate, for example lr = 0.00002, or use the paper's default settings.
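
For anyone who wants to try this, here is a minimal sketch of the idea: use the lower learning rate and skip any update whose loss is NaN or Inf, so a single bad batch cannot write NaN into the weights. Everything in it is an assumption for illustration, not code from this repository: the `train_step` helper is hypothetical, the tiny `nn.Linear` model and cross-entropy loss are stand-ins for the MAR network and its losses, and SGD with momentum is just a placeholder optimizer.

```python
import torch
from torch import nn


def train_step(model, optimizer, loss_fn, inputs, targets):
    """Run one update; skip it when the loss is NaN or Inf.

    Returns the loss as a float, or None if the step was skipped.
    (Hypothetical helper, not part of the MAR code base.)
    """
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    if not torch.isfinite(loss).item():
        # Do not call backward()/step(): the weights stay untouched.
        return None
    loss.backward()
    optimizer.step()
    return loss.item()


# Stand-in model and data (assumptions, not the real network or datasets).
torch.manual_seed(0)
model = nn.Linear(8, 4)
loss_fn = nn.CrossEntropyLoss()
# Lower learning rate suggested above (0.00002 instead of 0.0002).
optimizer = torch.optim.SGD(model.parameters(), lr=0.00002, momentum=0.9)

inputs = torch.randn(64, 8)
targets = torch.randint(0, 4, (64,))
ok = train_step(model, optimizer, loss_fn, inputs, targets)

# Simulate a bad batch: one NaN in the input makes the loss NaN.
bad_inputs = inputs.clone()
bad_inputs[0, 0] = float("nan")
before = [p.detach().clone() for p in model.parameters()]
skipped = train_step(model, optimizer, loss_fn, bad_inputs, targets)
unchanged = all(torch.equal(b, p.detach()) for b, p in zip(before, model.parameters()))
```

Skipping the step only hides the symptom. If it triggers often, the learning rate is probably still too high for the chosen batch size. Once NaN reaches the weights, every later loss is NaN as well, which would match the log above, where all losses stay nan after they first appear.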