deepinsight / insightface

State-of-the-art 2D and 3D Face Analysis Project
https://insightface.ai

Trained for more than 20 epochs; lr is now 0.000001, but the loss still becomes NaN #977

Open EdwardVincentMa opened 4 years ago

EdwardVincentMa commented 4 years ago

INFO:root:Epoch[24] Batch [520-540] Speed: 1727.27 samples/sec acc=0.983984 lossvalue=2.195642
INFO:root:Epoch[24] Batch [540-560] Speed: 1288.51 samples/sec acc=0.983203 lossvalue=2.237722
INFO:root:Epoch[24] Batch [560-580] Speed: 1296.01 samples/sec acc=0.983594 lossvalue=2.236295
INFO:root:Epoch[24] Batch [580-600] Speed: 1339.65 samples/sec acc=0.982617 lossvalue=2.273337
INFO:root:Epoch[24] Batch [600-620] Speed: 2007.56 samples/sec acc=0.983398 lossvalue=2.299879
INFO:root:Epoch[24] Batch [620-640] Speed: 1343.99 samples/sec acc=0.983789 lossvalue=2.221828
INFO:root:Epoch[24] Batch [640-660] Speed: 1290.73 samples/sec acc=0.983984 lossvalue=2.208789
INFO:root:Epoch[24] Batch [660-680] Speed: 1371.16 samples/sec acc=0.984375 lossvalue=2.210849
INFO:root:Epoch[24] Batch [680-700] Speed: 2012.76 samples/sec acc=0.983984 lossvalue=2.210747
INFO:root:Epoch[24] Batch [700-720] Speed: 1319.51 samples/sec acc=0.984180 lossvalue=2.203272
INFO:root:Epoch[24] Batch [720-740] Speed: 1287.94 samples/sec acc=0.984180 lossvalue=2.205525
INFO:root:Epoch[24] Batch [740-760] Speed: 1382.29 samples/sec acc=0.983984 lossvalue=2.197704
INFO:root:Epoch[24] Batch [760-780] Speed: 2022.44 samples/sec acc=0.983594 lossvalue=2.228923
INFO:root:Epoch[24] Batch [780-800] Speed: 1306.10 samples/sec acc=0.983984 lossvalue=2.219678
INFO:root:Epoch[24] Batch [800-820] Speed: 1290.28 samples/sec acc=0.984180 lossvalue=2.214591
INFO:root:Epoch[24] Batch [820-840] Speed: 1295.13 samples/sec acc=0.442383 lossvalue=nan
INFO:root:Epoch[24] Batch [840-860] Speed: 1689.81 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[24] Batch [860-880] Speed: 1493.23 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[24] Batch [880-900] Speed: 1303.18 samples/sec acc=0.000000 lossvalue=nan

INFO:root:Epoch[24] Batch [1760-1780] Speed: 1374.06 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[24] Batch [1780-1800] Speed: 1300.23 samples/sec acc=0.000195 lossvalue=nan
INFO:root:Epoch[24] Batch [1800-1820] Speed: 1303.87 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[24] Batch [1820-1840] Speed: 1606.54 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[24] Batch [1840-1860] Speed: 1593.05 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[24] Batch [1860-1880] Speed: 1305.64 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[24] Batch [1880-1900] Speed: 1301.60 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[24] Train-acc=0.000000
INFO:root:Epoch[24] Train-lossvalue=nan
INFO:root:Epoch[24] Time cost=363.141
call reset()
INFO:root:Epoch[25] Batch [0-20] Speed: 1585.96 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[25] Batch [20-40] Speed: 1304.34 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[25] Batch [40-60] Speed: 1302.62 samples/sec acc=0.000000 lossvalue=nan
INFO:root:Epoch[25] Batch [60-80] Speed: 1786.20 samples/sec acc=0.000195 lossvalue=nan
lr-batch-epoch: 1e-05 99 25
testing verification..
Traceback (most recent call last):
  File "train.py", line 377, in <module>
    main()
  File "train.py", line 374, in main
    train_net(args)
  File "train.py", line 369, in train_net
    epoch_end_callback = epoch_cb )
  File "/software/python-3.6/lib/python3.6/site-packages/mxnet/module/base_module.py", line 553, in fit
    callback(batch_end_params)
  File "train.py", line 305, in _batch_callback
    acc_list = ver_test(mbatch)
  File "train.py", line 274, in ver_test
    acc1, std1, acc2, std2, xnorm, embeddings_list = verification.test(ver_list[i], model, args.batch_size, 10, None, None)
  File "eval/verification.py", line 272, in test
    embeddings = sklearn.preprocessing.normalize(embeddings)
  File "/software/python-3.6/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1614, in normalize
    estimator='the normalize function', dtype=FLOAT_DTYPES)
  File "/software/python-3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/software/python-3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
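
In the log above, the loss turns to nan around batch 820 of epoch 24, yet training keeps running until the verification step fails inside sklearn.preprocessing.normalize. A minimal sketch of a guard that would stop at the first non-finite loss instead. The names check_loss and NonFiniteLossError are hypothetical and not part of train.py or MXNet; how the loss value is read out of the training loop is left as an assumption.

```python
import math


class NonFiniteLossError(RuntimeError):
    """Raised when the training loss is NaN or infinite (hypothetical helper)."""


def check_loss(loss_value, epoch, batch):
    """Return loss_value unchanged, or raise if it is NaN or +/-inf.

    Assumes loss_value is a plain Python float already extracted from
    the framework's metric; that extraction step is not shown here.
    """
    if not math.isfinite(loss_value):
        raise NonFiniteLossError(
            f"non-finite loss {loss_value!r} at epoch {epoch}, batch {batch}"
        )
    return loss_value
```

Called once per logging interval, this would fail fast at the first bad batch, which makes it easier to inspect the inputs and weights that produced it.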

EdwardVincentMa commented 4 years ago

I'm using mobilefacenet + arcloss.

LinearPi commented 4 years ago

May I ask what your machine's configuration is?

vaan2010 commented 4 years ago

Try lowering the margin m in arcloss and see if that helps.
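
For context, this is where m enters the ArcFace loss as it is usually written. The notation is taken from the standard ArcFace formulation rather than from this thread: s is the feature scale, y is the ground-truth class, and θ_j is the angle between the normalized embedding and the weight vector of class j.

```latex
L = -\log \frac{e^{s \cos(\theta_{y} + m)}}
               {e^{s \cos(\theta_{y} + m)} + \sum_{j \neq y} e^{s \cos \theta_{j}}}
```

A smaller m shifts the target angle less, so θ_y + m is less likely to go past π, where the cosine stops decreasing. Whether that is what causes the NaN in this run is not established in the thread.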

JasperRice commented 2 years ago

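I found that right before cosine.acos_() is applied, cosine already contains -1.0, and that leads to NaN values appearing in cosine. I don't know exactly how to fix it, though.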
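
One commonly used workaround for this kind of problem is to clamp cosine into the open interval (-1, 1) before taking the arccosine. The likely mechanism, stated here as an assumption rather than something confirmed in this thread: the derivative of acos(x) is -1/sqrt(1 - x^2), which is unbounded at x = ±1, and rounding can also push a value slightly outside [-1, 1]. Below is a framework-agnostic sketch using only the standard library. The function names and the eps value are illustrative, not taken from the insightface code.

```python
import math


def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def arc_margin_cosine(cosine, m=0.5, eps=1e-7):
    """Return cos(acos(cosine) + m), clamping cosine first.

    eps keeps the input strictly inside (-1, 1); its value is an
    assumption and may need tuning for float16/float32 training.
    """
    safe = clamp(cosine, -1.0 + eps, 1.0 - eps)
    theta = math.acos(safe)
    return math.cos(theta + m)


def acos_grad(x):
    """Derivative of acos at x: -1 / sqrt(1 - x^2). Unbounded as |x| -> 1."""
    return -1.0 / math.sqrt(1.0 - x * x)
```

With the clamp, an input of exactly -1.0 (or slightly below it) yields a finite angle and a finite gradient. In PyTorch the analogous step would be cosine.clamp_(-1 + eps, 1 - eps) before cosine.acos_().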