Open ramdhan1989 opened 1 year ago
Hi, do you have any suggestions for overcoming this problem during training? The loss becomes NaN from epoch 7 onward:
Epoch gpu_mem box obj cls dgi total targets img_size
0/199 11G 0.1279 0.01601 0 0.008378 2.849 6 512: 100%|█| 1800/1800 [14:48<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [01:07<00
30.39782691001892
all 2.75e+03 4.51e+03 0 0 5.13e-06 9.81e-07
Epoch gpu_mem box obj cls dgi total targets img_size
1/199 11G 0.1261 0.01524 0 0.005636 2.846 6 512: 100%|█| 1800/1800 [14:03<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [01:01<00
31.465840816497803
all 2.75e+03 4.51e+03 0 0 3.55e-06 6.64e-07
Epoch gpu_mem box obj cls dgi total targets img_size
2/199 11G 0.1214 0.01546 0 0.005382 2.844 14 512: 100%|█| 1800/1800 [13:46<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [01:03<00
32.228920221328735
all 2.75e+03 4.51e+03 0.321 0.297 0.194 0.0497
Epoch gpu_mem box obj cls dgi total targets img_size
3/199 11G 0.1142 0.01436 0 0.005227 2.839 20 512: 100%|█| 1800/1800 [13:39<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:58<00
28.6451997756958
all 2.75e+03 4.51e+03 0.316 0.485 0.345 0.0999
Epoch gpu_mem box obj cls dgi total targets img_size
4/199 11G 0.09978 0.01415 0 0.005147 2.832 7 512: 100%|█| 1800/1800 [13:23<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:57<00
28.444270849227905
all 2.75e+03 4.51e+03 0.408 0.578 0.472 0.167
Epoch gpu_mem box obj cls dgi total targets img_size
5/199 11G 0.09265 0.01457 0 0.005125 2.829 5 512: 100%|█| 1800/1800 [13:32<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [01:02<00
30.84639859199524
all 2.75e+03 4.51e+03 0.399 0.623 0.507 0.161
Epoch gpu_mem box obj cls dgi total targets img_size
6/199 11G 0.08306 0.01727 0 0.005281 2.825 10 512: 100%|█| 1800/1800 [13:44<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [01:01<00
30.013824462890625
all 2.75e+03 4.51e+03 0.285 0.589 0.453 0.145
Epoch gpu_mem box obj cls dgi total targets img_size
7/199 11G nan nan 0 0.005711 nan 6 512: 100%|█| 1800/1800 [13:36<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:51<00
31.282738208770752
all 2.75e+03 4.51e+03 0 0 1.57e-06 1.74e-07
Epoch gpu_mem box obj cls dgi total targets img_size
8/199 11G nan nan 0 nan nan 10 512: 100%|█| 1800/1800 [13:31<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:49<00
32.83151125907898
all 2.75e+03 0 0 0 0 0
Epoch gpu_mem box obj cls dgi total targets img_size
9/199 11G nan nan 0 nan nan 9 512: 100%|█| 1800/1800 [13:20<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:45<00
29.580291509628296
all 2.75e+03 0 0 0 0 0
Epoch gpu_mem box obj cls dgi total targets img_size
10/199 11G nan nan 0 nan nan 4 512: 100%|█| 1800/1800 [13:25<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:48<00
32.03327965736389
all 2.75e+03 0 0 0 0 0
Epoch gpu_mem box obj cls dgi total targets img_size
11/199 11G nan nan 0 nan nan 9 512: 100%|█| 1800/1800 [13:28<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:47<00
30.341226816177368
all 2.75e+03 0 0 0 0 0
Epoch gpu_mem box obj cls dgi total targets img_size
12/199 11G nan nan 0 nan nan 2 512: 100%|█| 1800/1800 [13:11<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:45<00
29.359901189804077
all 2.75e+03 0 0 0 0 0
Epoch gpu_mem box obj cls dgi total targets img_size
13/199 11G nan nan 0 nan nan 13 512: 100%|█| 1800/1800 [13:05<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:45<00
29.436581134796143
all 2.75e+03 0 0 0 0 0
Epoch gpu_mem box obj cls dgi total targets img_size
14/199 11G nan nan 0 nan nan 7 512: 100%|█| 1800/1800 [13:04<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:45<00
29.631073713302612
all 2.75e+03 0 0 0 0 0
Epoch gpu_mem box obj cls dgi total targets img_size
15/199 11G nan nan 0 nan nan 6 512: 100%|█| 1800/1800 [13:08<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:45<00
29.1485652923584
all 2.75e+03 0 0 0 0 0
Epoch gpu_mem box obj cls dgi total targets img_size
16/199 11G nan nan 0 nan nan 18 512: 100%|█| 1800/1800 [13:14<00
Class Images Targets P R mAP@.5 mAP@.5:.95: 100%|█| 344/344 [00:46<00
29.673731088638306
all 2.75e+03 0 0 0 0 0
There seems to be a gradient explosion (or something else) that leads to a NaN loss value. What about turning down the learning rate, or clipping the gradients before optimizer.step()?
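To make the clipping suggestion concrete, here is a minimal framework-free sketch of global-norm gradient clipping combined with skipping any update whose gradients are not finite. The function names (`clip_by_global_norm`, `sgd_step`) and the plain-list representation of parameters are hypothetical and chosen only for illustration; they are not taken from this repository's training script. In PyTorch the usual equivalent is to call `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` after `loss.backward()` and before `optimizer.step()`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their combined L2 norm is at most max_norm.

    Returns (clipped_grads, total_norm). clipped_grads is None when the norm is
    NaN or infinite, which signals that the update should be skipped.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if not math.isfinite(total_norm):
        return None, total_norm
    # The scale is capped at 1.0, so gradients that are already small are left unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in vec] for vec in grads], total_norm


def sgd_step(params, grads, lr, max_norm):
    """Plain SGD update with clipping (hypothetical helper).

    Returns (new_params, applied). When the gradients are not finite the
    parameters are returned untouched and applied is False.
    """
    clipped, _ = clip_by_global_norm(grads, max_norm)
    if clipped is None:
        return params, False
    new_params = [
        [p - lr * g for p, g in zip(p_vec, g_vec)]
        for p_vec, g_vec in zip(params, clipped)
    ]
    return new_params, True
```

Skipping the non-finite step matters because, as the log shows, once a NaN reaches the weights every later epoch stays at NaN. The value of `max_norm` and the learning rate would still need tuning for this dataset.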