PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

det_mv3_db: abnormal accuracy with AMP training #6848

Closed liddk closed 1 year ago

liddk commented 2 years ago

Please provide the following information to quickly locate the problem

andyjiang1116 commented 2 years ago

Thanks for the feedback. We have logged this issue on our side and will reply promptly once it is fixed.

liddk commented 2 years ago

@andyjpaddle 4-GPU AMP training hits the same problem. Command:
python3 -u -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/det/det_mv3_db.yml -o Global.use_visualdl=True Global.use_amp=True Global.scale_loss=1024.0 Global.use_dynamic_loss_scaling=True Global.print_batch_step=1


[2022/08/22 05:53:21] ppocr INFO: epoch: [15/1200], global_step: 225, lr: 0.001000, loss: 5.017886, loss_shrink_maps: 3.560206, loss_threshold_maps: 0.923485, loss_binary_maps: 0.532439, avg_reader_cost: 4.36707 s, avg_batch_cost: 8.26320 s, avg_samples: 16.0, ips: 1.93630 samples/s, eta: 5:40:50
[2022/08/22 05:53:24] ppocr INFO: epoch: [15/1200], global_step: 226, lr: 0.001000, loss: 5.050126, loss_shrink_maps: 3.596053, loss_threshold_maps: 0.923485, loss_binary_maps: 0.532439, avg_reader_cost: 0.03259 s, avg_batch_cost: 2.84988 s, avg_samples: 16.0, ips: 5.61428 samples/s, eta: 5:43:18
Found inf or nan, current scale is: 4.1359030627651384e-25, decrease to: 4.1359030627651384e-25*0.5
[2022/08/22 05:53:26] ppocr INFO: epoch: [15/1200], global_step: 227, lr: 0.001000, loss: 5.017886, loss_shrink_maps: 3.560206, loss_threshold_maps: 0.923485, loss_binary_maps: 0.519505, avg_reader_cost: 1.81280 s, avg_batch_cost: 2.22363 s, avg_samples: 16.0, ips: 7.19544 samples/s, eta: 5:44:52
[2022/08/22 05:53:26] ppocr INFO: epoch: [15/1200], global_step: 228, lr: 0.001000, loss: 4.976384, loss_shrink_maps: 3.525608, loss_threshold_maps: 0.934624, loss_binary_maps: 0.505154, avg_reader_cost: 0.00012 s, avg_batch_cost: 0.39192 s, avg_samples: 16.0, ips: 40.82510 samples/s, eta: 5:43:53
Found inf or nan, current scale is: 2.0679515313825692e-25, decrease to: 2.0679515313825692e-25*0.5
[2022/08/22 05:53:27] ppocr INFO: epoch: [15/1200], global_step: 229, lr: 0.001000, loss: 4.976384, loss_shrink_maps: 3.525608, loss_threshold_maps: 0.923485, loss_binary_maps: 0.500928, avg_reader_cost: 0.00016 s, avg_batch_cost: 0.67728 s, avg_samples: 16.0, ips: 23.62399 samples/s, eta: 5:43:18
[2022/08/22 05:53:28] ppocr INFO: epoch: [15/1200], global_step: 230, lr: 0.001000, loss: 4.992502, loss_shrink_maps: 3.560206, loss_threshold_maps: 0.934624, loss_binary_maps: 0.505364, avg_reader_cost: 0.01024 s, avg_batch_cost: 0.31827 s, avg_samples: 16.0, ips: 50.27103 samples/s, eta: 5:42:13
Found inf or nan, current scale is: 1.0339757656912846e-25, decrease to: 1.0339757656912846e-25*0.5
[2022/08/22 05:53:28] ppocr INFO: epoch: [15/1200], global_step: 231, lr: 0.001000, loss: 5.019096, loss_shrink_maps: 3.588006, loss_threshold_maps: 0.934624, loss_binary_maps: 0.519505, avg_reader_cost: 0.00018 s, avg_batch_cost: 0.41058 s, avg_samples: 16.0, ips: 38.96882 samples/s, eta: 5:41:17
[2022/08/22 05:53:28] ppocr INFO: epoch: [15/1200], global_step: 232, lr: 0.001000, loss: 4.992502, loss_shrink_maps: 3.560206, loss_threshold_maps: 0.934624, loss_binary_maps: 0.519505, avg_reader_cost: 0.00028 s, avg_batch_cost: 0.18392 s, avg_samples: 16.0, ips: 86.99213 samples/s, eta: 5:40:03
Found inf or nan, current scale is: 5.169878828456423e-26, decrease to: 5.169878828456423e-26*0.5
[2022/08/22 05:53:29] ppocr INFO: epoch: [15/1200], global_step: 233, lr: 0.001000, loss: 5.019096, loss_shrink_maps: 3.588006, loss_threshold_maps: 0.938214, loss_binary_maps: 0.519505, avg_reader_cost: 0.00019 s, avg_batch_cost: 0.80283 s, avg_samples: 16.0, ips: 19.92946 samples/s, eta: 5:39:39
[2022/08/22 05:53:30] ppocr INFO: epoch: [15/1200], global_step: 234, lr: 0.001000, loss: 4.995037, loss_shrink_maps: 3.574678, loss_threshold_maps: 0.938214, loss_binary_maps: 0.511027, avg_reader_cost: 0.00014 s, avg_batch_cost: 0.77931 s, avg_samples: 16.0, ips: 20.53093 samples/s, eta: 5:39:14
Found inf or nan, current scale is: 2.5849394142282115e-26, decrease to: 2.5849394142282115e-26*0.5
[2022/08/22 05:53:31] ppocr INFO: epoch: [15/1200], global_step: 235, lr: 0.001000, loss: 5.026068, loss_shrink_maps: 3.602726, loss_threshold_maps: 0.938214, loss_binary_maps: 0.519245, avg_reader_cost: 0.09581 s, avg_batch_cost: 0.50636 s, avg_samples: 16.0, ips: 31.59786 samples/s, eta: 5:38:28
[2022/08/22 05:53:31] ppocr INFO: epoch: [15/1200], global_step: 236, lr: 0.001000, loss: 5.026068, loss_shrink_maps: 3.574678, loss_threshold_maps: 0.938214, loss_binary_maps: 0.519245, avg_reader_cost: 0.00014 s, avg_batch_cost: 0.17820 s, avg_samples: 16.0, ips: 89.78771 samples/s, eta: 5:37:15
Found inf or nan, current scale is: 1.2924697071141057e-26, decrease to: 1.2924697071141057e-26*0.5
[2022/08/22 05:53:31] ppocr INFO: epoch: [15/1200], global_step: 237, lr: 0.001000, loss: 5.055849, loss_shrink_maps: 3.602726, loss_threshold_maps: 0.938214, loss_binary_maps: 0.519245, avg_reader_cost: 0.00011 s, avg_batch_cost: 0.19194 s, avg_samples: 16.0, ips: 83.35977 samples/s, eta: 5:36:04
[2022/08/22 05:53:31] ppocr INFO: epoch: [15/1200], global_step: 238, lr: 0.001000, loss: 5.064687, loss_shrink_maps: 3.629196, loss_threshold_maps: 0.938214, loss_binary_maps: 0.527722, avg_reader_cost: 0.00014 s, avg_batch_cost: 0.18781 s, avg_samples: 16.0, ips: 85.19034 samples/s, eta: 5:34:53
Found inf or nan, current scale is: 6.462348535570529e-27, decrease to: 6.462348535570529e-27*0.5
[2022/08/22 05:53:31] ppocr INFO: epoch: [15/1200], global_step: 239, lr: 0.001000, loss: 5.055849, loss_shrink_maps: 3.602726, loss_threshold_maps: 0.932965, loss_binary_maps: 0.519245, avg_reader_cost: 0.00010 s, avg_batch_cost: 0.18466 s, avg_samples: 16.0, ips: 86.64496 samples/s, eta: 5:33:42
[2022/08/22 05:53:32] ppocr INFO: epoch: [15/1200], global_step: 240, lr: 0.001000, loss: 5.044233, loss_shrink_maps: 3.589090, loss_threshold_maps: 0.930141, loss_binary_maps: 0.519245, avg_reader_cost: 0.00009 s, avg_batch_cost: 0.13010 s, avg_samples: 10.0, ips: 76.86211 samples/s, eta: 5:32:28
[2022/08/22 05:53:32] ppocr INFO: save model in ./output/db_mv3/latest
Found inf or nan, current scale is: 3.2311742677852644e-27, decrease to: 3.2311742677852644e-27*0.5
[2022/08/22 05:53:40] ppocr INFO: epoch: [16/1200], global_step: 241, lr: 0.001000, loss: 5.036156, loss_shrink_maps: 3.589090, loss_threshold_maps: 0.930141, loss_binary_maps: 0.511027, avg_reader_cost: 7.09716 s, avg_batch_cost: 8.44124 s, avg_samples: 16.0, ips: 1.89546 samples/s, eta: 5:42:09
[2022/08/22 05:53:41] ppocr INFO: epoch: [16/1200], global_step: 242, lr: 0.001000, loss: 5.014528, loss_shrink_maps: 3.569089, loss_threshold_maps: 0.924251, loss_binary_maps: 0.508085, avg_reader_cost: 0.00038 s, avg_batch_cost: 0.59627 s, avg_samples: 16.0, ips: 26.83352 samples/s, eta: 5:41:29
Found inf or nan, current scale is: 1.6155871338926322e-27, decrease to: 1.6155871338926322e-27*0.5
[2022/08/22 05:53:43] ppocr INFO: epoch: [16/1200], global_step: 243, lr: 0.001000, loss: 4.995037, loss_shrink_maps: 3.589090, loss_threshold_maps: 0.916466, loss_binary_maps: 0.503859, avg_reader_cost: 0.03652 s, avg_batch_cost: 2.02414 s, avg_samples: 16.0, ips: 7.90457 samples/s, eta: 5:42:42
[2022/08/22 05:53:44] ppocr INFO: epoch: [16/1200], global_step: 244, lr: 0.001000, loss: 5.013463, loss_shrink_maps: 3.599968, loss_threshold_maps: 0.920357, loss_binary_maps: 0.503859, avg_reader_cost: 0.00017 s, avg_batch_cost: 0.45545 s, avg_samples: 16.0, ips: 35.13011 samples/s, eta: 5:41:52
Found inf or nan, current scale is: 8.077935669463161e-28, decrease to: 8.077935669463161e-28*0.5
[2022/08/22 05:53:44] ppocr INFO: epoch: [16/1200], global_step: 245, lr: 0.001000, loss: 5.013463, loss_shrink_maps: 3.599968, loss_threshold_maps: 0.924393, loss_binary_maps: 0.503859, avg_reader_cost: 0.00017 s, avg_batch_cost: 0.48138 s, avg_samples: 16.0, ips: 33.23804 samples/s, eta: 5:41:04
[2022/08/22 05:53:44] ppocr INFO: epoch: [16/1200], global_step: 246, lr: 0.001000, loss: 5.013463, loss_shrink_maps: 3.599968, loss_threshold_maps: 0.928286, loss_binary_maps: 0.508085, avg_reader_cost: 0.00014 s, avg_batch_cost: 0.17853 s, avg_samples: 16.0, ips: 89.62308 samples/s, eta: 5:39:54
Found inf or nan, current scale is: 4.0389678347315804e-28, decrease to: 4.0389678347315804e-28*0.5
[2022/08/22 05:53:45] ppocr INFO: epoch: [16/1200], global_step: 247, lr: 0.001000, loss: 5.013463, loss_shrink_maps: 3.599968, loss_threshold_maps: 0.924393, loss_binary_maps: 0.503859, avg_reader_cost: 0.00017 s, avg_batch_cost: 0.38103 s, avg_samples: 16.0, ips: 41.99167 samples/s, eta: 5:38:59
[2022/08/22 05:53:45] ppocr INFO: epoch: [16/1200], global_step: 248, lr: 0.001000, loss: 5.031628, loss_shrink_maps: 3.610502, loss_threshold_maps: 0.924393, loss_binary_maps: 0.509523, avg_reader_cost: 0.00018 s, avg_batch_cost: 0.27947 s, avg_samples: 16.0, ips: 57.25109 samples/s, eta: 5:37:58
Found inf or nan, current scale is: 2.0194839173657902e-28, decrease to: 2.0194839173657902e-28*0.5
[2022/08/22 05:53:46] ppocr INFO: epoch: [16/1200], global_step: 249, lr: 0.001000, loss: 5.031628, loss_shrink_maps: 3.610502, loss_threshold_maps: 0.924393, loss_binary_maps: 0.509523, avg_reader_cost: 0.04839 s, avg_batch_cost: 0.59589 s, avg_samples: 16.0, ips: 26.85039 samples/s, eta: 5:37:21
[2022/08/22 05:53:46] ppocr INFO: epoch: [16/1200], global_step: 250, lr: 0.001000, loss: 5.031628, loss_shrink_maps: 3.610502, loss_threshold_maps: 0.924393, loss_binary_maps: 0.509523, avg_reader_cost: 0.00014 s, avg_batch_cost: 0.31812 s, avg_samples: 16.0, ips: 50.29549 samples/s, eta: 5:36:23
Found inf or nan, current scale is: 1.0097419586828951e-28, decrease to: 1.0097419586828951e-28*0.5
[2022/08/22 05:53:47] ppocr INFO: epoch: [16/1200], global_step: 251, lr: 0.001000, loss: 5.013463, loss_shrink_maps: 3.594380, loss_threshold_maps: 0.917541, loss_binary_maps: 0.499983, avg_reader_cost: 0.04486 s, avg_batch_cost: 0.73142 s, avg_samples: 16.0, ips: 21.87528 samples/s, eta: 5:35:56
[2022/08/22 05:53:47] ppocr INFO: epoch: [16/1200], global_step: 252, lr: 0.001000, loss: 5.014532, loss_shrink_maps: 3.594380, loss_threshold_maps: 0.917541, loss_binary_maps: 0.491992, avg_reader_cost: 0.00012 s, avg_batch_cost: 0.18831 s, avg_samples: 16.0, ips: 84.96449 samples/s, eta: 5:34:49
Found inf or nan, current scale is: 5.048709793414476e-29, decrease to: 5.048709793414476e-29*0.5
[2022/08/22 05:53:48] ppocr INFO: epoch: [16/1200], global_step: 253, lr: 0.001000, loss: 4.997432, loss_shrink_maps: 3.569089, loss_threshold_maps: 0.917541, loss_binary_maps: 0.491992, avg_reader_cost: 0.00012 s, avg_batch_cost: 0.18704 s, avg_samples: 16.0, ips: 85.54381 samples/s, eta: 5:33:43
[2022/08/22 05:53:48] ppocr INFO: epoch: [16/1200], global_step: 254, lr: 0.001000, loss: 5.014532, loss_shrink_maps: 3.594380, loss_threshold_maps: 0.911827, loss_binary_maps: 0.491992, avg_reader_cost: 0.00010 s, avg_batch_cost: 0.18920 s, avg_samples: 16.0, ips: 84.56503 samples/s, eta: 5:32:37
Found inf or nan, current scale is: 2.524354896707238e-29, decrease to: 2.524354896707238e-29*0.5
[2022/08/22 05:53:48] ppocr INFO: epoch: [16/1200], global_step: 255, lr: 0.001000, loss: 4.968390, loss_shrink_maps: 3.555958, loss_threshold_maps: 0.911827, loss_binary_maps: 0.490204, avg_reader_cost: 0.00010 s, avg_batch_cost: 0.20171 s, avg_samples: 16.0, ips: 79.31993 samples/s, eta: 5:31:33
[2022/08/22 05:53:48] ppocr INFO: epoch: [16/1200], global_step: 256, lr: 0.001000, loss: 4.968390, loss_shrink_maps: 3.555958, loss_threshold_maps: 0.908511, loss_binary_maps: 0.490204, avg_reader_cost: 0.00008 s, avg_batch_cost: 0.13288 s, avg_samples: 10.0, ips: 75.25305 samples/s, eta: 5:30:24
[2022/08/22 05:53:48] ppocr INFO: save model in ./output/db_mv3/latest
Found inf or nan, current scale is: 1.262177448353619e-29, decrease to: 1.262177448353619e-29*0.5
[2022/08/22 05:53:58] ppocr INFO: epoch: [17/1200], global_step: 257, lr: 0.001000, loss: 4.919101, loss_shrink_maps: 3.525966, loss_threshold_maps: 0.908511, loss_binary_maps: 0.484178, avg_reader_cost: 7.46706 s, avg_batch_cost: 9.93894 s, avg_samples: 16.0, ips: 1.60983 samples/s, eta: 5:41:18
[2022/08/22 05:54:00] ppocr INFO: epoch: [17/1200], global_step: 258, lr: 0.001000, loss: 4.919101, loss_shrink_maps: 3.513160, loss_threshold_maps: 0.911827, loss_binary_maps: 0.484178, avg_reader_cost: 0.76209 s, avg_batch_cost: 1.93431 s, avg_samples: 16.0, ips: 8.27168 samples/s, eta: 5:42:20
Found inf or nan, current scale is: 6.310887241768095e-30, decrease to: 6.310887241768095e-30*0.5
[2022/08/22 05:54:01] ppocr INFO: epoch: [17/1200], global_step: 259, lr: 0.001000, loss: 4.958037, loss_shrink_maps: 3.525966, loss_threshold_maps: 0.917541, loss_binary_maps: 0.490204, avg_reader_cost: 0.00026 s, avg_batch_cost: 0.90969 s, avg_samples: 16.0, ips: 17.58848 samples/s, eta: 5:42:06
[2022/08/22 05:54:02] ppocr INFO: epoch: [17/1200], global_step: 260, lr: 0.001000, loss: 4.958037, loss_shrink_maps: 3.525966, loss_threshold_maps: 0.924393, loss_binary_maps: 0.490204, avg_reader_cost: 0.00015 s, avg_batch_cost: 0.52780 s, avg_samples: 16.0, ips: 30.31429 samples/s, eta: 5:41:25
Found inf or nan, current scale is: 3.1554436208840472e-30, decrease to: 3.1554436208840472e-30*0.5
[2022/08/22 05:54:02] ppocr INFO: epoch: [17/1200], global_step: 261, lr: 0.001000, loss: 4.958037, loss_shrink_maps: 3.525966, loss_threshold_maps: 0.929245, loss_binary_maps: 0.491992, avg_reader_cost: 0.00016 s, avg_batch_cost: 0.38202 s, avg_samples: 16.0, ips: 41.88283 samples/s, eta: 5:40:33
[2022/08/22 05:54:02] ppocr INFO: epoch: [17/1200], global_step: 262, lr: 0.001000, loss: 4.958037, loss_shrink_maps: 3.513160, loss_threshold_maps: 0.929245, loss_binary_maps: 0.491992, avg_reader_cost: 0.00017 s, avg_batch_cost: 0.28636 s, avg_samples: 16.0, ips: 55.87395 samples/s, eta: 5:39:34
Found inf or nan, current scale is: 1.5777218104420236e-30, decrease to: 1.5777218104420236e-30*0.5
[2022/08/22 05:54:03] ppocr INFO: epoch: [17/1200], global_step: 263, lr: 0.001000, loss: 4.988148, loss_shrink_maps: 3.513160, loss_threshold_maps: 0.931860, loss_binary_maps: 0.497072, avg_reader_cost: 0.00023 s, avg_batch_cost: 0.48528 s, avg_samples: 16.0, ips: 32.97056 samples/s, eta: 5:38:51
[2022/08/22 05:54:03] ppocr INFO: epoch: [17/1200], global_step: 264, lr: 0.001000, loss: 4.988148, loss_shrink_maps: 3.513160, loss_threshold_maps: 0.935539, loss_binary_maps: 0.502491, avg_reader_cost: 0.00015 s, avg_batch_cost: 0.17870 s, avg_samples: 16.0, ips: 89.53567 samples/s, eta: 5:37:45
Found inf or nan, current scale is: 7.888609052210118e-31, decrease to: 7.888609052210118e-31*0.5
[2022/08/22 05:54:04] ppocr INFO: epoch: [17/1200], global_step: 265, lr: 0.001000, loss: 4.988148, loss_shrink_maps: 3.507436, loss_threshold_maps: 0.935539, loss_binary_maps: 0.502491, avg_reader_cost: 0.00017 s, avg_batch_cost: 0.93466 s, avg_samples: 16.0, ips: 17.11847 samples/s, eta: 5:37:35
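
For readers following the log: the reported "current scale" halves after every second training step (4.1359030627651384e-25 after step 226, 2.0679515313825692e-25 after step 228, and so on), and the first value shown is exactly 2^-81, i.e. 91 halvings below the configured Global.scale_loss=1024.0 (2^10). The sketch below is a simplified, pure-Python model of a dynamic loss-scaling rule that would produce this pattern. It is not Paddle's implementation: the function name `simulate_loss_scale` is hypothetical, the parameter names and defaults are chosen for illustration, and treating every step as an overflow step is an assumption inferred from the log rather than something the log states.

```python
def simulate_loss_scale(init_scale, overflow_flags, decr_ratio=0.5,
                        decr_every_n_nan_or_inf=2, incr_ratio=2.0,
                        incr_every_n_steps=1000):
    """Return the loss scale after each step (hypothetical, simplified model).

    overflow_flags[i] is True when step i produced inf/nan gradients.
    """
    scale = init_scale
    good_steps = 0  # consecutive steps without inf/nan
    bad_steps = 0   # consecutive steps with inf/nan
    history = []
    for overflow in overflow_flags:
        if overflow:
            good_steps = 0
            bad_steps += 1
            if bad_steps == decr_every_n_nan_or_inf:
                scale *= decr_ratio  # shrink the scale after repeated overflow
                bad_steps = 0
        else:
            bad_steps = 0
            good_steps += 1
            if good_steps == incr_every_n_steps:
                scale *= incr_ratio  # grow the scale after a long clean run
                good_steps = 0
        history.append(scale)
    return history


# Assumption: every step overflows. 182 steps at one halving per two steps
# gives 91 halvings, so 1024.0 (2**10) ends at 2**-81.
history = simulate_loss_scale(1024.0, [True] * 182)
print(history[-1])
```

Under this model the scale never recovers while overflows keep occurring, so a scale this small suggests the inf/nan does not come from the scaling factor itself: shrinking the scale further cannot remove an overflow that originates elsewhere, for example in the forward pass.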

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.