bytedance / byteps

A high performance and generic framework for distributed DNN training

Training loss becomes NaN within the first ten batches #411

Open powermano opened 2 years ago

powermano commented 2 years ago

Environment: MXNet 1.8, fp16, ResNet-50, BytePS 0.2.5.post15. The training loss becomes NaN within the first few batches, as shown below. (If I lower the lr from 0.1 to 0.001 the NaN disappears, but then the loss does not seem to decrease.)

2021-08-12 05:44:02,294 Epoch[1] Batch[1] Speed: 2.38 samples/sec,  IDLoss=45.729,
2021-08-12 05:44:02,654 Epoch[1] Batch[2] Speed: 606.44 samples/sec,  IDLoss=46.012,
2021-08-12 05:44:03,550 Epoch[1] Batch[3] Speed: 142.99 samples/sec,  IDLoss=45.780,
2021-08-12 05:44:04,426 Epoch[1] Batch[4] Speed: 146.38 samples/sec,  IDLoss=60.220,
2021-08-12 05:44:05,311 Epoch[1] Batch[5] Speed: 229.28 samples/sec,  IDLoss=61.163,
2021-08-12 05:44:06,251 Epoch[1] Batch[6] Speed: 136.34 samples/sec,  IDLoss=70.405,
2021-08-12 05:44:07,141 Epoch[1] Batch[7] Speed: 143.86 samples/sec,  IDLoss=nan,
2021-08-12 05:44:07,971 Epoch[1] Batch[8] Speed: 156.27 samples/sec,  IDLoss=nan,
2021-08-12 05:44:08,882 Epoch[1] Batch[9] Speed: 140.62 samples/sec,  IDLoss=nan,

When I use fp32, the NaN disappears. Is there a problem with BytePS fp16?
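
For context, here is a minimal sketch of what the fp16 BytePS + Gluon setup looks like, including the usual fp16 safeguards (fp32 master weights via `multi_precision` and static loss scaling). The network constructor, data iterator, and the loss-scale value are illustrative placeholders, not my exact training script:

```python
# Sketch of an fp16 training setup with BytePS on MXNet Gluon.
# Assumptions: one GPU per worker; `train_iter` and the loss scale (128)
# are placeholders for illustration only.
import mxnet as mx
from mxnet import gluon, autograd
import byteps.mxnet as bps

bps.init()
ctx = mx.gpu(bps.local_rank())

net = gluon.model_zoo.vision.resnet50_v1(classes=1000)
net.initialize(mx.init.Xavier(), ctx=ctx)
net.cast('float16')  # run forward/backward in fp16

# multi_precision keeps an fp32 master copy of the weights inside SGD,
# the usual guard against fp16 rounding issues in the update.
optimizer_params = {'learning_rate': 0.1, 'momentum': 0.9,
                    'wd': 1e-4, 'multi_precision': True}
trainer = bps.DistributedTrainer(net.collect_params(), 'sgd', optimizer_params)
bps.broadcast_parameters(net.collect_params(), root_rank=0)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
loss_scale = 128.0  # static loss scaling (illustrative value)

for data, label in train_iter:  # train_iter is assumed to exist
    data = data.astype('float16').as_in_context(ctx)
    label = label.as_in_context(ctx)
    with autograd.record():
        out = net(data)
        loss = loss_fn(out, label) * loss_scale
    loss.backward()
    # step() rescales gradients by 1/arg, so folding the loss scale into
    # the argument undoes the scaling before the parameter update.
    trainer.step(data.shape[0] * loss_scale)
```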

powermano commented 2 years ago

@ymjiang @bobzhuyb

powermano commented 2 years ago

If I do not use BytePS, the same training code works fine.
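
To be concrete, "not using BytePS" here means running the same loop with the stock Gluon trainer, roughly as in this sketch (the names are the same placeholders as above):

```python
# Single-worker baseline without BytePS: only the trainer changes.
# `net` and `optimizer_params` are the same placeholders as in the
# sketch above; the forward/backward/step loop is identical.
from mxnet import gluon

trainer = gluon.Trainer(net.collect_params(), 'sgd', optimizer_params)
```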