With micro_batch_size=1 and global_batch_size=8, the throughput seems stable:
iteration 10/ 1000 | consumed samples: 80 | elapsed time per iteration (ms): 4196.4 | learning rate: 1.000E-04 | global batch size: 8 | loss scale: 8388608.0 | number of skipped iterations: 10 | number of nan iterations: 0 |
iteration 20/ 1000 | consumed samples: 160 | elapsed time per iteration (ms): 3997.8 | learning rate: 1.000E-04 | global batch size: 8 | lm loss: 1.173781E+01 | loss scale: 32768.0 | grad norm: 84.084 | number of skipped iterations: 8 | number of nan iterations: 0 |
iteration 30/ 1000 | consumed samples: 240 | elapsed time per iteration (ms): 3892.9 | learning rate: 1.000E-04 | global batch size: 8 | lm loss: 1.171586E+01 | loss scale: 16384.0 | grad norm: 31.831 | number of skipped iterations: 1 | number of nan iterations: 0 |
iteration 40/ 1000 | consumed samples: 320 | elapsed time per iteration (ms): 3827.2 | learning rate: 1.000E-04 | global batch size: 8 | lm loss: 7.473515E+00 | loss scale: 16384.0 | grad norm: 7.100 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 50/ 1000 | consumed samples: 400 | elapsed time per iteration (ms): 3738.2 | learning rate: 1.000E-04 | global batch size: 8 | lm loss: 6.682758E+00 | loss scale: 16384.0 | grad norm: 8.118 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 60/ 1000 | consumed samples: 480 | elapsed time per iteration (ms): 3853.5 | learning rate: 1.000E-04 | global batch size: 8 | lm loss: 6.173346E+00 | loss scale: 16384.0 | grad norm: 5.354 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 70/ 1000 | consumed samples: 560 | elapsed time per iteration (ms): 3822.8 | learning rate: 1.000E-04 | global batch size: 8 | lm loss: 5.929058E+00 | loss scale: 16384.0 | grad norm: 7.846 | number of skipped iterations: 0 | number of nan iterations: 0 |
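For what it's worth, here is how I understand the two batch sizes relate in Megatron-LM (a sketch, not Megatron source code; data_parallel_size=1 below is an assumption, since I have not stated the GPU count above):

    # Sketch of the batch-size relationship (assumptions noted inline).
    micro_batch_size = 1
    global_batch_size = 8
    data_parallel_size = 1  # assumed; depends on the actual GPU count

    # Micro-batches accumulated per optimizer step:
    num_micro_batches = global_batch_size // (micro_batch_size * data_parallel_size)
    print(num_micro_batches)  # 8 forward/backward passes per logged iteration

Note that the second run below (micro_batch_size=2, global_batch_size=16) gives the same 8 micro-batches per iteration, just with each micro-batch twice as large.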
But if I change micro_batch_size to 2 and global_batch_size to 16, the per-iteration time keeps growing (from about 4.8 s at iteration 10 to about 9.5 s at iteration 70), i.e. the throughput steadily drops during training:
iteration 10/ 1000 | consumed samples: 160 | elapsed time per iteration (ms): 4847.7 | learning rate: 1.000E-04 | global batch size: 16 | loss scale: 8388608.0 | number of skipped iterations: 10 | number of nan iterations: 0 |
iteration 20/ 1000 | consumed samples: 320 | elapsed time per iteration (ms): 5375.5 | learning rate: 1.000E-04 | global batch size: 16 | lm loss: 1.031812E+01 | loss scale: 16384.0 | number of skipped iterations: 9 | number of nan iterations: 0 |
iteration 30/ 1000 | consumed samples: 480 | elapsed time per iteration (ms): 6150.0 | learning rate: 1.000E-04 | global batch size: 16 | lm loss: 1.184227E+01 | loss scale: 16384.0 | grad norm: 16.206 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 40/ 1000 | consumed samples: 640 | elapsed time per iteration (ms): 7027.5 | learning rate: 1.000E-04 | global batch size: 16 | lm loss: 7.400169E+00 | loss scale: 16384.0 | grad norm: 7.939 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 50/ 1000 | consumed samples: 800 | elapsed time per iteration (ms): 7873.5 | learning rate: 1.000E-04 | global batch size: 16 | lm loss: 6.508469E+00 | loss scale: 16384.0 | grad norm: 4.846 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 60/ 1000 | consumed samples: 960 | elapsed time per iteration (ms): 8804.5 | learning rate: 1.000E-04 | global batch size: 16 | lm loss: 6.107175E+00 | loss scale: 16384.0 | grad norm: 5.190 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 70/ 1000 | consumed samples: 1120 | elapsed time per iteration (ms): 9476.2 | learning rate: 1.000E-04 | global batch size: 16 | lm loss: 5.975057E+00 | loss scale: 16384.0 | grad norm: 3.755 | number of skipped iterations: 0 | number of nan iterations: 0 |
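A quick sanity check computed directly from the elapsed times logged above shows the throughput falling rather than settling:

    # Throughput (samples/sec) derived from the elapsed times logged above.
    for it, ms in [(10, 4847.7), (40, 7027.5), (70, 9476.2)]:
        print(f"iter {it}: {16 / (ms / 1000.0):.2f} samples/s")
    # iter 10: 3.30, iter 40: 2.28, iter 70: 1.69 -- steadily dropping,
    # while the first run holds roughly 8 / 3.8 s ~= 2.1 samples/s throughout.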
It seems that the forward and backward computation itself grows slower over time. Please take a look. Thanks!
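If it helps narrow things down, this is the kind of timing I could add around the step (a minimal sketch with placeholder names, not Megatron internals):

    import time
    import torch

    def timed(fn, *args, **kwargs):
        # CUDA launches are asynchronous; synchronize so the measured
        # interval covers the actual GPU work.
        torch.cuda.synchronize()
        start = time.perf_counter()
        out = fn(*args, **kwargs)
        torch.cuda.synchronize()
        return out, time.perf_counter() - start

    # Inside the training loop (model, batch, loss_fn are placeholders):
    #   out, fwd_s = timed(model, batch)
    #   loss = loss_fn(out)
    #   _, bwd_s = timed(loss.backward)
    # Logging fwd_s and bwd_s per iteration would show whether the growth
    # is really in forward/backward or somewhere else (e.g. the data loader).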