NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

T5 pretraining gets progressively slower when micro batch size > 1 #184

Open cryoco opened 2 years ago

cryoco commented 2 years ago
 iteration       10/    1000 | consumed samples:          160 | elapsed time per iteration (ms): 4847.7 | learning rate: 1.000E-04 | global batch size:    16 | loss scale: 8388608.0 | number of skipped iterations:  10 | number of nan iterations:   0 |
 iteration       20/    1000 | consumed samples:          320 | elapsed time per iteration (ms): 5375.5 | learning rate: 1.000E-04 | global batch size:    16 | lm loss: 1.031812E+01 | loss scale: 16384.0 | number of skipped iterations:   9 | number of nan iterations:   0 |
 iteration       30/    1000 | consumed samples:          480 | elapsed time per iteration (ms): 6150.0 | learning rate: 1.000E-04 | global batch size:    16 | lm loss: 1.184227E+01 | loss scale: 16384.0 | grad norm: 16.206 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       40/    1000 | consumed samples:          640 | elapsed time per iteration (ms): 7027.5 | learning rate: 1.000E-04 | global batch size:    16 | lm loss: 7.400169E+00 | loss scale: 16384.0 | grad norm: 7.939 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       50/    1000 | consumed samples:          800 | elapsed time per iteration (ms): 7873.5 | learning rate: 1.000E-04 | global batch size:    16 | lm loss: 6.508469E+00 | loss scale: 16384.0 | grad norm: 4.846 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       60/    1000 | consumed samples:          960 | elapsed time per iteration (ms): 8804.5 | learning rate: 1.000E-04 | global batch size:    16 | lm loss: 6.107175E+00 | loss scale: 16384.0 | grad norm: 5.190 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration       70/    1000 | consumed samples:         1120 | elapsed time per iteration (ms): 9476.2 | learning rate: 1.000E-04 | global batch size:    16 | lm loss: 5.975057E+00 | loss scale: 16384.0 | grad norm: 3.755 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | model-and-optimizer-setup: 1736.32 | train/valid/test-data-iterators-setup: 2674.95
time (ms) | forward-compute: 604.56 | backward-compute: 701.16 | backward-params-all-reduce: 2672.65 | backward-embedding-all-reduce: 0.13 | optimizer-copy-to-main-grad: 21.18 | optimizer-unscale-and-check-inf: 835.71 | optimizer: 857.27
time (ms) | forward-compute: 864.01 | backward-compute: 971.21 | backward-params-all-reduce: 2805.27 | backward-embedding-all-reduce: 0.09 | optimizer-copy-to-main-grad: 12.70 | optimizer-unscale-and-check-inf: 498.17 | optimizer-clip-main-grad: 1.55 | optimizer-copy-main-to-model-params: 1.47 | optimizer: 724.70
time (ms) | forward-compute: 1304.65 | backward-compute: 1428.05 | backward-params-all-reduce: 3067.17 | backward-embedding-all-reduce: 0.04 | optimizer-copy-to-main-grad: 10.33 | optimizer-unscale-and-check-inf: 277.17 | optimizer-clip-main-grad: 15.31 | optimizer-copy-main-to-model-params: 10.97 | optimizer: 341.30
time (ms) | forward-compute: 1664.07 | backward-compute: 1881.06 | backward-params-all-reduce: 3248.30 | backward-embedding-all-reduce: 0.05 | optimizer-copy-to-main-grad: 10.96 | optimizer-unscale-and-check-inf: 161.09 | optimizer-clip-main-grad: 15.23 | optimizer-copy-main-to-model-params: 10.95 | optimizer: 225.78
time (ms) | forward-compute: 2120.17 | backward-compute: 2327.50 | backward-params-all-reduce: 3114.42 | backward-embedding-all-reduce: 0.04 | optimizer-copy-to-main-grad: 10.13 | optimizer-unscale-and-check-inf: 238.10 | optimizer-clip-main-grad: 15.77 | optimizer-copy-main-to-model-params: 10.94 | optimizer: 302.45
time (ms) | forward-compute: 2709.59 | backward-compute: 2703.16 | backward-params-all-reduce: 2872.23 | backward-embedding-all-reduce: 0.04 | optimizer-copy-to-main-grad: 30.98 | optimizer-unscale-and-check-inf: 426.58 | optimizer-clip-main-grad: 15.80 | optimizer-copy-main-to-model-params: 10.95 | optimizer: 511.77
time (ms) | forward-compute: 3070.24 | backward-compute: 3031.89 | backward-params-all-reduce: 2533.92 | backward-embedding-all-reduce: 0.06 | optimizer-copy-to-main-grad: 10.99 | optimizer-unscale-and-check-inf: 765.58 | optimizer-clip-main-grad: 15.78 | optimizer-copy-main-to-model-params: 11.08 | optimizer: 831.14

It seems that the forward and backward compute times keep growing with each iteration. Please take a look. Thanks!
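For reference, a minimal sketch (not from the original report) that just parses the per-iteration times quoted above quantifies the slowdown: elapsed time per iteration roughly doubles between iteration 10 and 70 (about 4.85 s to 9.48 s), and the component timers show the growth concentrated in forward-compute (~605 ms to ~3070 ms) and backward-compute (~701 ms to ~3032 ms), while backward-params-all-reduce stays roughly flat. The values below are copied from the log; the regex assumes the exact log format shown in this issue.

```python
import re

# Elapsed-time-per-iteration values copied from the log above (iterations 10-70).
log = """\
iteration 10 | elapsed time per iteration (ms): 4847.7
iteration 20 | elapsed time per iteration (ms): 5375.5
iteration 30 | elapsed time per iteration (ms): 6150.0
iteration 40 | elapsed time per iteration (ms): 7027.5
iteration 50 | elapsed time per iteration (ms): 7873.5
iteration 60 | elapsed time per iteration (ms): 8804.5
iteration 70 | elapsed time per iteration (ms): 9476.2
"""

# The same regex also works on the raw Megatron log lines as printed above.
times = [float(x) for x in re.findall(r"elapsed time per iteration \(ms\): ([\d.]+)", log)]
deltas = [round(b - a, 1) for a, b in zip(times, times[1:])]

print("per-iteration time (ms):        ", times)
print("growth every 10 iterations (ms):", deltas)
print("overall: %.2fx slower between iteration 10 and 70" % (times[-1] / times[0]))
```

Running this prints an increase of roughly 500-900 ms per 10 iterations, about 1.95x overall, which matches the trend visible in the raw log.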

lengstrom commented 2 years ago

Also having this issue -- were you able to fix this problem?

github-actions[bot] commented 1 year ago

Marking as stale. No activity in 60 days. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 10 months ago

Marking as stale. No activity in 60 days.