RaymondLi0 opened this issue 1 year ago (status: Open)
With micro-batch-size=2, global-batch-size=192, and a 1B-model configuration, the UL2 training script reports:
```
forward-backward ...............................: (5164.84, 5177.15)
forward-compute ................................: (2584.92, 2735.66)
backward-compute ...............................: (2423.13, 2586.40)
batch-generator ................................: (408.98, 817.78)   <---
data-iterator ..................................: (3.90, 43.10)
broadcast-data .................................: (395.41, 776.42)   <---
layernorm-grads-all-reduce .....................: (0.02, 0.03)
embedding-grads-all-reduce .....................: (0.03, 0.04)
grads-all-reduce ...............................: (193.33, 193.63)
optimizer-copy-to-main-grad ....................: (10.61, 10.67)
optimizer-unscale-and-check-inf ................: (36.12, 36.37)
optimizer-clip-main-grad .......................: (2.90, 3.00)
optimizer-count-zeros ..........................: (0.00, 0.01)
optimizer-inner-step ...........................: (17.63, 17.73)
optimizer-copy-main-to-model-params ............: (4.63, 4.72)
optimizer ......................................: (72.92, 73.13)
```
By comparison, the GPT training script with the same configuration has roughly half the forward time, and the difference comes almost entirely from its near-zero broadcast-data time. https://github.com/bigcode-project/Megatron-LM/blob/ul2-merge/pretrain_ul2.py#L90
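To help narrow down whether the extra time is raw NCCL communication or CPU-side work on rank 0 before the broadcast, a standalone micro-benchmark could time a plain torch.distributed.broadcast of a batch with the same micro-batch shape. This is a hypothetical sketch, not code from the ul2-merge branch: the sequence length (2048), vocabulary size, and single-tensor payload are assumptions, whereas the real get_batch in pretrain_ul2.py broadcasts several keys at once through Megatron's broadcast_data helper.

```python
# Hypothetical micro-benchmark: how long does broadcasting a micro-batch of
# tokens actually take, independent of the UL2 data pipeline?
import os
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # micro-batch-size=2; seq-length=2048 and vocab size are assumptions,
    # not values taken from the issue.
    tokens = torch.randint(0, 49152, (2, 2048), dtype=torch.int64, device="cuda")

    # Warm-up broadcasts so NCCL setup cost is excluded from the measurement.
    for _ in range(5):
        dist.broadcast(tokens, src=0)
    torch.cuda.synchronize()

    # Timed broadcasts from rank 0.
    start = time.time()
    for _ in range(20):
        dist.broadcast(tokens, src=0)
    torch.cuda.synchronize()

    if rank == 0:
        avg_ms = (time.time() - start) / 20 * 1000
        print(f"avg broadcast time: {avg_ms:.2f} ms")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run with e.g. `torchrun --nproc_per_node=8 bench_broadcast.py` (script name is arbitrary). If the raw broadcast is fast, the time charged to broadcast-data more likely reflects how the UL2 batch is built and packed on rank 0 before broadcasting rather than the communication itself.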