RaymondLi0 opened this issue 1 year ago (status: Open)
With micro-batch-size=2, global-batch-size=192, and a 1B-model configuration, the UL2 training script reports:
```
forward-backward ...............................: (5164.84, 5177.15)
forward-compute ................................: (2584.92, 2735.66)
backward-compute ...............................: (2423.13, 2586.40)
batch-generator ................................: (408.98, 817.78)   <---
data-iterator ..................................: (3.90, 43.10)
broadcast-data .................................: (395.41, 776.42)   <---
layernorm-grads-all-reduce .....................: (0.02, 0.03)
embedding-grads-all-reduce .....................: (0.03, 0.04)
grads-all-reduce ...............................: (193.33, 193.63)
optimizer-copy-to-main-grad ....................: (10.61, 10.67)
optimizer-unscale-and-check-inf ................: (36.12, 36.37)
optimizer-clip-main-grad .......................: (2.90, 3.00)
optimizer-count-zeros ..........................: (0.00, 0.01)
optimizer-inner-step ...........................: (17.63, 17.73)
optimizer-copy-main-to-model-params ............: (4.63, 4.72)
optimizer ......................................: (72.92, 73.13)
```
By comparison, the GPT training script with the same configuration has roughly half the forward time, and the difference comes almost entirely from its near-zero broadcast-data time. https://github.com/bigcode-project/Megatron-LM/blob/ul2-merge/pretrain_ul2.py#L90
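To help narrow down whether the extra time is raw NCCL communication or CPU-side work on rank 0 before the broadcast, a standalone micro-benchmark could time a plain torch.distributed.broadcast of a batch with the same micro-batch shape. This is a hypothetical sketch, not code from the ul2-merge branch: the sequence length (2048), vocabulary size, and single-tensor payload are assumptions, whereas the real get_batch in pretrain_ul2.py broadcasts several keys at once through Megatron's broadcast_data helper.

```python
# Hypothetical micro-benchmark: how long does broadcasting a micro-batch of
# tokens actually take, independent of the UL2 data pipeline?
import os
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # micro-batch-size=2; seq-length=2048 and vocab size are assumptions,
    # not values taken from the issue.
    tokens = torch.randint(0, 49152, (2, 2048), dtype=torch.int64, device="cuda")

    # Warm-up broadcasts so NCCL setup cost is excluded from the measurement.
    for _ in range(5):
        dist.broadcast(tokens, src=0)
    torch.cuda.synchronize()

    # Timed broadcasts from rank 0.
    start = time.time()
    for _ in range(20):
        dist.broadcast(tokens, src=0)
    torch.cuda.synchronize()

    if rank == 0:
        avg_ms = (time.time() - start) / 20 * 1000
        print(f"avg broadcast time: {avg_ms:.2f} ms")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run with e.g. `torchrun --nproc_per_node=8 bench_broadcast.py` (script name is arbitrary). If the raw broadcast is fast, the time charged to broadcast-data more likely reflects how the UL2 batch is built and packed on rank 0 before broadcasting rather than the communication itself.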