[BUG] Dataloader Overflow Errors

@eric-weiss-zyphra discovered that upstream Megatron-LM, unlike gpt-neox, is still on the old dataloader scheme, which leads to overflow errors like:

File "torch/utils/data/_utils/collate.py", line 141, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [2049] at entry 0 and [198705720] at entry 16
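
To confirm that a run is hitting this, here is a minimal sketch of a check that lists which samples deviate from the expected length (2049 in the traceback above) before they reach `default_collate`. `find_mismatched_samples` is a hypothetical helper written for this issue, not part of Megatron-LM or gpt-neox, and it assumes each sample is a sequence that supports `len()`.

```python
from typing import Iterable, List, Sequence, Tuple


def find_mismatched_samples(
    samples: Iterable[Sequence[int]], expected_len: int
) -> List[Tuple[int, int]]:
    """Return (index, length) for every sample whose length differs from expected_len."""
    return [
        (i, len(sample))
        for i, sample in enumerate(samples)
        if len(sample) != expected_len
    ]


if __name__ == "__main__":
    # Toy stand-in for a batch: three well-formed samples and one oversized one.
    batch = [[0] * 2049, [0] * 2049, [0] * 4000, [0] * 2049]
    print(find_mismatched_samples(batch, expected_len=2049))  # prints [(2, 4000)]
```

This only reports the symptom (a sample of the wrong length). It does not fix the index construction that produces it.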

I created a fix for this a while back in https://github.com/EleutherAI/gpt-neox/pull/835. We should apply it to Megatron-LM, test that it works, and then contribute it back upstream.

Zyphra / Megatron-LM