@eric-weiss-zyphra discovered that upstream Megatron-LM still uses the old dataloader scheme (unlike gpt-neox), which leads to overflow errors like:
File "torch/utils/data/_utils/collate.py", line 141, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [2049] at entry 0 and [198705720] at entry 16
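For context on where the mismatched tensor size likely comes from: my assumption (consistent with the error above, though the exact symbols in Megatron-LM may differ) is that the old scheme accumulates token offsets in 32-bit integers, so once the corpus exceeds 2**31 - 1 tokens the offsets wrap around and a garbage-sized sample gets sliced out. A quick numpy illustration of the wrap-around, not actual Megatron-LM code:

```python
import numpy as np

# Illustration only. With a corpus of ~2.25B tokens, 32-bit cumulative
# offsets silently wrap past 2**31 - 1, so the "length" of a later sample
# (next_offset - offset) becomes a huge or negative number.
doc_lengths = np.full(1_100_000, 2049, dtype=np.int32)  # 1.1M docs x 2049 tokens

offsets_32 = np.cumsum(doc_lengths, dtype=np.int32)  # overflows silently
offsets_64 = np.cumsum(doc_lengths, dtype=np.int64)  # correct

print(offsets_32[-1])  # wrapped-around (negative) value
print(offsets_64[-1])  # 2253900000, the real token count
```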
I created a fix for this a while back in https://github.com/EleutherAI/gpt-neox/pull/835. We should apply it to Megatron-LM, verify that it works, and then contribute it back upstream.
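If the Megatron-LM change ends up being the same class of fix as the gpt-neox PR, the essence is to build and store the sample index with 64-bit offsets. A minimal sketch of that idea (function and variable names are illustrative, not the actual Megatron-LM/gpt-neox symbols):

```python
import numpy as np

def build_sample_index(doc_lengths, seq_length, dtype=np.int64):
    """Map sample i -> (document id, token offset) where sample i starts.

    Sketch only: the key point is dtype=np.int64 (the old scheme effectively
    used 32-bit offsets), so cumulative token positions never wrap on
    multi-billion-token corpora.
    """
    doc_lengths = np.asarray(doc_lengths, dtype=np.int64)
    total_tokens = int(doc_lengths.sum(dtype=np.int64))
    num_samples = total_tokens // seq_length

    sample_index = np.zeros((num_samples + 1, 2), dtype=dtype)
    doc_id, doc_offset = 0, 0
    for i in range(1, num_samples + 1):
        # Consume seq_length tokens, spilling across document boundaries.
        remaining = seq_length
        while remaining > 0:
            available = int(doc_lengths[doc_id]) - doc_offset
            take = min(available, remaining)
            remaining -= take
            doc_offset += take
            if doc_offset == doc_lengths[doc_id] and doc_id + 1 < len(doc_lengths):
                doc_id, doc_offset = doc_id + 1, 0
        sample_index[i] = (doc_id, doc_offset)
    return sample_index
```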