Zyphra / Megatron-LM

Ongoing research training transformer models at scale

[BUG] Dataloader Overflow Errors #3

Open Quentin-Anthony opened 1 year ago

Quentin-Anthony commented 1 year ago

@eric-weiss-zyphra discovered that upstream Megatron-LM still uses the old dataloader scheme (unlike gpt-neox), which leads to overflow errors like:

File "torch/utils/data/_utils/collate.py", line 141, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [2049] at entry 0 and [198705720] at entry 16
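
For anyone trying to reproduce this, a pre-collate length check can make the offending sample easier to spot than the `torch.stack` error does. This is only a rough sketch, not part of Megatron-LM or gpt-neox: `find_length_mismatches` and `check_batch` are hypothetical helper names, and the expected length of 2049 is taken from the traceback above. It is pure Python, so it works on anything with a `len()`.

```python
from typing import List, Sequence, Tuple


def find_length_mismatches(batch: Sequence[Sequence[int]],
                           expected_len: int) -> List[Tuple[int, int]]:
    """Return (index, length) for every sample whose length differs from expected_len."""
    return [(i, len(sample)) for i, sample in enumerate(batch)
            if len(sample) != expected_len]


def check_batch(batch: Sequence[Sequence[int]], expected_len: int):
    """Raise a descriptive error before collation if any sample has the wrong length.

    Hypothetical helper: call it on the list of samples before stacking them.
    """
    bad = find_length_mismatches(batch, expected_len)
    if bad:
        details = ", ".join(f"entry {i} has length {n}" for i, n in bad)
        raise ValueError(
            f"expected every sample to have length {expected_len}, but {details}"
        )
    return batch


# Mimic the shapes from the traceback; range objects report a length
# without allocating 198705720 elements.
example_batch = [range(2049)] * 16 + [range(198705720)]
mismatches = find_length_mismatches(example_batch, expected_len=2049)
```

Here `mismatches` holds one entry pointing at index 16, matching the entry reported in the error. Logging the dataset index of that sample alongside it should help narrow down where the bad length comes from.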

I created a fix for this a while back in https://github.com/EleutherAI/gpt-neox/pull/835. We should apply it to Megatron-LM, test that it works, and then contribute it back upstream.

Quentin-Anthony commented 12 months ago

We haven't been able to reproduce this in a while, so we're deprioritizing it for now.