bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Is this assertion for mask wrong? #400

Open yinfangchen opened 7 months ago

yinfangchen commented 7 months ago

I got `AssertionError: Mask is silently ignored due to the use of a custom kernel` when training GPT-2 with `examples/pretrain_gpt.sh`.

The error is raised by this assertion: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/8387ae17c4704f6579f88a84500b535d19d7fbbf/megatron/model/fused_softmax.py#L191

Is this assertion necessary? And is it even correct?
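
For context, the assert sits in the fused causal-softmax path: the custom CUDA kernel builds the upper-triangular (causal) mask internally, so any mask the caller passes in would never be applied. Below is a minimal, self-contained sketch in plain PyTorch (not the actual kernel, and the function names here are illustrative, not Megatron's) of the behavior the assert guards against:

```python
# Sketch of why the assert exists: the fused causal kernel builds the
# triangular mask internally, so a mask passed in by the caller would be
# silently dropped. Emulated here in plain PyTorch.
import torch

def fused_causal_softmax(scores):
    # Emulates the fused kernel: the causal mask is generated inside.
    sq, sk = scores.shape[-2:]
    causal = torch.triu(torch.ones(sq, sk, dtype=torch.bool), diagonal=1)
    return torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)

def unfused_masked_softmax(scores, mask):
    # Emulates the fallback path: applies whatever mask the caller provides.
    return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

scores = torch.randn(2, 4, 8, 8)  # [batch, heads, sq, sk]
causal = torch.triu(torch.ones(8, 8, dtype=torch.bool), diagonal=1)

# When the caller's mask is exactly the causal mask, both paths agree,
# so dropping the mask is harmless...
assert torch.allclose(fused_causal_softmax(scores),
                      unfused_masked_softmax(scores, causal))

# ...but any extra masking the caller asked for is silently lost in the
# fused path, which is exactly the situation the assert is meant to catch.
custom = causal.clone()
custom[5, 2] = True  # additionally mask key 2 for query 5
assert not torch.allclose(fused_causal_softmax(scores),
                          unfused_masked_softmax(scores, custom))
print("fused path ignores the extra masking, as the assert warns")
```

If you actually need a custom mask, then (assuming this fork keeps upstream Megatron's arguments) passing `--no-masked-softmax-fusion` should route attention through the unfused softmax path, which does apply the mask you pass in.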

LordEdison commented 5 months ago

Same puzzlement here.