bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Broken causal masking for gpt_model when used with DeepSpeed without fused kernels #305

Closed · thomasw21 closed this issue 2 years ago

thomasw21 commented 2 years ago

See the originating discussion: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/304#issuecomment-1176182316
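
For context, the bug concerns the non-fused attention path: when fused softmax/masking CUDA kernels are disabled, the causal mask has to be applied explicitly in Python before the softmax, and a regression there lets tokens attend to future positions. The snippet below is a minimal illustrative sketch of how such an explicit causal mask is typically applied in an unfused attention path; the function and variable names are hypothetical and do not correspond to the repo's actual code.

```python
# Illustrative sketch only -- hypothetical names, not Megatron-DeepSpeed's code.
import torch

def causal_attention_probs(q, k):
    # q, k: [batch, heads, seq_len, head_dim]
    seq_len = q.size(-2)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    # Lower-triangular mask: position i may only attend to positions j <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))
    # Masked-out positions get -inf so they receive zero weight after softmax.
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

# Quick sanity check of the causal property.
q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 4, 8)
probs = causal_attention_probs(q, k)
assert torch.allclose(probs.sum(-1), torch.ones(1, 2, 4))
# The first token must place zero weight on all future positions.
assert (probs[..., 0, 1:] == 0).all()
```

If this explicit masking step is skipped or the mask is inverted on the unfused path (while the fused kernels apply the mask internally and so stay correct), the model silently leaks future tokens, which matches the symptom described in the issue title.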