bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Broken causal masking for gpt_model when used with DeepSpeed without fused kernels #305

Closed · thomasw21 closed this issue 2 years ago

thomasw21 commented 2 years ago

See the originating discussion: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/304#issuecomment-1176182316
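
For context, the bug concerns the non-fused attention path: when fused softmax/masking CUDA kernels are disabled, the causal mask has to be applied explicitly in Python before the softmax, and a regression there lets tokens attend to future positions. The snippet below is a minimal illustrative sketch of how such an explicit causal mask is typically applied in an unfused attention path; the function and variable names are hypothetical and do not correspond to the repo's actual code.

```python
# Illustrative sketch only -- hypothetical names, not Megatron-DeepSpeed's code.
import torch

def causal_attention_probs(q, k):
    # q, k: [batch, heads, seq_len, head_dim]
    seq_len = q.size(-2)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    # Lower-triangular mask: position i may only attend to positions j <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))
    # Masked-out positions get -inf so they receive zero weight after softmax.
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

# Quick sanity check of the causal property.
q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 4, 8)
probs = causal_attention_probs(q, k)
assert torch.allclose(probs.sum(-1), torch.ones(1, 2, 4))
# The first token must place zero weight on all future positions.
assert (probs[..., 0, 1:] == 0).all()
```

If this explicit masking step is skipped or the mask is inverted on the unfused path (while the fused kernels apply the mask internally and so stay correct), the model silently leaks future tokens, which matches the symptom described in the issue title.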