bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Fix causal attention mask #306

Closed · thomasw21 closed this 2 years ago

thomasw21 commented 2 years ago

Fixes: #305

While refactoring some code, I accidentally broke causal masking for the non-fused attention kernels. I've fixed it and added a test so we never break it again.
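
For context, here is a minimal sketch (not the actual PR code; function names and shapes are illustrative) of what a causal mask for a non-fused attention path looks like, together with the kind of regression test that catches a broken mask: perturbing a future token must not change earlier outputs.

```python
# Illustrative sketch only — not the Megatron-DeepSpeed implementation.
import torch

def causal_attention(q, k, v):
    # q, k, v: [batch, seq_len, head_dim]
    seq_len = q.size(1)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    # Lower-triangular mask: position i may only attend to positions <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def test_causality():
    torch.manual_seed(0)
    q, k, v = (torch.randn(1, 8, 16) for _ in range(3))
    out = causal_attention(q, k, v)
    # Perturb only the last token; if masking is causal, earlier outputs are unchanged.
    k2, v2 = k.clone(), v.clone()
    k2[:, -1], v2[:, -1] = torch.randn(16), torch.randn(16)
    out2 = causal_attention(q, k2, v2)
    assert torch.allclose(out[:, :-1], out2[:, :-1]), "causal mask is broken"

test_causality()
```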