NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

What is padding_causal? #908


1049451037 commented 3 weeks ago

What is the meaning of padding_causal? If we are already using a causal mask, tokens on the left cannot attend to tokens on the right, so it seems the padding tokens do not need to be distinguished separately in the attention mask. Why do we need padding_causal?

I'm asking because I ran into an issue at these lines in transformer_engine/pytorch/attention.py when passing in causal:

https://github.com/NVIDIA/TransformerEngine/blob/086a12fe9cfade2d49eaa388b991d397e8168477/transformer_engine/pytorch/attention.py#L3720-L3722
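
To illustrate the distinction in question, here is a minimal sketch of generic attention-mask semantics (not TransformerEngine's internal code), assuming the convention that True means a position is masked out. With a left-padded sequence, a plain causal mask would still let real tokens attend to padding keys; a combined padding + causal mask rules that out.

```python
import torch

# Minimal sketch (not TransformerEngine code): contrast a pure causal mask
# with a combined padding + causal mask for one left-padded sequence.
# Assumed convention: True = masked out (attention not allowed).

S = 6
pad = torch.tensor([True, True, False, False, False, False])  # 2 left-pad tokens

# Causal mask: query i may only attend to keys j <= i.
causal = torch.triu(torch.ones(S, S), diagonal=1).bool()

# Padding mask: no query may attend to a padding key.
padding = pad[None, :].expand(S, S)

# Combined "padding + causal" mask: both constraints at once.
padding_causal = causal | padding

# Positions where the two masks disagree: every query that causal alone
# would let attend to a padding key (columns 0 and 1), including the
# real-token queries in rows 2-5.
print((causal != padding_causal).nonzero())
```

With right padding the two masks happen to agree on the real-token rows, but the rows for padding queries still differ, so in general the two mask types are not interchangeable for batches of variable-length sequences.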