A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
What is the meaning of `padding_causal`? With `causal`, tokens on the left cannot attend to tokens on the right, so the padding tokens do not necessarily need to be distinguished in the attention mask. Why do we need `padding_causal`?

I am asking because I ran into this issue in transformer_engine/pytorch/attention.py when I pass in `causal`:
https://github.com/NVIDIA/TransformerEngine/blob/086a12fe9cfade2d49eaa388b991d397e8168477/transformer_engine/pytorch/attention.py#L3720-L3722
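To make the question concrete, here is a minimal sketch (my own code, not TransformerEngine's implementation) of how I understand the two mask types: `causal` only blocks future key positions, while `padding_causal` additionally blocks keys beyond each sequence's actual length. The sequence lengths and shapes below are made up for illustration.

```python
import torch

seq_len = 6
actual_len = 4  # positions 4 and 5 are right padding (assumed for this example)

# "causal": query i may attend to keys j <= i
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()

# padding: keys at padded positions are masked out
key_is_real = torch.arange(seq_len) < actual_len  # [T, T, T, T, F, F]

# "padding_causal": both constraints combined
padding_causal = causal & key_is_real.unsqueeze(0)

print(causal.int())
print(padding_causal.int())
```

With right padding, the rows for the real (non-padded) queries come out identical in both masks, which is exactly why I am unsure what `padding_causal` adds over `causal`.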