A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
It seems that, at the current time, packed tensors in `thd` format are not supported by `transformer_engine.pytorch.attention.DotProductAttention`. That's strange, since this mode is clearly supported by `fused_attn_fwd` from the `fused_attn` cpp_extensions.

I see that `FusedAttnFunc` is used in `FusedAttention`, but implementations for `FusedAttnFunc_kvpacked` and `FusedAttnFunc_qkvpacked` are not present. I suppose they could be added in the same way.
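For illustration, here is a minimal sketch of what a `FusedAttnFunc_qkvpacked` wrapper could look like if it mirrored the existing `FusedAttnFunc` pattern. `fused_attn_fwd_qkvpacked` and `fused_attn_bwd_qkvpacked` do exist in `transformer_engine.pytorch.cpp_extensions`, but their exact argument lists vary between TE versions, so the calls below are schematic rather than verbatim:

```python
# Hypothetical sketch, not actual TransformerEngine code.
# fused_attn_fwd_qkvpacked / fused_attn_bwd_qkvpacked do exist in
# cpp_extensions, but argument names here are illustrative and may
# differ between TE versions.
import torch
from transformer_engine.pytorch.cpp_extensions import (
    fused_attn_fwd_qkvpacked,
    fused_attn_bwd_qkvpacked,
)


class FusedAttnFunc_qkvpacked(torch.autograd.Function):
    """Autograd wrapper over the qkv-packed fused attention kernels,
    taking one packed qkv tensor (thd layout) plus cu_seqlens instead
    of the separate q/k/v tensors that FusedAttnFunc expects."""

    @staticmethod
    def forward(ctx, is_training, max_seqlen, cu_seqlens, qkv, qkv_dtype,
                attn_scale, dropout_p, fast_zero_fill, qkv_layout,
                attn_bias_type, attn_mask_type, fused_attention_backend):
        out, aux_ctx_tensors = fused_attn_fwd_qkvpacked(
            is_training, max_seqlen, cu_seqlens, qkv, qkv_dtype,
            fused_attention_backend,
            attn_scale=attn_scale, dropout=dropout_p,
            fast_zero_fill=fast_zero_fill, qkv_layout=qkv_layout,
            attn_bias_type=attn_bias_type, attn_mask_type=attn_mask_type,
        )
        # Stash tensors and kernel parameters needed by backward.
        ctx.save_for_backward(qkv, out, cu_seqlens, *aux_ctx_tensors)
        ctx.kernel_args = (max_seqlen, qkv_dtype, attn_scale, dropout_p,
                           fast_zero_fill, qkv_layout, attn_bias_type,
                           attn_mask_type, fused_attention_backend)
        return out

    @staticmethod
    def backward(ctx, d_out):
        qkv, out, cu_seqlens, *aux_ctx_tensors = ctx.saved_tensors
        (max_seqlen, qkv_dtype, attn_scale, dropout_p, fast_zero_fill,
         qkv_layout, attn_bias_type, attn_mask_type,
         fused_attention_backend) = ctx.kernel_args
        dqkv, *_ = fused_attn_bwd_qkvpacked(
            max_seqlen, cu_seqlens, qkv, out, d_out, qkv_dtype,
            aux_ctx_tensors, fused_attention_backend,
            attn_scale=attn_scale, dropout=dropout_p,
            fast_zero_fill=fast_zero_fill, qkv_layout=qkv_layout,
            attn_bias_type=attn_bias_type, attn_mask_type=attn_mask_type,
        )
        # One gradient slot per forward input; only qkv gets a gradient.
        return (None, None, None, dqkv) + (None,) * 8
```

`DotProductAttention` would then dispatch on the layout: when it receives a single packed `thd` tensor plus `cu_seqlens`, it calls `FusedAttnFunc_qkvpacked.apply(...)` instead of `FusedAttnFunc.apply(...)`. A `FusedAttnFunc_kvpacked` variant would presumably follow the same pattern with `fused_attn_fwd_kvpacked` / `fused_attn_bwd_kvpacked`.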