Open Superkeyv opened 1 month ago
The DotProductAttention implementation multiplies by the wrong scaling factor
This PR provides a simple fix.
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/dot_product_attention.py#L67-L81
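For context, standard scaled dot-product attention divides the query–key scores by `sqrt(head_dim)` (equivalently, multiplies by `head_dim ** -0.5`) exactly once before the softmax. The sketch below is illustrative only, not Megatron-LM's actual code; the function name and NumPy implementation are my own assumptions, shown just to make the expected scaling behavior concrete:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k, v: arrays of shape (batch, seq, head_dim).
    # The canonical scaling is 1 / sqrt(head_dim), applied once
    # to the raw QK^T scores (not, e.g., 1 / head_dim or applied twice).
    head_dim = q.shape[-1]
    scores = (q @ k.transpose(0, 2, 1)) / np.sqrt(head_dim)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With an incorrect scale the softmax temperature is off: too large a factor saturates the attention weights, too small a factor flattens them toward uniform, so the fix matters even though the code change is small.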