NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Wrong softmax scaling in the local transformer implementation #848

Open Superkeyv opened 1 month ago

Superkeyv commented 1 month ago

The DotProductAttention implementation applies the wrong scaling factor to the attention scores before the softmax.

This PR provides a simple fix:

https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/dot_product_attention.py#L67-L81
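For reference, standard scaled dot-product attention divides the raw query-key scores by sqrt(head_dim) before the softmax; scaling by any other quantity changes the softmax temperature and distorts the attention distribution. Below is a minimal sketch of the expected behavior (function name and tensor shapes are illustrative, not Megatron-LM's actual API):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    # Scores must be scaled by 1/sqrt(head_dim); using a different
    # factor (e.g. the full hidden size) is the kind of bug this
    # issue describes.
    head_dim = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(head_dim)
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

if __name__ == "__main__":
    q = k = v = torch.randn(2, 8, 16, 64)  # batch=2, heads=8, seq=16, head_dim=64
    out = scaled_dot_product_attention(q, k, v)
    print(out.shape)  # torch.Size([2, 8, 16, 64])
```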