Open Superkeyv opened 1 month ago
The DotProductAttention implementation multiplies by the wrong scaling factor
This PR provides a simple fix.
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/dot_product_attention.py#L67-L81
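For context, standard scaled dot-product attention divides the query–key scores by `sqrt(head_dim)` (equivalently, multiplies by `head_dim ** -0.5`) exactly once before the softmax. The sketch below is illustrative only, not Megatron-LM's actual code; the function name and NumPy implementation are my own assumptions, shown just to make the expected scaling behavior concrete:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k, v: arrays of shape (batch, seq, head_dim).
    # The canonical scaling is 1 / sqrt(head_dim), applied once
    # to the raw QK^T scores (not, e.g., 1 / head_dim or applied twice).
    head_dim = q.shape[-1]
    scores = (q @ k.transpose(0, 2, 1)) / np.sqrt(head_dim)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With an incorrect scale the softmax temperature is off: too large a factor saturates the attention weights, too small a factor flattens them toward uniform, so the fix matters even though the code change is small.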