Closed fcampagnexandr closed 3 years ago
Cast k andf values to float, cast back output to type of input, avoid instability as attention weights grow during training, per https://twitter.com/tsuname/status/1430653484827697155?s=20
Cast k andf values to float, cast back output to type of input, avoid instability as attention weights grow during training, per https://twitter.com/tsuname/status/1430653484827697155?s=20