Closed ZeriLinux closed 2 years ago
I'm not really sure we actually need the d**-0.5 scaling, because there are also some repositories that don't do this operation.
@ZeriLinux
The d**-0.5 scaling is due to a property of softmax: once the variance of the input is too big, the softmax output collapses toward a one-hot vector. Maybe sigmoid doesn't need the scaling operation.
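A small sketch of this effect (assuming NumPy; the dimension `d = 512` and the 10 keys are just illustrative values, not from this repo):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max())
    return e / e.sum()

d = 512
rng = np.random.default_rng(0)
q = rng.standard_normal(d)        # one query vector
k = rng.standard_normal((10, d))  # 10 key vectors

scores = k @ q             # dot products: variance grows with d
scaled = scores * d**-0.5  # rescaled to roughly unit variance

# Without scaling, one logit tends to dominate and softmax is
# nearly one-hot; with scaling, probability mass is spread out.
print(softmax(scores).max())
print(softmax(scaled).max())
```

So the scaling keeps the softmax (and its gradients) from saturating; sigmoid applies elementwise and doesn't have this competition between logits.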
Regarding the use of self-attention: if the activation function is sigmoid instead of softmax, do you still need the d**-0.5 scaling? The Transformer uses softmax, and the scores are scaled before that operation.