Closed ZeriLinux closed 2 years ago
I'm not really sure we actually need the d**-0.5 scaling, because there are also some repositories that don't do this operation.
@ZeriLinux
The d**-0.5 scaling is due to a property of softmax: once the variance of the input is too big, the softmax output collapses toward a one-hot vector. Maybe sigmoid doesn't need the scaling operation.
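A small sketch of this effect (assuming NumPy; the dimension `d = 512` and the 10 keys are just illustrative values, not from this repo):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max())
    return e / e.sum()

d = 512
rng = np.random.default_rng(0)
q = rng.standard_normal(d)        # one query vector
k = rng.standard_normal((10, d))  # 10 key vectors

scores = k @ q             # dot products: variance grows with d
scaled = scores * d**-0.5  # rescaled to roughly unit variance

# Without scaling, one logit tends to dominate and softmax is
# nearly one-hot; with scaling, probability mass is spread out.
print(softmax(scores).max())
print(softmax(scaled).max())
```

So the scaling keeps the softmax (and its gradients) from saturating; sigmoid applies elementwise and doesn't have this competition between logits.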
Regarding the use of self-attention: if the activation function is sigmoid instead of softmax, do you still need the d**-0.5 scaling? The Transformer uses softmax, and the scores are scaled before that operation.