Hi, actually my model implementation strictly follows the one in the paper.
If you look into PyTorch's TransformerEncoderLayer implementation, you will find it uses the order self_attn -> residual -> norm -> pointwise_ff -> residual -> norm. However, in End-to-End Neural Speaker Diarization with Self-attention, Fig. 2, the encoder block is defined as norm -> self_attn -> residual -> norm -> pointwise_ff -> residual, with a LayerNorm at the end (before the linear + sigmoid).
Thus, applying a LayerNorm at the beginning of the PyTorch encoder stack is equivalent to what they do in the paper: the trailing norm of each PyTorch layer acts as the leading norm of the next block, and the last layer's norm becomes the final LayerNorm before the linear + sigmoid.
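Just to make that equivalence concrete, here is a minimal sketch of the idea (the class name, dimensions, and the two-speaker output head are illustrative assumptions, not this repository's actual code):

```python
import torch
import torch.nn as nn


class EncoderWithInitialNorm(nn.Module):
    """Illustrative sketch: a LayerNorm placed before a stack of PyTorch's
    post-norm TransformerEncoderLayer blocks.

    Each PyTorch layer computes
        self_attn -> residual -> norm -> pointwise_ff -> residual -> norm,
    so prepending one LayerNorm regroups the whole stack into repeated
        norm -> self_attn -> residual -> norm -> pointwise_ff -> residual
    blocks, with the last layer's norm serving as the final LayerNorm
    before the linear + sigmoid head, as in Fig. 2 of the paper.
    """

    def __init__(self, d_model=256, nhead=4, num_layers=2,
                 dim_feedforward=1024, n_speakers=2):
        super().__init__()
        self.initial_norm = nn.LayerNorm(d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=dim_feedforward)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, n_speakers)

    def forward(self, x):
        # x: (seq_len, batch, d_model) frame-level features
        x = self.initial_norm(x)   # the "extra" norm applied at the beginning
        x = self.encoder(x)        # PyTorch's post-norm encoder layers
        return torch.sigmoid(self.head(x))  # per-frame, per-speaker activity


# Usage: 500 frames, batch of 8, 256-dim features -> probabilities in (0, 1)
probs = EncoderWithInitialNorm()(torch.randn(500, 8, 256))
```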
Yeah, then everything makes sense. What a neat adaptation, thanks. Closing this now.
Hi,
In End-to-End Neural Speaker Diarization with Self-attention, Fig. 2, LayerNorm is applied after the encoder blocks, but in your implementation the order is reversed. Are there any particular reasons for that? Have a good day.