Xflick / EEND_PyTorch

A PyTorch implementation of End-to-End Neural Diarization
MIT License

Question about the difference between the Transformer implementation and the original architecture in the paper #2

Closed: tumbleintoyourheart closed this issue 4 years ago

tumbleintoyourheart commented 4 years ago

Hi,

In End-to-End Neural Speaker Diarization with Self-Attention (Fig. 2), LayerNorm is applied after the encoder blocks, but in your implementation the order is reversed. Is there a particular reason for that?

Have a good day.

Xflick commented 4 years ago

Hi, my model implementation actually follows the one in the paper strictly.

If you look into PyTorch's TransformerEncoderLayer implementation, you will find it applies operations in the order: self_attn -> residual -> norm -> pointwise_ff -> residual -> norm. However, in End-to-End Neural Speaker Diarization with Self-Attention (Fig. 2), the encoder block is defined as: norm -> self_attn -> residual -> norm -> pointwise_ff -> residual, with a LayerNorm at the end of the stack (before the linear + sigmoid output).

Thus, applying a LayerNorm to the input at the beginning of the PyTorch code is equivalent to what they do in the paper.
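
For concreteness, here is a minimal sketch of the composition described above. The dimensions and variable names are made up for illustration and are not taken from this repository:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only (not from this repository).
d_model, n_heads, n_layers = 256, 4, 2

# PyTorch's default (post-norm) encoder layer computes, per layer:
#   x = norm1(x + self_attn(x))
#   x = norm2(x + feed_forward(x))
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                   dim_feedforward=1024, dropout=0.1)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

# Normalizing the input once before the stack shifts every norm one step
# earlier in the composed sequence of operations, so the whole chain reads:
#   norm -> self_attn -> residual -> norm -> ff -> residual -> ... -> norm
# i.e. the encoder-block ordering plus the final LayerNorm from Fig. 2.
input_norm = nn.LayerNorm(d_model)

x = torch.randn(500, 8, d_model)  # (time, batch, feature); arbitrary sizes
y = encoder(input_norm(x))        # shape: (500, 8, d_model)
```

More recent PyTorch versions also expose a norm_first flag on TransformerEncoderLayer for building pre-norm blocks directly; the sketch above reproduces that ordering with the default post-norm layers.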

tumbleintoyourheart commented 4 years ago

Yeah, then everything makes sense. What a neat adaptation, thanks. Closing this now.