I would like to ask why a two-way (bidirectional) self-attention is necessary here. As far as I know, in the Transformer the encoder's MHSA is bidirectional (as in BERT), while the decoder's masked MHSA hides future positions so that each position attends only to past information (as in GPT). Also, in your code the input to the Transformer is not transposed, whereas standard practice is to transform [batch, channel, length] into [batch, length, channel].
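To make the question concrete, here is a minimal sketch (not the repository's code) of what I mean by the layout convention and the two attention variants, assuming a standard PyTorch `nn.TransformerEncoderLayer`; the tensor names and sizes are just for illustration:

```python
import torch
import torch.nn as nn

batch, channels, length = 4, 64, 100
x = torch.randn(batch, channels, length)   # [batch, channel, length], e.g. a conv feature map

# Standard practice: move the sequence dimension before the feature dimension,
# so each time step becomes a token of size `channels`.
x = x.permute(0, 2, 1)                     # -> [batch, length, channel]

layer = nn.TransformerEncoderLayer(d_model=channels, nhead=8, batch_first=True)

# Encoder-style (bidirectional) self-attention: no mask, every position
# attends to every other position (BERT-style).
out_bidirectional = layer(x)

# Decoder-style (causal) self-attention: an upper-triangular mask hides
# future positions, so position t attends only to positions <= t (GPT-style).
causal_mask = nn.Transformer.generate_square_subsequent_mask(length)
out_causal = layer(x, src_mask=causal_mask)
```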