Closed tau-yihouxiang closed 3 months ago
Why is x's shape permuted from NLD to LND? This seems to differ from the principle, since it looks as if attention were being computed along the batch dimension for each word.
```python
x = x.permute(1, 0, 2)  # NLD -> LND
for i in range(self.num_layers):
    x = self.transformer[i](x)
x = x.permute(1, 0, 2)  # LND -> NLD
```
Because they don't use MultiheadAttention with `batch_first=True`. By default, PyTorch's `nn.MultiheadAttention` expects inputs of shape (L, N, E), i.e. sequence-first, so the tensor has to be permuted before the transformer blocks and permuted back afterwards.
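For illustration only (not the repository's actual code), here is a minimal sketch of the two layouts. It assumes a recent PyTorch where `nn.MultiheadAttention` accepts `batch_first`; with the default `batch_first=False` the permutes are required, while `batch_first=True` lets you keep the NLD layout throughout.

```python
import torch
import torch.nn as nn

N, L, D = 2, 5, 64          # batch size, sequence length, embedding dim
x = torch.randn(N, L, D)    # NLD layout, as produced by the embedding

# Default: batch_first=False, so the module expects (L, N, E) -> permute needed
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8)
x_lnd = x.permute(1, 0, 2)            # NLD -> LND
out, _ = attn(x_lnd, x_lnd, x_lnd)    # out has shape (L, N, D)
out = out.permute(1, 0, 2)            # LND -> NLD

# With batch_first=True the permutes are unnecessary
attn_bf = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
out_bf, _ = attn_bf(x, x, x)          # out_bf is already (N, L, D)
```

Either way the attention itself is computed over the sequence dimension within each sample; the permute only changes the memory layout the module expects, not which tokens attend to which.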
Thanks~