Closed tau-yihouxiang closed 3 months ago
Why is x's shape permuted from NLD to LND? This seems to differ from the principle, since it looks as if attention were being computed along the batch dimension for each word.
```python
x = x.permute(1, 0, 2)  # NLD -> LND
for i in range(self.num_layers):
    x = self.transformer[i](x)
x = x.permute(1, 0, 2)  # LND -> NLD
```
Because they don't use MultiheadAttention with `batch_first=True`. By default, PyTorch's `nn.MultiheadAttention` expects inputs of shape (L, N, E), i.e. sequence-first, so the tensor has to be permuted before the transformer blocks and permuted back afterwards.
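For illustration only (not the repository's actual code), here is a minimal sketch of the two layouts. It assumes a recent PyTorch where `nn.MultiheadAttention` accepts `batch_first`; with the default `batch_first=False` the permutes are required, while `batch_first=True` lets you keep the NLD layout throughout.

```python
import torch
import torch.nn as nn

N, L, D = 2, 5, 64          # batch size, sequence length, embedding dim
x = torch.randn(N, L, D)    # NLD layout, as produced by the embedding

# Default: batch_first=False, so the module expects (L, N, E) -> permute needed
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8)
x_lnd = x.permute(1, 0, 2)            # NLD -> LND
out, _ = attn(x_lnd, x_lnd, x_lnd)    # out has shape (L, N, D)
out = out.permute(1, 0, 2)            # LND -> NLD

# With batch_first=True the permutes are unnecessary
attn_bf = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
out_bf, _ = attn_bf(x, x, x)          # out_bf is already (N, L, D)
```

Either way the attention itself is computed over the sequence dimension within each sample; the permute only changes the memory layout the module expects, not which tokens attend to which.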
Thanks~