OliverRensu / Shunted-Transformer


Why do you discard token-to-token attention in your model? #21

Open leoozy opened 1 year ago

leoozy commented 1 year ago

Dear authors, thank you for your work. I found that in your model you split the attention heads into two modes (heads < num_heads/2 and heads >= num_heads/2), and both modes attend to a compressed key/value sequence. Why do you discard the original N×N attention in your model? For example, you could split the heads into three modes. Thank you.

OliverRensu commented 1 year ago

Theoretically, we can split the H heads into up to H modes. The key is how to choose the down-sampling rate $r$. For example, with two modes we choose r = 4, 8 at stage 1. We could instead take four modes with r = 1, 2, 4, 8 (C = 64, heads = 4), where r = 1 corresponds to the original N×N attention. However, the memory consumption and computation cost (especially for large inputs such as 512×512 in segmentation and 1000×1000 in detection) would be unacceptable: the smaller r is, the heavier the computation cost. In stage 3, where the cost of N×N attention (N = H/16 × W/16) is affordable, we take r = 1 and keep the original attention.
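To make the head-splitting concrete, here is a minimal PyTorch sketch of the idea under discussion, assuming two head groups with rates (4, 8) as in stage 1. It is not the repository's actual implementation; the class name `TwoRateAttention` and its parameters are made up for illustration. Each group's key/value sequence is shortened from N to N/r², which is why small r (and especially r = 1) becomes expensive at high resolution.

```python
import torch
import torch.nn as nn


class TwoRateAttention(nn.Module):
    """Sketch: heads split into groups; each group attends to K/V down-sampled by its own rate r."""

    def __init__(self, dim=64, num_heads=4, rates=(4, 8)):
        super().__init__()
        assert num_heads % len(rates) == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.rates = rates
        self.q = nn.Linear(dim, dim)
        # One K/V projection and one strided-conv down-sampler per rate group.
        group_dim = dim // len(rates)
        self.kv = nn.ModuleList([nn.Linear(dim, 2 * group_dim) for _ in rates])
        self.pool = nn.ModuleList(
            [nn.Conv2d(dim, dim, kernel_size=r, stride=r) if r > 1 else nn.Identity()
             for r in rates]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                        # N = H * W tokens
        heads_per_group = self.num_heads // len(self.rates)
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        outs = []
        for i, r in enumerate(self.rates):
            # Down-sample the K/V tokens by r per spatial dim: sequence length N -> N / r^2.
            feat = x.transpose(1, 2).reshape(B, C, H, W)
            feat = self.pool[i](feat).reshape(B, C, -1).transpose(1, 2)
            kv = self.kv[i](feat).reshape(B, -1, 2, heads_per_group, self.head_dim)
            k, v = kv.permute(2, 0, 3, 1, 4)     # each: (B, heads_per_group, N/r^2, head_dim)
            qg = q[:, i * heads_per_group:(i + 1) * heads_per_group]
            attn = (qg @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N/r^2)
            outs.append(attn.softmax(dim=-1) @ v)
        out = torch.cat(outs, dim=1).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# r = 1 for a group would recover plain N x N attention for those heads; at a 512x512
# input with stage-1 stride 4, N = 128 * 128 = 16384, so a full N x N map per head is huge.
x = torch.randn(1, 56 * 56, 64)
attn = TwoRateAttention(dim=64, num_heads=4, rates=(4, 8))
print(attn(x, 56, 56).shape)   # torch.Size([1, 3136, 64])
```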


leoozy commented 1 year ago


Thank you for your rapid reply. Your work is excellent, but I still have some confusion about how to design such an architecture.

  1. I found that in Eq. 1 of the paper, the weights $W^K$, $W^V$ differ from head to head, whereas in the traditional ViT the weights $W^K$, $W^V$ are shared among heads. Will this lead to more parameters if I want to use more modes?

  2. In your architecture, you use more parameters than ViT (e.g., the Conv2d operation). Am I right?

OliverRensu commented 1 year ago
  1. In ViT, W is also different for different heads, but it is implemented as one linear layer, which makes it look like a shared weight. For example, W is (·, 512) for 8 heads: conceptually there are 8 W matrices of shape (·, 64), but they are implemented by a single layer (see the check after this list).
  2. We use a similar number of parameters and similar computation cost (slightly more or slightly fewer) compared with previous ViT and its variants.
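A small PyTorch check of point 1, with shapes assumed for illustration (dim = 512, 8 heads of width 64): a single (512, 512) linear projection already holds 8 distinct per-head weight matrices, and slicing its weight recovers each head's own projection, so the heads do not literally share one W.

```python
import torch
import torch.nn as nn

dim, num_heads = 512, 8          # assumed example sizes: 8 heads of width 64
head_dim = dim // num_heads

x = torch.randn(2, 196, dim)     # (batch, tokens, channels)
proj = nn.Linear(dim, dim, bias=False)

# One projection for all heads, then split the output into 8 chunks of 64.
full = proj(x).view(2, 196, num_heads, head_dim)

# The same result head by head: each head uses its own 64-row slice of the weight.
per_head = torch.stack(
    [x @ proj.weight[h * head_dim:(h + 1) * head_dim].t() for h in range(num_heads)],
    dim=2,
)

print(torch.allclose(full, per_head, atol=1e-5))  # True: distinct per-head weights, one layer
```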