Hi, I think there may be a misunderstanding here. After the attention, the features will be of shape `[B, N, C]`. Once it goes through a ToMe module, that becomes `[B, N-r, C]`. The MLP afterward can accept any number of tokens, so after it passes through the MLP, it'll still be `[B, N-r, C]`.
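For illustration, here is a minimal PyTorch sketch (with example sizes `B=8, N=197, C=768, r=16`, not taken from the thread) of why the MLP doesn't care how many tokens it receives: a ViT MLP acts per token over the channel dimension, so both shapes pass through with the token count untouched.

```python
import torch
import torch.nn as nn

B, N, C, r = 8, 197, 768, 16  # example sizes only

# A standard ViT MLP: two Linear layers over the channel dim, applied per token.
mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))

x_full = torch.randn(B, N, C)        # features after attention
x_merged = torch.randn(B, N - r, C)  # features after a ToMe merge step

print(mlp(x_full).shape)    # torch.Size([8, 197, 768])
print(mlp(x_merged).shape)  # torch.Size([8, 181, 768])
```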
If you tested this manually and didn't see a reduction, perhaps you didn't set `r` properly? The default `r` after a patch is 0 (no reduction). To set it for the model, use `patched_model.r = #`.
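As a concrete sketch, assuming the `tome.patch.timm` interface from this repo's README and a timm ViT-B/16 (adjust to whatever model and patch function you are actually using):

```python
import timm
import tome

# Create a timm ViT, patch it with ToMe, then choose how many tokens to merge per block.
model = timm.create_model("vit_base_patch16_224", pretrained=True)
tome.patch.timm(model)

model.r = 16  # merge 16 tokens per block; the default of 0 means no reduction
```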
Thank you for your response. So the only modules that benefit from having fewer tokens are the MLP blocks in the transformer blocks, and all the attention blocks still have to pad the reduced tokens since they have a constant `max_len`. Correct me if I am wrong.
Not quite: most attention implementations support any number of tokens. It's just that the number of tokens per image in the batch needs to be the same. Since we reduce a constant number of tokens per image, this is not a problem.
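A quick sketch of that point (example sizes only, using PyTorch's `nn.MultiheadAttention` rather than the repo's own attention): the same attention module accepts different token counts without any padding, as long as every image in the batch has the same count.

```python
import torch
import torch.nn as nn

C = 768
attn = nn.MultiheadAttention(embed_dim=C, num_heads=12, batch_first=True)

for n_tokens in (197, 181, 165):      # e.g. N, N-r, N-2r with r = 16
    x = torch.randn(8, n_tokens, C)   # every image in the batch has the same count
    out, _ = attn(x, x, x)
    print(out.shape)                  # matches the input: [8, n_tokens, 768]
```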
So when we reduce by `r` tokens in this block, the MLP for this block and the Attn for the next block get `N-r` tokens instead of `N` tokens. Then the block after that would get `N-2r`, etc., all the way down to just a couple of tokens total.
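To make that schedule concrete, a small sketch with example numbers (197 tokens, `r = 16`, 12 blocks, roughly a ViT-B/16 with a class token; none of these values come from the thread):

```python
# How the token count shrinks block by block when each block merges r tokens.
N, r, depth = 197, 16, 12

tokens = N
for block in range(depth):
    tokens = max(tokens - r, 1)  # this block's ToMe step removes r tokens
    print(f"after block {block}: {tokens} tokens go into the next block")
# 12 * 16 = 192 tokens are merged away, leaving only a handful by the last block.
```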
You are right, thanks a lot. Great work btw, good luck.
Hi, I understood that after the tokens come from the attention module, you feed them into a ToMe block. Afterwards, the number of tokens becomes `N-r`. But if you haven't changed the input dimension of the MLP (which comes after the ToMe block) to `N-r`, how exactly can you claim that you are reducing tokens? And how can this help increase throughput if there are no changes in your model's dimensions before and after modifying it?