facebookresearch / ToMe

A method to increase the speed and lower the memory footprint of existing vision transformers.

How exactly are tokens reduced when there is no change in the model dimensions before and after tome.patch? #22

Closed amirrezarajabi closed 1 year ago

amirrezarajabi commented 1 year ago

Hi, I understand that after the tokens come out of the attention module, you feed them into a ToMe block, after which the number of tokens becomes N-r. But if you haven't changed the input dimension of the MLP (which comes after the ToMe block) to N-r, how can you claim that you are reducing tokens? And how can this increase throughput if there are no changes to the model's dimensions before and after patching it?

dbolya commented 1 year ago

Hi, I think there may be a misunderstanding here. After the attention, the features will be of shape [B, N, C]. Once that goes through a ToMe module, the shape becomes [B, N-r, C]. The MLP afterward can accept any number of tokens, so after passing through the MLP the features are still [B, N-r, C].
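A minimal shape sketch of this, assuming a standard timm-style per-token MLP and a toy "merge" that simply drops r tokens (ToMe's real merge uses bipartite soft matching, not truncation):

```python
import torch
import torch.nn as nn

B, N, C, r = 8, 197, 768, 16

x = torch.randn(B, N, C)              # output of attention: [B, N, C]

# Toy stand-in for a ToMe merge: reduce the token dimension by r.
# (ToMe actually merges the r most similar token pairs; this just truncates.)
x = x[:, : N - r, :]                  # [B, N-r, C]

# A standard ViT MLP acts on the channel dimension only,
# so any number of tokens passes through unchanged in count.
mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))
out = mlp(x)
print(out.shape)                      # torch.Size([8, 181, 768])
```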

If you tested this manually and didn't see a reduction, perhaps you didn't set r properly? The default r after a patch is 0 (no reduction). To set it, use patched_model.r = #.
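For reference, the usage pattern looks roughly like this (a sketch assuming the timm patch; the value r = 16 is just an example):

```python
import timm
import tome

# Load a standard timm ViT and patch ToMe into it.
model = timm.create_model("vit_base_patch16_224", pretrained=True)
tome.patch.timm(model)

# By default r = 0 after patching, i.e. no token reduction.
model.r = 16  # now each block merges 16 tokens
```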

amirrezarajabi commented 1 year ago

Thank you for your response. So the only modules that benefit from having fewer tokens are the MLP blocks inside the transformer blocks, and all the attention blocks still have to pad the reduced tokens back up since they have a fixed max length? Correct me if I am wrong.

dbolya commented 1 year ago

Not quite: most attention implementations support any number of tokens. It's just that the number of tokens per image in the batch needs to be the same. Since we reduce a constant number of tokens per image, this is not a problem.

So when we reduce by r tokens in this block, the MLP for this block and the attention for the next block get N-r tokens instead of N. The block after that gets N-2r, and so on, all the way down to just a couple of tokens in total.
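As an illustrative calculation only (assuming ViT-B/16 at 224x224, i.e. 197 tokens, 12 blocks, and a constant r = 16):

```python
# Token count seen by the MLP of block i and the attention of block i+1.
N, r, depth = 197, 16, 12

tokens = N
for i in range(depth):
    tokens -= r
    print(f"after block {i}: {tokens} tokens")
# 181, 165, ..., 5 -- only a handful of tokens remain by the last block.
```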

amirrezarajabi commented 1 year ago

You are right, thanks a lot. Great work btw, good luck.