facebookresearch / ToMe

A method to increase the speed and lower the memory footprint of existing vision transformers.

How to implement ToMe to image encoders in CLIP model? #36

Closed Bostoncake closed 3 months ago

Bostoncake commented 11 months ago

I tried to implement ToMe in the image encoder of the CLIP model. However, the ViT in CLIP uses nn.MultiheadAttention, whose forward process I couldn't modify. Do you have any ideas on how to apply ToMe to the original CLIP models? Thanks!

dbolya commented 11 months ago

There's currently a PR for this: #21. The long and short of it is that editing the attn layer is not necessary; it just improves performance. You can try it without that modification.
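To illustrate the point above, the core of ToMe (bipartite soft matching) can be applied to a block's token outputs without touching the attention layer at all. Below is a minimal, hedged NumPy sketch of that merge step, not the library's actual implementation: it splits tokens into two alternating sets, finds the `r` most similar pairs by cosine similarity, and averages each merged token into its match. The real ToMe uses attention keys for similarity and proportional-attention weighting, which is exactly the optional attn-layer edit mentioned here.

```python
import numpy as np

def bipartite_soft_matching(tokens: np.ndarray, r: int) -> np.ndarray:
    """Simplified ToMe-style merge: reduce (N, D) tokens to (N - r, D).

    Sketch only: similarity is computed on the tokens themselves
    (the paper uses attention keys) and merging is a plain average.
    """
    a, b = tokens[0::2], tokens[1::2]  # alternate tokens into two sets

    # Cosine similarity between every token in A and every token in B.
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    scores = an @ bn.T                        # shape (|A|, |B|)

    best_b = scores.argmax(axis=1)            # best match in B for each A token
    best_score = scores.max(axis=1)
    merge_idx = np.argsort(-best_score)[:r]   # the r most similar A tokens
    keep_mask = np.ones(len(a), dtype=bool)
    keep_mask[merge_idx] = False

    merged_b = b.copy()
    counts = np.ones(len(b))
    for i in merge_idx:                       # average each merged A token into B
        j = best_b[i]
        merged_b[j] = (merged_b[j] * counts[j] + a[i]) / (counts[j] + 1)
        counts[j] += 1

    return np.concatenate([a[keep_mask], merged_b], axis=0)
```

Since this runs purely on a block's output tokens, it can be wrapped around CLIP's transformer blocks from the outside (e.g. with forward hooks), which is why the nn.MultiheadAttention internals don't have to be edited to try ToMe.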