facebookresearch / ToMe

A method to increase the speed and lower the memory footprint of existing vision transformers.

About inference acceleration for multi-modal large language models #38

Closed · JoaquinChou closed 5 months ago

JoaquinChou commented 5 months ago

Hello, I would like to ask: could the Token Merging method be used to accelerate inference on current multi-modal large language models such as LLaVA and VILA?

dbolya commented 5 months ago

It should work for the vision encoder out of the box (assuming it uses ViT), but applying it to the language model is an open research question right now. The problem is that language models use causal attention, so you can't just merge tokens arbitrarily.
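For reference, applying ToMe to a timm ViT follows the pattern in this repo's README (the model name here is just an example):

```python
import timm
import tome

# Load any timm ViT; "vit_base_patch16_224" is just an example.
model = timm.create_model("vit_base_patch16_224", pretrained=True)

# Patch the model in place so its blocks merge tokens during the forward pass.
tome.patch.timm(model)

# Merge r tokens per layer; higher r means faster inference at some accuracy cost.
model.r = 16
```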

JoaquinChou commented 5 months ago

Thanks for your answer. After reading the paper carefully, I still have two questions:

  1. For the position embeddings of the ViT model: after the tokens are merged, how should the position embeddings be merged?
  2. Is the token merging only done within the ViT? Is the number of tokens in the ViT's output reduced accordingly?
dbolya commented 5 months ago
  1. If there are any extra position embeddings, you can just merge those the same way you merged the tokens (see the sketch below).
  2. Yes, assuming that there's no global average pooling or anything like that.
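To make both answers concrete, here is a minimal sketch using the (merge, unmerge) pair returned by tome.merge.bipartite_soft_matching. The shapes are illustrative, and inside the real ToMe blocks the attention keys (not the tokens themselves) are used as the matching metric:

```python
import torch
from tome.merge import bipartite_soft_matching

B, N, C = 1, 197, 768       # batch, tokens (196 patches + cls), channels; example sizes
x = torch.randn(B, N, C)    # token features
pos = torch.randn(B, N, C)  # per-token position embeddings

# Using x as the metric here for simplicity; ToMe normally uses the attention keys.
merge, unmerge = bipartite_soft_matching(x, r=16, class_token=True)

x = merge(x, mode="mean")      # merge 16 tokens...
pos = merge(pos, mode="mean")  # ...and merge the position embeddings the same way

print(x.shape, pos.shape)  # both (1, 181, 768): the output token count drops by r
```

Since the merge is just an index-and-reduce over the token dimension, the same merge function can be applied to any per-token tensor, which is why the position embeddings can follow the tokens directly.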