Closed JoaquinChou closed 5 months ago
Hello, I would like to ask if the Token Merging method could be used to accelerate inference on current multi-modal large language models such as LLaVA and VILA?
It should work for the vision encoder out of the box (assuming it uses a ViT), but applying it to the language model is an open research question right now. The problem is that language models use causal attention, so you can't just merge tokens arbitrarily.
Thanks for your answer. After reading the paper carefully, I still have two doubts:
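For the vision-encoder case discussed above, here is a minimal NumPy sketch of ToMe-style bipartite soft matching: split the tokens into two alternating sets, find each token's most similar partner in the other set, and average together the `r` most similar pairs. All function names, shapes, and details here are assumptions for illustration, not the official implementation:

```python
import numpy as np

def bipartite_soft_matching(tokens, r):
    """Merge r token pairs, ToMe-style (illustrative sketch).

    tokens: (N, C) array of token features.
    r: number of tokens to remove.
    Returns a (N - r, C) array of merged tokens.
    """
    # Normalize so dot products are cosine similarities.
    x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    a, b = x[0::2], x[1::2]          # alternate split into sets A and B
    scores = a @ b.T                 # similarity of every A token to every B token
    best_b = scores.argmax(axis=1)   # each A token's best partner in B
    best_score = scores.max(axis=1)

    # Merge the r A-tokens that are most similar to their B partners.
    merge_idx = np.argsort(-best_score)[:r]
    keep_idx = np.setdiff1d(np.arange(len(a)), merge_idx)

    ta, tb = tokens[0::2].copy(), tokens[1::2].copy()
    counts = np.ones(len(tb))
    for i in merge_idx:              # running average of merged features
        j = best_b[i]
        tb[j] = (tb[j] * counts[j] + ta[i]) / (counts[j] + 1)
        counts[j] += 1
    return np.concatenate([ta[keep_idx], tb], axis=0)
```

Because this only needs bidirectional similarity between tokens, it drops straight into a ViT block; in a causal LM, a later token merged into an earlier one would leak future information into past positions, which is the obstacle mentioned above.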