Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference" proposed by Peking University and UC Berkeley.
Hello, and thank you for your wonderful research.
I understand that the number of tokens pruned depends on the value of lambda multiplied by the rank, while the number of recycled tokens is controlled by the hyperparameter tau.
However, Table 1 of the paper shows the number of visual tokens fixed at 192, 128, and 64.
Could you please clarify whether these token counts were hardcoded to select exactly 192, 128, or 64 visual tokens, or whether some other mechanism was used to maintain a fixed token count in these experiments?
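To make the question concrete, here is a minimal sketch of the "hardcoded budget" interpretation I have in mind: instead of letting lambda * rank and tau determine a variable number of surviving tokens, one would simply keep the top-K tokens by importance score. The function name, score source, and token counts below are my assumptions for illustration, not code from this repository.

```python
import numpy as np

def select_fixed_budget(scores: np.ndarray, budget: int) -> np.ndarray:
    """Hypothetical sketch: keep exactly `budget` visual tokens,
    chosen by descending importance score, returned in original order."""
    # indices of the `budget` highest-scoring tokens
    keep = np.argsort(scores)[::-1][:budget]
    # restore the tokens' original sequence order
    return np.sort(keep)

# e.g. 576 visual tokens from a CLIP-ViT encoder, with made-up scores
scores = np.random.rand(576)
kept = select_fixed_budget(scores, budget=192)
```

Is this roughly what was done for the 192/128/64 settings, or does the adaptive lambda * rank / tau rule still run, with its thresholds tuned so the counts happen to land on those values?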
Thank you. Sincerely,