Gumpest/SparseVLMs

Official implementation of the paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference" from Peking University and UC Berkeley.
https://arxiv.org/pdf/2410.04417
Apache License 2.0

Clarification on Fixed Visual Token Counts (192, 128, 64) in Table 1 #10

Open naajeehxe opened 1 week ago

naajeehxe commented 1 week ago

Hello, thanks for your wonderful research.

I understand that the number of pruned tokens depends on lambda multiplied by the rank, while the number of recycled tokens is influenced by the hyperparameter tau.
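To make sure I am reading the method correctly, here is a minimal sketch of my understanding; `attn`, `scores`, `lam`, and `tau` are placeholder names I made up, not identifiers from this repo:

```python
import torch

def my_understanding_of_sparsification(attn: torch.Tensor, scores: torch.Tensor,
                                       lam: float, tau: float):
    # Pruned count scales with lambda times the rank of the
    # text-to-visual attention matrix (my reading of the paper).
    n_pruned = int(lam * torch.linalg.matrix_rank(attn.float()).item())
    # Recycled count depends on how many pruned-token scores
    # pass the threshold tau.
    n_recycled = int((scores > tau).sum().item())
    return n_pruned, n_recycled
```

Both quantities vary with the input, which is why I am unsure how the fixed counts in Table 1 arise.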

But Table 1 of the paper shows the number of visual tokens fixed at 192, 128, and 64.

Could you please clarify whether these token counts were hardcoded to select exactly 192, 128, or 64 visual tokens, or whether another approach was used to maintain a fixed token count in these experiments?

Thank you, sincerely.

Gumpest commented 3 days ago

I appreciate your interest in our work.

  1. The retained tokens are controlled by changing the scaling factor in Formula 8.
  2. This is the equivalent number of tokens, i.e., the average visual-token count over all layers. For example, T = ((L1 - L0) * T0 + (L2 - L1) * T1) / L2, where T0 tokens are kept from layer L0 to layer L1 and T1 tokens from layer L1 to the final layer L2 (see the sketch below).
  3. Therefore, we select exactly 192, 128, or 64 equivalent visual tokens to compare fairly with other methods.
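A minimal sketch of that equivalent-count computation; this is illustrative code that just reproduces the weighted average in the formula above, and the example numbers are hypothetical, not settings from the paper:

```python
def equivalent_token_count(L0: int, L1: int, L2: int, T0: int, T1: int) -> float:
    """Equivalent (layer-averaged) visual-token count.

    T0 tokens are kept from layer L0 to layer L1, T1 tokens from layer L1
    to the final layer L2; the average is taken over all L2 layers,
    exactly as in the formula above.
    """
    return ((L1 - L0) * T0 + (L2 - L1) * T1) / L2

# Hypothetical numbers: prune at layers 2 and 8 of a 32-layer LLM,
# keeping 300 and then 96 visual tokens.
print(equivalent_token_count(L0=2, L1=8, L2=32, T0=300, T1=96))  # 128.25
```

Tuning the scaling factor shifts the per-stage counts (T0, T1) until the averaged value lands on the target budget, e.g. 192, 128, or 64.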