Thanks for your interest!
As you said, VoCo-LLaMA needs to cache the Transformer activations of the VoCo token at every layer (e.g., in LLaMA, 32 hidden states are cached). However, in methods that rely on compression by an external module such as Q-Former, each compressed token fed into the LLM still requires these 32 hidden states to be computed and cached. Therefore it is fair to compare 1 VoCo token with 1 token from other methods.
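To make the accounting concrete, here is a minimal sketch of the per-token KV-cache cost being discussed. This is not the official VoCo-LLaMA code; the layer count, head count, head dimension, and fp16 storage are assumptions for a LLaMA-7B-like model.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_heads=32,
                   head_dim=128, bytes_per_elem=2):
    """Memory for caching the keys and values of `num_tokens` tokens
    across all transformer layers (fp16 assumed)."""
    per_token_per_layer = 2 * num_heads * head_dim * bytes_per_elem  # K and V
    return num_tokens * num_layers * per_token_per_layer

# One VoCo token: its activations are cached in every transformer layer.
voco = kv_cache_bytes(num_tokens=1)

# One Q-Former output token: compressed externally, but once it is fed
# into the LLM it occupies exactly the same per-layer cache.
qformer = kv_cache_bytes(num_tokens=1)

print(voco == qformer)  # True -> comparing 1 VoCo token to 1 token is fair
```

The point of the sketch: the per-layer caching cost is a property of the LLM itself, so it is paid identically by any single token that enters the LLM, regardless of how that token was produced.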
Dear VoCo-LLaMA team,
Thank you for your great work, I really like it.
According to Figure 2b, the information in the vision tokens is compressed into one VoCo token, and that single VoCo token is then used for text generation.
If I understand correctly, at inference time we need to store the activation of the VoCo token in each transformer layer and then use these activations to decode the text. So the visual information is compressed into 'number of transformer layers' VoCo token activations, rather than a single VoCo token activation, is this correct?
Denote the number of transformer layers in VoCo-LLaMA by n. In Section 4, the comparison to Q-Former should then be VoCo-LLaMA with one VoCo token against Q-Former with n tokens. Is this correct?
Thank you for your help.
Best Wishes,
Alex