Yxxxb / VoCo-LLaMA

VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
https://yxxxb.github.io/VoCo-LLaMA-page/
Apache License 2.0

Number of tokens to store visual information #17

Closed: betterze closed this issue 1 week ago

betterze commented 1 month ago

Dear VoCo-LLaMA team,

Thank you for your great work, I really like it.

According to Figure 2b, the vision token information is compressed into a single VoCo token, and that one VoCo token's information is used for text generation.

If I understand correctly, during inference we need to store the activation of the VoCo token at each transformer layer and then use these activations to decode the text. So the visual information is compressed into 'number of transformer layers' VoCo token activations, rather than into a single VoCo token activation. Is this correct?

Denote the number of transformer layers in VoCo-LLaMA as n. In Section 4, should the comparison to Q-Former then be VoCo-LLaMA with one VoCo token versus Q-Former with n tokens? Is this correct?
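To make my understanding concrete, here is a small sketch of what I believe gets cached. It is not your code: it uses a tiny randomly initialised LLaMA-style model from Hugging Face `transformers`, and the token id is just a placeholder standing in for the VoCo token.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny randomly initialised LLaMA-style model, used only to inspect cache shapes.
config = LlamaConfig(hidden_size=64, intermediate_size=128, num_hidden_layers=4,
                     num_attention_heads=4, vocab_size=128)
model = LlamaForCausalLM(config).eval()

voco_ids = torch.tensor([[1]])  # placeholder id standing in for the single VoCo token

with torch.no_grad():
    out = model(voco_ids, use_cache=True)

# One (key, value) pair is cached per transformer layer for the VoCo token,
# i.e. num_hidden_layers entries (32 in LLaMA-7B), not a single vector.
print(len(out.past_key_values))   # 4 here, 32 in LLaMA-7B
key, value = out.past_key_values[0]
print(key.shape)                  # (batch, num_kv_heads, 1, head_dim): one token per layer
```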

Thank you for your help.

Best Wishes,

Alex

Yxxxb commented 1 week ago

Thanks for your interest!

As you said, VoCo-LLaMA needs to cache Transformer activations on a token,(e.g., in LLaMA, 32 hidden states need to be cached). However, in methods that rely on compression by an external module, such as Q-Former, the compressed one token is fed into the LLM still needs to compute these 32 hidden states. Therefore it is fair to make use of 1 VoCo token compared to 1 token in other methods.