number of token to store visual information

Yxxxb / VoCo-LLaMA

VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".

Apache License 2.0

84 stars 4 forks source link

Dear Voco-LLaMa team,

Thank you for your great work, I really like it.

According to Figure 2b, the vision tokens information is compressed into one Voco token, and one Voco token information is used for text generation.

If I understand correctly, in inference, we need to store the activation of the Voco token in each transformer layer, then using them to decode text information. So it compresses the visual information into 'the number of transformer layers' Voco token activation, rather than a single Voco token, is this correct?

Denote the number of transformer layers in Voco-LLAMA is n. In section 4, the comparison to Q-former should be Voco-LLAMA with one Voco token compares to Q-former with n tokens. Is this correct?

Thank you for your help.

Best Wishes,

Alex

Yxxxb / VoCo-LLaMA

number of token to store visual information #17