Closed sguru-sam closed 5 days ago
Question or Issue
I read the interesting blog about omnivision: https://nexa.ai/blogs/omni-vision. I have a question about the 9x token reduction through token compression.
From the architecture and the following write-up:
We developed a reshaping mechanism in the projection stage that transforms image embeddings from [batch_size, 729, hidden_size] to [batch_size, 81, hidden_size*9]
it seems the total number of floating-point values in the visual tokens remains the same. I was wondering whether this token compression method actually reduces inference time. Could you please provide a more detailed answer?
Thanks.
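For concreteness, here is a minimal sketch of the reshaping step quoted above, assuming PyTorch; the tensor shapes follow the blog, but the variable names, hidden sizes, and the final projection layer are illustrative, not the actual Nexa implementation:

```python
import torch

batch_size, num_tokens, hidden_size = 1, 729, 1152  # hidden_size is illustrative

# Image embeddings as produced by the vision encoder: [batch_size, 729, hidden_size]
image_embeds = torch.randn(batch_size, num_tokens, hidden_size)

# Fold 9 neighboring tokens into one: [batch_size, 81, hidden_size * 9].
# The total number of floating-point values is unchanged (729 * H == 81 * 9H),
# but the sequence length seen by the language model drops from 729 to 81.
compressed = image_embeds.reshape(batch_size, num_tokens // 9, hidden_size * 9)

print(image_embeds.numel() == compressed.numel())  # True: same number of floats
print(compressed.shape)                            # torch.Size([1, 81, 10368])

# A projection (e.g. an MLP) then maps hidden_size * 9 down to the decoder's
# embedding dimension, so each of the 81 tokens becomes one decoder-sized token.
decoder_dim = 2048  # illustrative
projector = torch.nn.Linear(hidden_size * 9, decoder_dim)
visual_tokens = projector(compressed)              # [1, 81, decoder_dim]
```

So the observation in the question is correct: the reshape itself preserves the float count, and the saving has to come from the shorter sequence handed to the decoder.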
OS
No response
Python Version
No response
Nexa SDK Version
No response
GPU (if using one)
No response
Hi, it will reduce inference time! You are right that the computation of the visual encoder and the projection stage stays the same. However, you still need the computation in the decoder part, the language backbone. Previously, your language model needed to handle 729 image tokens; now it only handles 81. That greatly reduces the work in the decoder, which is where most of the inference time is spent.
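To make the savings concrete, a rough back-of-the-envelope comparison, assuming standard self-attention in the decoder (quadratic in sequence length) and roughly linear per-token cost in the MLP layers and KV cache; the numbers are illustrative, not measurements:

```python
# Rough prefill cost comparison for the image tokens alone (ignoring text tokens).
tokens_before, tokens_after = 729, 81

# Self-attention scales with the square of the sequence length.
attn_ratio = (tokens_before ** 2) / (tokens_after ** 2)   # 531441 / 6561 = 81x

# Per-token work (MLP layers, KV-cache writes) scales roughly linearly.
linear_ratio = tokens_before / tokens_after               # 9x

print(f"Attention pairs reduced by ~{attn_ratio:.0f}x")   # ~81x
print(f"Per-token decoder work reduced by ~{linear_ratio:.0f}x")  # ~9x
```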