NexaAI / nexa-sdk

Nexa SDK is a comprehensive toolkit for supporting GGML and ONNX models. It supports text generation, image generation, vision-language models (VLM), Audio Language Model, auto-speech-recognition (ASR), and text-to-speech (TTS) capabilities.
https://docs.nexa.ai/
Apache License 2.0

[QUESTION] Omnivision: Does the token compression method result in an inference speedup? #261

Closed · sguru-sam closed this 5 days ago

sguru-sam commented 5 days ago

Question or Issue

I read the interesting blog about Omnivision (https://nexa.ai/blogs/omni-vision) and have one question about the 9x token reduction through token compression.

The architecture diagram and the write-up say: "We developed a reshaping mechanism in the projection stage that transforms image embeddings from [batch_size, 729, hidden_size] to [batch_size, 81, hidden_size*9]." From this, it seems the total number of floating-point values in the visual tokens stays the same.
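
For reference, here is roughly how I picture that reshape (just an illustrative sketch with a placeholder hidden_size, not Nexa's actual code; the exact grouping/ordering of tokens may differ in the real implementation):

```python
import torch

# Illustrative shapes only; hidden_size is a placeholder value.
batch_size, num_tokens, hidden_size = 1, 729, 1152

image_embeds = torch.randn(batch_size, num_tokens, hidden_size)

# Concatenate every 9 consecutive tokens into one: [B, 729, H] -> [B, 81, 9*H].
# The total number of floats (729*H == 81*9*H) is unchanged, as noted above.
compressed = image_embeds.reshape(batch_size, num_tokens // 9, hidden_size * 9)

print(compressed.shape)                             # torch.Size([1, 81, 10368])
print(image_embeds.numel() == compressed.numel())   # True
```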

I was wondering whether this token compression method actually reduces inference time. Could you please elaborate?

Thanks.

OS

No response

Python Version

No response

Nexa SDK Version

No response

GPU (if using one)

No response

alexchen4ai commented 5 days ago

Hi, yes, it will reduce inference time! The computation in the visual encoder and the projection stage stays the same, but you also need the computation in the decoder, i.e. the language backbone. Previously the language model had to handle 729 visual tokens; now it only handles 81. That greatly reduces the decoder-side work, which is also where most of the inference time is spent.
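
To put rough numbers on that (a back-of-the-envelope sketch assuming standard transformer scaling, i.e. self-attention cost grows quadratically with sequence length while MLP/linear layers grow linearly):

```python
# Rough comparison of decoder-side cost for the image tokens only.
# Assumes standard transformer scaling: self-attention ~ O(n^2), MLP ~ O(n).
tokens_before, tokens_after = 729, 81

linear_speedup = tokens_before / tokens_after            # MLP layers, KV-cache size
attention_speedup = (tokens_before / tokens_after) ** 2  # pairwise attention scores

print(f"linear-layer work:    {linear_speedup:.0f}x less")    # 9x
print(f"attention-score work: {attention_speedup:.0f}x less")  # 81x
```

This ignores the text tokens and the unchanged vision-encoder/projection cost, so the end-to-end speedup is smaller than these ratios, but it shows why shrinking the visual sequence seen by the decoder is where the time savings come from.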