NexaAI / nexa-sdk

Nexa SDK is a comprehensive toolkit for supporting GGML and ONNX models. It supports text generation, image generation, vision-language models (VLM), Audio Language Model, auto-speech-recognition (ASR), and text-to-speech (TTS) capabilities.
https://docs.nexa.ai/
Apache License 2.0

[QUESTION] Omnivision: Does the token compression method result in an inference speedup? #261

Closed · sguru-sam closed this 5 days ago

sguru-sam commented 5 days ago

Question or Issue

I read the interesting blog about Omnivision (https://nexa.ai/blogs/omni-vision) and have one question about the 9x token reduction through token compression.

The architecture diagram and the write-up say: "We developed a reshaping mechanism in the projection stage that transforms image embeddings from [batch_size, 729, hidden_size] to [batch_size, 81, hidden_size*9]." From this, it seems the total number of floating-point values in the visual tokens stays the same.
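
For reference, here is roughly how I picture that reshape (just an illustrative sketch with a placeholder hidden_size, not Nexa's actual code; the exact grouping/ordering of tokens may differ in the real implementation):

```python
import torch

# Illustrative shapes only; hidden_size is a placeholder value.
batch_size, num_tokens, hidden_size = 1, 729, 1152

image_embeds = torch.randn(batch_size, num_tokens, hidden_size)

# Concatenate every 9 consecutive tokens into one: [B, 729, H] -> [B, 81, 9*H].
# The total number of floats (729*H == 81*9*H) is unchanged, as noted above.
compressed = image_embeds.reshape(batch_size, num_tokens // 9, hidden_size * 9)

print(compressed.shape)                             # torch.Size([1, 81, 10368])
print(image_embeds.numel() == compressed.numel())   # True
```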

I was wondering whether this token compression method actually reduces inference time. Could you please elaborate?

Thanks.

OS

No response

Python Version

No response

Nexa SDK Version

No response

GPU (if using one)

No response

alexchen4ai commented 5 days ago

Hi, yes, it will reduce inference time! The computation in the visual encoder and the projection stage stays the same, but you also need the computation in the decoder, i.e. the language backbone. Previously the language model had to handle 729 visual tokens; now it only handles 81. That greatly reduces the decoder-side work, which is also where most of the inference time is spent.
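
To put rough numbers on that (a back-of-the-envelope sketch assuming standard transformer scaling, i.e. self-attention cost grows quadratically with sequence length while MLP/linear layers grow linearly):

```python
# Rough comparison of decoder-side cost for the image tokens only.
# Assumes standard transformer scaling: self-attention ~ O(n^2), MLP ~ O(n).
tokens_before, tokens_after = 729, 81

linear_speedup = tokens_before / tokens_after            # MLP layers, KV-cache size
attention_speedup = (tokens_before / tokens_after) ** 2  # pairwise attention scores

print(f"linear-layer work:    {linear_speedup:.0f}x less")    # 9x
print(f"attention-score work: {attention_speedup:.0f}x less")  # 81x
```

This ignores the text tokens and the unchanged vision-encoder/projection cost, so the end-to-end speedup is smaller than these ratios, but it shows why shrinking the visual sequence seen by the decoder is where the time savings come from.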