AIDC-AI / Ovis

A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B
Apache License 2.0

Is SoftMax Computation-Intensive? #8

Closed. wusize closed this issue 2 months ago.

wusize commented 2 months ago

Hi, authors!

I noticed that you apply the softmax function to obtain probabilities over the visual tokens. Is this operation computation-intensive in terms of speed and GPU cost, given that both the number of visual tokens and the codebook size are very large? Alternatively, how about directly predicting vectors that approach the table embeddings, as VQGAN does (i.e., minimizing ||z_q - z||)?
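For concreteness, here is a minimal sketch of the operation I am asking about (PyTorch, with hypothetical dimensions chosen only for illustration, not taken from this repo):

```python
import torch

# Hypothetical sizes, not from the Ovis codebase.
num_visual_tokens = 2048   # visual tokens per image (assumed)
hidden_dim = 1024          # visual embedding width (assumed)
codebook_size = 65536      # visual vocabulary / codebook size (assumed)

z = torch.randn(num_visual_tokens, hidden_dim)      # visual features
codebook = torch.randn(codebook_size, hidden_dim)   # embedding table

# Logits against every codebook entry: one (T x D) @ (D x V) matmul,
# costing O(T * D * V) FLOPs. The softmax itself is only O(T * V),
# since it just exponentiates and normalizes each row of the logits.
logits = z @ codebook.t()                 # (T, V)
probs = torch.softmax(logits, dim=-1)     # (T, V)

# Soft lookup: each visual token becomes a probability-weighted
# mixture of codebook embeddings, keeping everything differentiable.
soft_embed = probs @ codebook             # (T, D)
```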

Many thanks for your impressive work!

runninglsy commented 2 months ago

Thanks for your interest in our work.

Given that the primary computational bottleneck in Transformer models lies in the attention computation, the overhead of the softmax operation is acceptable: softmax is essentially element-wise over the logits, so its cost is dominated by the single matrix multiplication that produces them. Employing softmax instead of a nearest-neighbor lookup also helps avoid inconsistencies between the forward and backward passes.
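To make the forward/backward point concrete, here is a hedged sketch (PyTorch, hypothetical function names) contrasting the two choices; the nearest-neighbor path relies on the straight-through trick used in VQGAN-style quantization, since argmin itself has no gradient:

```python
import torch

def soft_lookup(z, codebook):
    # Softmax path: forward and backward use the same smooth function,
    # so gradients flow to both z and the codebook with no surrogate.
    probs = torch.softmax(z @ codebook.t(), dim=-1)
    return probs @ codebook

def nearest_neighbor_lookup(z, codebook):
    # VQGAN-style path: argmin is non-differentiable, so the backward
    # pass copies gradients straight through (z_q is treated as z).
    # This is the forward/backward inconsistency mentioned above.
    dists = torch.cdist(z, codebook)        # (T, V) pairwise distances
    z_q = codebook[dists.argmin(dim=-1)]    # hard, non-differentiable pick
    return z + (z_q - z).detach()           # straight-through estimator
```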