Closed · wusize closed this 2 months ago
Thanks for your interest in our work.
Given that the primary computational bottleneck in Transformer models is the attention computation, the overhead of the softmax is acceptable: it is essentially an element-wise operation over the logits. Using a softmax instead of a nearest-neighbour lookup also avoids inconsistencies between the forward and backward passes, since the hard argmin of nearest-neighbour quantization is non-differentiable and has to be bypassed with a straight-through gradient.
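A minimal sketch of what this looks like (sizes and variable names are illustrative, not taken from the repository): predicting a distribution over the codebook is one matmul followed by a softmax, so its cost scales as O(N·K) for N visual tokens and codebook size K, compared with O(N²·d) for self-attention over the same tokens; and because cross-entropy differentiates through the same softmax used in the forward pass, the gradients are consistent with it.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes -- not the paper's actual configuration.
N, d, K = 1024, 768, 8192            # visual tokens, hidden dim, codebook size

hidden = torch.randn(N, d, requires_grad=True)   # transformer outputs for visual tokens
codebook = torch.randn(K, d)                     # token-embedding table

# Logits over the codebook: one (N x d) @ (d x K) matmul, then a softmax,
# which is an element-wise exp plus a row-wise normalization -- O(N*K) work.
logits = hidden @ codebook.t()                   # (N, K)
probs = F.softmax(logits, dim=-1)                # (N, K)

# Training differentiates through the very same softmax (inside cross-entropy),
# so the forward and backward passes use the same operation.
targets = torch.randint(0, K, (N,))
loss = F.cross_entropy(logits, targets)
loss.backward()
```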
Hi, authors!
I noticed that you apply a softmax to obtain probabilities over the visual tokens. Is this operation computation-intensive in terms of speed and GPU memory, given that both the number of visual tokens and the codebook size are very large? Alternatively, what about directly predicting vectors that approach the codebook embeddings, as VQGAN does (i.e., minimizing ||z_q - z||)?
Many thanks for your impressive work!
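For contrast, here is a hedged sketch of the VQGAN-style alternative mentioned in the question: nearest-neighbour quantization with a straight-through estimator (illustrative code, not from either repository). The argmin in the forward pass is non-differentiable, so the gradient is simply copied around it, which is the forward/backward inconsistency the reply refers to.

```python
import torch

# Illustrative sizes, matching the sketch above.
N, d, K = 1024, 768, 8192
z = torch.randn(N, d, requires_grad=True)    # encoder outputs
codebook = torch.randn(K, d)                 # codebook embedding table

# Nearest-neighbour quantization: the argmin is non-differentiable.
dists = torch.cdist(z, codebook)             # (N, K) pairwise distances
indices = dists.argmin(dim=-1)               # hard assignment per token
z_q = codebook[indices]                      # (N, d) quantized vectors

# Straight-through estimator: the forward pass uses z_q, but the backward pass
# treats the quantization as identity, so the gradient the encoder receives
# does not correspond to the operation actually applied in the forward pass.
z_q_st = z + (z_q - z).detach()

# VQGAN/VQ-VAE commitment term: minimize ||z_q - z||.
commit_loss = torch.mean((z_q.detach() - z) ** 2)
```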