Open ndhuynh02 opened 1 month ago
I asked this question a while back on Reddit: https://www.reddit.com/r/MachineLearning/comments/1cpqe3h/r_trying_to_understand_a_certain_function_in/
We tried our best to reproduce it by testing different configurations, but nothing really worked well. The paper: https://arxiv.org/abs/2405.14239
As described in the paper, after passing the patch embeddings through the ViT and a decoder D, we get a set of feature vectors (one vector per patch). What confuses me is the online quantizer h() [78] mentioned in the paper. As far as I understand from the DINO paper, these feature vectors are passed through a softmax to produce a distribution, so I imagine h() works in a similar way. However, I don't understand what is being quantized here or how exactly the features are transformed into a distribution. Can anybody help explain this?
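To make the question concrete, here is a minimal sketch of the DINO-style step as I currently understand it: each patch feature is scored against K learned prototype vectors, and a temperature-scaled softmax turns those scores into a distribution over K codes. Everything here is my assumption rather than the paper's definition of h(): the names `patch_distribution` and `hard_token`, the toy prototypes, and the temperature of 0.1 are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def patch_distribution(feature, prototypes, temperature=0.1):
    """Score one patch feature against K prototypes, then apply softmax.

    Assumption: the scores are plain dot products with prototype vectors.
    """
    logits = [sum(f * p for f, p in zip(feature, proto)) for proto in prototypes]
    return softmax(logits, temperature)

def hard_token(distribution):
    """Index of the most likely code, i.e. a hard 'quantized' assignment."""
    return max(range(len(distribution)), key=distribution.__getitem__)

# Toy example: K = 3 prototypes in a 2-dimensional feature space.
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
feature = [0.9, 0.1]  # hypothetical decoder output for one patch

dist = patch_distribution(feature, prototypes)
token = hard_token(dist)
```

If this reading is right, `dist` would be the soft target and `token` the discrete code, but whether h() in the paper produces the soft distribution, the hard index, or something else is exactly what I am unsure about.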