lizhiustc opened this issue 2 years ago
Copy-pasting the email reply here:
Yes, these token losses perform similarly, so we chose the simplest one; to me, that's classification.
Token labels are also a strong supervision signal. To me, contrastive and L2 regression are mostly used for distillation; they behave more like distillation objectives, but token labels can do the same (e.g., in language model distillation). Some other works to look at are wav2vec 2.0 and DINO.
I have two questions.
(1) I notice that in your code https://github.com/airsplay/vokenization/blob/5601b799184ed54414872565f233e22c76f5f6f0/vlm/model.py#L238 you define three loss functions: voken classification, voken regression, and voken contrastive. But you only report voken classification in the paper. Did you find after trying them that voken regression and voken contrastive don't work, or even hurt model performance? Is my guess correct? (Perhaps because image features are quite different from language embeddings.)
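For reference, here is my rough understanding of the three losses as a minimal PyTorch sketch. This is not your actual code; the head names, dimensions (e.g., a 2048-d image feature), and the InfoNCE form I use for the contrastive variant are my own assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed shapes: hidden      — (B, T, hidden_dim) token states from the LM,
#                 vokens      — (B, T) integer voken ids,
#                 voken_feats — (B, T, feat_dim) image features of the vokens.
hidden_dim, feat_dim, num_vokens = 768, 2048, 50000

cls_head = nn.Linear(hidden_dim, num_vokens)  # voken classification head
reg_head = nn.Linear(hidden_dim, feat_dim)    # projection to image-feature space

def voken_classification_loss(hidden, vokens):
    # Cross-entropy over the voken vocabulary, one label per token.
    logits = cls_head(hidden)                      # (B, T, num_vokens)
    return F.cross_entropy(logits.flatten(0, 1), vokens.flatten())

def voken_regression_loss(hidden, voken_feats):
    # L2 regression of the projected token state onto its image feature.
    pred = reg_head(hidden)                        # (B, T, feat_dim)
    return F.mse_loss(pred, voken_feats)

def voken_contrastive_loss(hidden, voken_feats, temperature=0.07):
    # InfoNCE-style: each token state should match its own voken feature
    # against all other voken features in the batch.
    q = F.normalize(reg_head(hidden).flatten(0, 1), dim=-1)  # (B*T, feat_dim)
    k = F.normalize(voken_feats.flatten(0, 1), dim=-1)       # (B*T, feat_dim)
    logits = q @ k.t() / temperature                          # (B*T, B*T)
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)
```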
(2) What's the intuition behind why the voken classification loss improves model performance? My guess is that different words with similar semantics will receive the same voken labels, so the voken classification loss implicitly optimizes their similarity. What is your opinion? Could you share some intuition from your perspective?
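To make my guess concrete, here is a toy check (my own sketch, not from your code): I treat the hidden states of two different words as free parameters standing in for LM outputs, assign them the same voken label, and train them together with the classification head; their cosine similarity tends to increase, since both states are pulled toward the same classifier weight row:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden_dim, num_vokens = 32, 100
cls_head = nn.Linear(hidden_dim, num_vokens)

# Two different "words" (e.g., "cat" and "kitten") assigned the same voken id.
h_cat    = torch.randn(1, hidden_dim, requires_grad=True)
h_kitten = torch.randn(1, hidden_dim, requires_grad=True)
shared_voken = torch.tensor([7])

opt = torch.optim.SGD([h_cat, h_kitten, *cls_head.parameters()], lr=0.1)

print("cosine before:", F.cosine_similarity(h_cat, h_kitten).item())
for _ in range(200):
    loss = (F.cross_entropy(cls_head(h_cat), shared_voken) +
            F.cross_entropy(cls_head(h_kitten), shared_voken))
    opt.zero_grad()
    loss.backward()
    opt.step()
print("cosine after: ", F.cosine_similarity(h_cat, h_kitten).item())
```

Of course, in the real model the hidden states come from the shared LM encoder rather than being free parameters, so this only illustrates the direction of the effect I have in mind.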