VT-NLP / Mocheg

Dataset and Code for Multimodal Fact Checking and Explanation Generation (Mocheg)
Apache License 2.0

Rationale Behind Removing CLS Token #12

Open given131 opened 5 months ago

given131 commented 5 months ago

Hello,

I have a question about how the embeddings are modified inside the MultiModal2 module, right after the output of the CLIPModel is obtained.

It seems that when image evidence exists, the CLS token of the image embedding is removed (https://github.com/VT-NLP/Mocheg/blob/main/verification/model.py#L160), whereas when no text evidence is given, the CLS token of the text embedding is removed (https://github.com/VT-NLP/Mocheg/blob/main/verification/model.py#L167).
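
To make the behavior I am describing concrete, here is a minimal sketch of my reading of it. It is not copied from model.py: the function names, the flags `has_image` and `has_text`, the tensor shapes, and the assumption that the CLS token sits at position 0 of the sequence axis are all mine, and I use NumPy arrays in place of the real CLIP outputs.

```python
import numpy as np


def drop_first_token(seq_emb):
    """Drop position 0 along the sequence axis (assumed to hold the CLS token).

    seq_emb has shape (batch, seq_len, dim); the result has seq_len - 1 positions.
    """
    return seq_emb[:, 1:, :]


def modify_embeddings(image_emb, text_emb, has_image, has_text):
    """Hypothetical restatement of the behavior described above.

    - If image evidence exists, the image CLS token is dropped.
    - If no text evidence is given, the text CLS token is dropped.
    """
    if has_image:
        image_emb = drop_first_token(image_emb)
    if not has_text:
        text_emb = drop_first_token(text_emb)
    return image_emb, text_emb


# Placeholder shapes chosen only for illustration.
image_emb = np.zeros((2, 50, 4))
text_emb = np.zeros((2, 77, 4))

# Both kinds of evidence present: only the image sequence gets shorter.
img, txt = modify_embeddings(image_emb, text_emb, has_image=True, has_text=True)
print(img.shape, txt.shape)
```

Under this reading, when both kinds of evidence are present the image sequence loses one position while the text sequence keeps its full length, which is the asymmetry I am asking about.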

My questions are:

  1. What is the reason for removing the CLS token?
  2. When both image and text evidence exist, why is only one of the two CLS tokens removed?