Rationale Behind Removing CLS Token

Hello,

I have a question about how the embeddings are modified inside the MultiModal2 module, right after the output of the CLIPModel is obtained.

It seems that when image evidence exists, the CLS token of the image embedding is removed (https://github.com/VT-NLP/Mocheg/blob/main/verification/model.py#L160), whereas when no text evidence is given, that of the text embedding is removed (https://github.com/VT-NLP/Mocheg/blob/main/verification/model.py#L167).
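To make the question concrete, here is a minimal sketch of the operation as I understand it. It is not the code from model.py: the helper name `drop_first_token` is hypothetical, and I am assuming the hidden states are laid out as [batch][seq_len][hidden_dim] with the CLS token at sequence index 0.

```python
def drop_first_token(hidden_states):
    """Drop the token at sequence index 0 from every sequence in the batch.

    hidden_states: nested list laid out as [batch][seq_len][hidden_dim].
    Assumption: index 0 holds the CLS token.
    With a PyTorch tensor the equivalent slice would be hidden_states[:, 1:, :].
    """
    return [sequence[1:] for sequence in hidden_states]


# Toy batch: 1 sequence, 3 tokens (index 0 plays the role of CLS), hidden size 2.
toy = [[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]]
trimmed = drop_first_token(toy)
```

After this step each sequence is one token shorter, and the remaining tokens keep their original order.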

My questions are:

What is the reason for removing the CLS token?
When the CLS token is removed, why is only one of the two CLS tokens removed when both image and text evidence exist?
