983632847 / All-in-One

All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

The language is useless. #6

Closed SY-Xuan closed 5 months ago

SY-Xuan commented 5 months ago

In your implementation, the [CLS] token of the BertEmbedding is used. Without any attention operation, the [CLS] token of the BertEmbedding cannot carry any information about the input description. In other words, the language modality is useless in the current implementation. Am I misunderstanding something here?
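To make the point concrete, here is a toy NumPy sketch (not the repo's actual code; the embedding tables and token ids are made up, except that 101 is BERT's standard `[CLS]` id): if the `[CLS]` position is read directly out of the embedding layer with no attention mixing positions, its vector cannot depend on the rest of the sentence.

```python
import numpy as np

# Toy stand-in for BertEmbedding: word + position embedding tables.
rng = np.random.default_rng(0)
vocab_size, hidden = 30522, 16
word_emb = rng.normal(size=(vocab_size, hidden))
pos_emb = rng.normal(size=(512, hidden))

CLS_ID = 101  # BERT's [CLS] token id

def embed(token_ids):
    """Embedding-only 'encoding': word + position embeddings, no attention."""
    ids = np.asarray(token_ids)
    return word_emb[ids] + pos_emb[: len(ids)]

# Two different "descriptions" (arbitrary token ids after [CLS]).
sent_a = [CLS_ID, 2023, 2003, 1037, 3899]
sent_b = [CLS_ID, 1996, 4937, 7771, 2006]

cls_a = embed(sent_a)[0]
cls_b = embed(sent_b)[0]

# With no attention, position 0 only ever sees [CLS] itself,
# so the [CLS] vector is identical for any input description.
print(np.allclose(cls_a, cls_b))  # True
```

The `[CLS]` output only becomes input-dependent after self-attention layers let it attend to the other positions.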

983632847 commented 5 months ago

We just use BertEmbedding as the tokenizer. During the training phase, the tokenizer can be updated, and the mean over the tokens can be used. We also found that inference with only the trained visual branch achieves good results, due to the large amount of ambiguity in language prompts.
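The "mean over the tokens" alternative can be sketched under the same toy setup (a made-up embedding table and token ids, not the repo's code): unlike the attention-free `[CLS]` position, the mean of the word embeddings does vary with the input description.

```python
import numpy as np

# Toy stand-in for BertEmbedding's word embedding table.
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(30522, 16))

def mean_pool(token_ids):
    """Mean of word embeddings over all positions -- input-dependent."""
    return word_emb[np.asarray(token_ids)].mean(axis=0)

# Same two toy "descriptions" with different token ids after [CLS].
a = mean_pool([101, 2023, 2003, 1037, 3899])
b = mean_pool([101, 1996, 4937, 7771, 2006])

print(np.allclose(a, b))  # False -- the pooled vector changes with the input
```

Since the pooled vector is a function of every token, gradients from the tracking loss can also update the embedding table for all tokens in the description, not just `[CLS]`.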

SY-Xuan commented 4 months ago

In your implementation, whatever the input description is, the output at the [CLS] position will be the same. Therefore, I don't think the performance improvement comes from adding the language description.