Closed SY-Xuan closed 5 months ago
We simply use BertEmbedding as the tokenizer. During the training phase, the tokenizer can be updated, and the mean token can be used. We also found that using only the trained visual branch at inference achieves good results, owing to the large amount of ambiguity in language prompts.
In your implementation, whatever the input description is, the output at the [CLS] position will be the same. Therefore, I don't think the performance improvement is related to adding the language description.
In your implementation, the [CLS] token of the BertEmbedding is used. Without any attention operation, the [CLS] token cannot carry any information about the input description. In other words, the language modality is useless in the current implementation. Do I have any misunderstanding here?
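To make the concern concrete, here is a minimal sketch of the issue being raised. It assumes the implementation reduces to a plain token-embedding lookup with no self-attention on top (the table shape, token ids, and prompts below are illustrative, not taken from the repository):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 30522, 32
# Hypothetical stand-in for BertEmbedding: a token-embedding table with
# no self-attention layers applied afterwards.
embedding_table = rng.standard_normal((vocab_size, hidden))

CLS_ID = 101  # BERT's [CLS] token id

def embed(token_ids):
    # Pure table lookup: each position depends only on its own token id.
    return embedding_table[token_ids]

# Two different descriptions (token ids are made up for illustration)
prompt_a = [CLS_ID, 2023, 2003, 1037, 3899]
prompt_b = [CLS_ID, 1037, 4937, 2006, 1996]

cls_a = embed(prompt_a)[0]  # vector at the [CLS] position
cls_b = embed(prompt_b)[0]

# Without attention mixing in the other tokens, the [CLS] vector is
# identical no matter what the description says.
print(np.array_equal(cls_a, cls_b))  # True
```

Under this assumption, the [CLS] vector is the same constant for every prompt, which is exactly why the language branch would contribute nothing at that position; if the implementation instead pools (e.g. means) over all token embeddings, the output does vary with the description.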