JUNJIE99 / VISTA_Evaluation_FineTuning

Evaluation code and datasets for the ACL 2024 paper, VISTA: Visualized Text Embedding for Universal Multi-Modal Retrieval. The original code and model can be accessed at FlagEmbedding.
https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/visual

Question About Frozen Text Encoder #7

Open kimwongyuda opened 2 weeks ago

kimwongyuda commented 2 weeks ago

I am really inspired by your nice work, and thank you for it.

My question is: why is the text encoder frozen during training?

When I fine-tune the VISTA model on another dataset such as M-BEIR, the results without freezing are better than those with a frozen text encoder.

I am just wondering about your intention behind freezing the text encoder.
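
(To be concrete, by "frozen" I mean disabling gradients on the text encoder while the rest of the model trains, roughly like the PyTorch sketch below; `model.text_encoder` is just a placeholder attribute name, not necessarily the one used in the VISTA code.)

```python
import torch

def set_text_encoder_trainable(model: torch.nn.Module, trainable: bool) -> None:
    # Toggle whether the text encoder's weights receive gradient updates.
    # `model.text_encoder` is a placeholder; substitute the actual module name.
    for param in model.text_encoder.parameters():
        param.requires_grad = trainable

# Freeze the text encoder (the VISTA training setup, as I understand it):
# set_text_encoder_trainable(model, trainable=False)
# Unfreeze it (what I did for M-BEIR fine-tuning):
# set_text_encoder_trainable(model, trainable=True)
```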

Thank you.

JUNJIE99 commented 2 weeks ago

Hi,

Thank you for your interest in our work!

The primary motivation behind VISTA is to enhance a pre-trained text encoder with visual capabilities while preserving its strong text retrieval performance. We believe that the quality of text embeddings is crucial for (multimodal) dense retrieval, particularly in tasks involving multimodal document retrieval with significant textual content, such as the WebQA and ReMuQ benchmarks.

In developing VISTA, our main concern has been its generalization capability, particularly its zero-shot performance. We have also observed that for specific tasks, not freezing the text encoder can yield better results. Consequently, in our paper, we have kept the text encoder unfrozen for downstream fine-tuning (Section 4.2). However, we think that fine-tuning the text encoder on specific datasets might compromise VISTA's inherent text retrieval capabilities derived from the BGE model.

Regarding your application scenario: while M-BEIR encompasses various retrieval contexts, it includes the corresponding training data, which makes its downstream tasks in-domain tests. Therefore, I believe it is quite reasonable that performance improves when the text encoder is not frozen.

Lastly, I greatly appreciate your efforts in testing VISTA on M-BEIR. I would love it if you could share your fine-tuning results with me. I am also very curious about VISTA's performance on downstream tasks within M-BEIR.

Thank you!

kimwongyuda commented 2 weeks ago

Thank you! Here are the results. The metric is Recall@10 for Fashion200k and FashionIQ, and Recall@5 for the others.

| Task | VISTA (no FT) | VISTA (FT, frozen) | VISTA (FT, unfrozen) |
| --- | --- | --- | --- |
| VisualNews (T -> I) | 0.0013 | 0.019 | 0.0951 |
| MSCOCO (T -> I) | 0.0059 | 0.4731 | 0.601 |
| Fashion200k (T -> I) | 0.0006 | 0.1001 | 0.1518 |
| WebQA (T -> T) | 0.7572 | 0.758 | 0.8037 |
| EDIS (T -> TI) | 0.2049 | 0.3505 | 0.3993 |
| WebQA (T -> TI) | 0.5257 | 0.6233 | 0.7208 |
| VisualNews (I -> T) | 0.0001 | 0.0218 | 0.1126 |
| MSCOCO (I -> T) | 0.005 | 0.3852 | 0.8388 |
| Fashion200k (I -> T) | 0 | 0.0636 | 0.152 |
| NIGHTS (I -> I) | 0.2212 | 0.3104 | 0.3014 |
| OVEN (TI -> T) | 0.0024 | 0.0056 | 0.3298 |
| InfoSeek (TI -> T) | 0.0009 | 0.0184 | 0.217 |
| FashionIQ (TI -> I) | 0.1556 | 0.0596 | 0.1794 |
| CIRR (TI -> I) | 0.1612 | 0.1597 | 0.3153 |
| OVEN (TI -> TI) | 0.4204 | 0.4625 | 0.5169 |
| InfoSeek (TI -> TI) | 0.3065 | 0.3367 | 0.4259 |
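
(For completeness, here is a minimal sketch of how I compute Recall@k: the fraction of queries for which at least one relevant candidate appears in the top k retrieved results. Function and variable names are illustrative, not taken from the M-BEIR evaluation code.)

```python
from typing import List, Set

def recall_at_k(ranked_ids: List[List[str]], relevant_ids: List[Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant candidate in the top-k results."""
    hits = sum(
        1 for retrieved, relevant in zip(ranked_ids, relevant_ids)
        if any(doc_id in relevant for doc_id in retrieved[:k])
    )
    return hits / len(ranked_ids)

# Example: two queries, only the first has a relevant hit in its top 5 -> Recall@5 = 0.5
print(recall_at_k([["a", "b", "c", "d", "e"], ["x", "y", "z", "w", "v"]],
                  [{"c"}, {"q"}], k=5))
```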

JUNJIE99 commented 2 weeks ago

Many thanks for sharing these results; I really appreciate it!

I have another question: Did you use the instruction method mentioned in UniIR for your fine-tuning?

kimwongyuda commented 2 weeks ago

Yes, I used instructions in both training and evaluation. However, the instructions in UniIR are fixed and static for each task, while the instructions in VISTA are more varied, because they depend on each instance rather than on each task.

I think the definition of an instruction differs between UniIR and VISTA.

So I suspect this difference is the reason VISTA's zero-shot results are not strong in the M-BEIR evaluation with instructions.
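
(To illustrate what I mean, here is a rough sketch of the two instruction schemes as I understand them; the instruction strings and field names are made up for illustration, not taken from either paper.)

```python
# UniIR-style: one fixed instruction per task, shared by every query of that task.
UNIIR_TASK_INSTRUCTIONS = {
    "caption->image": "Find an image that matches the given caption.",
    "question->document": "Retrieve a document that answers the question.",
}

def uniir_query(task: str, query_text: str) -> str:
    # The prompt depends only on the task, so it is identical across instances.
    return f"{UNIIR_TASK_INSTRUCTIONS[task]} {query_text}"

def vista_query(instance: dict) -> str:
    # The prompt is part of the instance itself, so it varies example by example.
    return f"{instance['instruction']} {instance['query_text']}"

print(uniir_query("caption->image", "A dog playing in the snow."))
print(vista_query({"instruction": "Show me a photo taken outdoors that matches this description.",
                   "query_text": "A dog playing in the snow."}))
```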

JUNJIE99 commented 2 weeks ago

Thank you. I would like to confirm once more: Were these results obtained using the complete M-BEIR corpus, as referenced in Table 2 of the UniIR paper?

kimwongyuda commented 2 weeks ago

> Thank you. I would like to confirm once more: Were these results obtained using the complete M-BEIR corpus, as referenced in Table 2 of the UniIR paper?

Yes they were!

JUNJIE99 commented 2 weeks ago

Thank you for your response. It appears that VISTA continues to show outstanding zero-shot performance on M-BEIR compared to UniIR.

Regarding the fine-tuning of VISTA with instructions, I believe it has the potential to achieve even better results due to its early fusion of image and text tokens. This is definitely worth exploring further.

I greatly appreciate the results you've shared and our ongoing discussions. These insights are incredibly valuable to me.

If you have any further questions, I am always open to more discussions.

Thank you for your time.