bronyayang / Law_of_Vision_Representation_in_MLLMs

Official implementation of the Law of Vision Representation in MLLMs
https://arxiv.org/abs/2408.16357

Discussion of the results in Table 5. #3

Closed. yxsysu closed this issue 2 weeks ago.

yxsysu commented 2 weeks ago

Hi, authors. Thanks for your great work; the idea is interesting. I have some questions about the results in Table 5. The table shows that SigLIP is inferior to CLIP in all aspects. In contrast, I found that SigLIP achieves superior performance and outperforms CLIP in the paper "Investigating the Design Space of Visually-Conditioned Language Models" (Figure 6, left).

Is this difference due to the experimental setup, since the vision encoder is trained in their experiments? Does this mean the AC score is limited to the scenario where the vision encoder is frozen?

Thanks for your response.

bronyayang commented 2 weeks ago

Hi,

Thank you for your interest and great questions.

> In contrast, I found that SigLIP achieves superior performance and outperforms CLIP in the paper "Investigating the Design Space of Visually-Conditioned Language Models" (Figure 6, left). Is this difference due to the experimental setup, since the vision encoder is trained in their experiments?

The performance difference is indeed because of the experimental setup. I don't know whether they trained the vision encoder in Figure 6, since they do not mention it in Section 4.2. However, I am sure that they do single-stage training, whereas we do two-stage training as in LLaVA. They say in their Figure 4 caption: "We find that single-stage training produces VLMs that maintain or outperform multi-stage models (orange), saving considerable compute; as a result, we carry this change forward to all future experiments." (By the way, I don't think this is entirely true, because some of their benchmark performance is lower than with two-stage training; they just don't have the compute.)

> Does this mean the AC score is limited to the scenario where the vision encoder is frozen?

Not necessarily. First of all, the case where the vision encoder is trained is outside the scope of our experiments and our assumptions. We believe this is a very interesting follow-up for anyone to try.

Why "not necessarily"? Training vision encoder with LLM changes the weight of vision encoder, thus its AC score will change. So, you need to take the trained vision encoder, then calculate its AC score and fit to see if the law holds. Maybe it will also hold with recomputed AC score. However, this is very tricky, because unfreeze vision encoder can potentially change its role. For example, migrating the alignment function into LLM, or other strange effects.

One thing to note: the important application of the "law of vision representation" is choosing a vision encoder/representation without training a large number of MLLMs (every vision encoder + LLM combination). If you want to consider the scenario where the vision encoder is trained together with the LLM, then you always need to train the vision module with the LLM first, right? So in this scenario the law loses a big part of its motivation/application. I personally believe there is no point in dwelling deeply on this scenario. However, I get this question a lot, so I appreciate you writing it down here.
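
For what it's worth, here is a hypothetical sketch of that intended use, under the same assumed degree-2 fit as above: train and benchmark only a few representations, fit the regression on their AC scores, and rank the remaining candidates by predicted performance instead of training them all. Names and numbers are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# (A, C) scores for every candidate vision representation (placeholder values).
candidates = {
    "clip":          [0.82, 0.61],
    "siglip":        [0.79, 0.58],
    "dinov2":        [0.55, 0.70],
    "clip+dinov2":   [0.88, 0.66],
    "siglip+dinov2": [0.85, 0.69],
    "sd-features":   [0.48, 0.73],
}

# Only a small subset is actually trained with the LLM and benchmarked.
trained = ["clip", "dinov2", "clip+dinov2", "sd-features"]
measured_perf = np.array([61.2, 52.4, 64.1, 50.3])   # placeholder results

# Fit the assumed degree-2 regression on the trained subset only.
poly = PolynomialFeatures(degree=2)
law = LinearRegression().fit(
    poly.fit_transform([candidates[n] for n in trained]), measured_perf
)

# Rank the remaining candidates by predicted performance,
# with no further LLM training required.
remaining = [n for n in candidates if n not in trained]
predictions = law.predict(poly.transform([candidates[n] for n in remaining]))
for name, score in sorted(zip(remaining, predictions), key=lambda x: -x[1]):
    print(f"{name}: predicted {score:.1f}")
```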