cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
https://cambrian-mllm.github.io/
Apache License 2.0

Question about figure 7 #10

Closed digbangbang closed 5 days ago

digbangbang commented 6 days ago
[Screenshot of Figure 7 from the paper]

Hi, wonderful work!

I'm wondering what happened on the Vision-Centric benchmarks between the 5M frozen and 5M unfrozen settings when fine-tuning OpenAI CLIP, because the performance dropped a lot!

Have you analyzed this in your work?

ellisbrown commented 5 days ago

Hi @digbangbang, thanks for your interest in our work!

The full benchmark results are in Appendix Table 14. We can see that with 5M tuning examples, CLIP improved with unfreezing on MMVP and RWQA, but actually got worse on CVBench.

There are many moving parts in training an MLLM, and it's possible that the CLIP unfrozen model did not have a perfect hparam setting (we did not have time/compute to tune every model and used fixed hparams across the model ablations).
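For concreteness, here is a minimal, hypothetical PyTorch sketch of what the frozen vs. unfrozen ablation amounts to: the only switch is whether the vision tower's parameters receive gradients, plus (optionally) a separate learning rate for the backbone. The `ToyMLLM` and `build_optimizer` names are made up for illustration; this is not Cambrian's actual training code.

```python
# Minimal sketch (not Cambrian's actual code) of the freeze/unfreeze ablation:
# the only difference between the two settings is whether the vision tower
# receives gradients, and (optionally) what learning rate it gets.
import torch
import torch.nn as nn


class ToyMLLM(nn.Module):
    """Stand-in for an MLLM: a vision tower, a projector, and an LLM head."""

    def __init__(self, vision_dim=32, llm_dim=64, vocab=100):
        super().__init__()
        self.vision_tower = nn.Linear(vision_dim, vision_dim)  # pretrained CLIP stand-in
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.llm = nn.Linear(llm_dim, vocab)

    def forward(self, pixels):
        return self.llm(self.projector(self.vision_tower(pixels)))


def build_optimizer(model, unfreeze_vision, base_lr=1e-3, vision_lr=1e-4):
    """Freeze or unfreeze the vision tower and build per-group learning rates."""
    for p in model.vision_tower.parameters():
        p.requires_grad = unfreeze_vision

    groups = [{"params": [p for n, p in model.named_parameters()
                          if not n.startswith("vision_tower")], "lr": base_lr}]
    if unfreeze_vision:
        # A smaller backbone LR is a common choice (hypothetical values here);
        # with fixed hparams across ablations, this knob is never re-tuned per model.
        groups.append({"params": model.vision_tower.parameters(), "lr": vision_lr})
    return torch.optim.AdamW(groups)


model = ToyMLLM()
opt = build_optimizer(model, unfreeze_vision=True)
```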

Some speculation: this might also be a result of the downsides of CLIP's training objective, which is known to be relatively poor at relation/ordering tasks^1 of the kind CVBench tests. When you unfreeze the vision backbone in the MLLM setting, you are essentially updating it via a captioning loss (which has some advantages over the contrastive loss^2). It's possible that the CLIP representations are still being reshaped by the extended training and haven't yet stabilized: the model may have been trained long enough to start unlearning the CLIP-style representations (converting them toward captioning-style representations), but not long or well enough to improve again on these types of tasks.
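To make the objective shift concrete, here is a toy sketch (random tensors, not the paper's actual losses) contrasting the two training signals: the symmetric contrastive loss CLIP was pretrained with, versus the next-token captioning cross-entropy that becomes the only gradient source for the vision tower once it is unfrozen.

```python
# Toy contrast between the two objectives discussed above (random tensors,
# not the paper's actual losses): CLIP pretraining pulls matched image/text
# embeddings together, while MLLM fine-tuning backpropagates a next-token
# (captioning) cross-entropy through the same vision features.
import torch
import torch.nn.functional as F

B, D, T, V = 8, 64, 16, 1000                         # batch, embed dim, caption length, vocab
img_feats = torch.randn(B, D, requires_grad=True)    # from the vision tower
txt_feats = torch.randn(B, D)                        # from a text encoder

# 1) Contrastive (CLIP-style) objective: symmetric InfoNCE over the batch.
logits = F.normalize(img_feats, dim=-1) @ F.normalize(txt_feats, dim=-1).T / 0.07
targets = torch.arange(B)
contrastive_loss = (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.T, targets)) / 2

# 2) Captioning objective: the LLM's next-token cross-entropy; when the vision
#    tower is unfrozen, this is the signal that now shapes its features.
token_logits = torch.randn(B, T, V, requires_grad=True)  # toy LLM outputs
caption_ids = torch.randint(0, V, (B, T))
captioning_loss = F.cross_entropy(token_logits.reshape(-1, V), caption_ids.reshape(-1))
```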

[Screenshot of the benchmark results from Appendix Table 14]