cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
https://cambrian-mllm.github.io/
Apache License 2.0

Question about figure 7 #10

Closed digbangbang closed 5 days ago

digbangbang commented 6 days ago
[Screenshot of Figure 7 from the paper]

Hi, wonderful work!

I'm wondering what happened on the Vision-Centric benchmarks between the 5M frozen and 5M unfrozen settings when fine-tuning OpenAI CLIP, because the performance dropped a lot!

Have you analyzed this in your work?

ellisbrown commented 5 days ago

Hi @digbangbang, thanks for your interest in our work!

The full benchmark results are in Appendix Table 14. We can see that with 5M tuning examples, CLIP improved with unfreezing on MMVP and RWQA, but actually got worse on CVBench.

There are many moving parts in training an MLLM, and it's possible that the CLIP unfrozen model did not have a perfect hparam setting (we did not have time/compute to tune every model and used fixed hparams across the model ablations).
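For concreteness, here is a minimal, hypothetical PyTorch sketch of what the frozen vs. unfrozen ablation amounts to: the only switch is whether the vision tower's parameters receive gradients, plus (optionally) a separate learning rate for the backbone. The `ToyMLLM` and `build_optimizer` names are made up for illustration; this is not Cambrian's actual training code.

```python
# Minimal sketch (not Cambrian's actual code) of the freeze/unfreeze ablation:
# the only difference between the two settings is whether the vision tower
# receives gradients, and (optionally) what learning rate it gets.
import torch
import torch.nn as nn


class ToyMLLM(nn.Module):
    """Stand-in for an MLLM: a vision tower, a projector, and an LLM head."""

    def __init__(self, vision_dim=32, llm_dim=64, vocab=100):
        super().__init__()
        self.vision_tower = nn.Linear(vision_dim, vision_dim)  # pretrained CLIP stand-in
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.llm = nn.Linear(llm_dim, vocab)

    def forward(self, pixels):
        return self.llm(self.projector(self.vision_tower(pixels)))


def build_optimizer(model, unfreeze_vision, base_lr=1e-3, vision_lr=1e-4):
    """Freeze or unfreeze the vision tower and build per-group learning rates."""
    for p in model.vision_tower.parameters():
        p.requires_grad = unfreeze_vision

    groups = [{"params": [p for n, p in model.named_parameters()
                          if not n.startswith("vision_tower")], "lr": base_lr}]
    if unfreeze_vision:
        # A smaller backbone LR is a common choice (hypothetical values here);
        # with fixed hparams across ablations, this knob is never re-tuned per model.
        groups.append({"params": model.vision_tower.parameters(), "lr": vision_lr})
    return torch.optim.AdamW(groups)


model = ToyMLLM()
opt = build_optimizer(model, unfreeze_vision=True)
```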

Some speculation: this might also be a result of the downsides of CLIP's training objective, which is known to be relatively poor at relation/ordering tasks^1 of the kind CVBench tests. When you unfreeze the vision backbone in the MLLM setting, you are essentially updating it via a captioning loss (which has some advantages over the contrastive loss^2). It's possible that the CLIP representations are still being reshaped by the extended training and haven't yet stabilized: the model may have been trained long enough to start unlearning the CLIP-style representations (converting them toward captioning-style representations), but not long or well enough to improve again on these types of tasks.
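To make the objective shift concrete, here is a toy sketch (random tensors, not the paper's actual losses) contrasting the two training signals: the symmetric contrastive loss CLIP was pretrained with, versus the next-token captioning cross-entropy that becomes the only gradient source for the vision tower once it is unfrozen.

```python
# Toy contrast between the two objectives discussed above (random tensors,
# not the paper's actual losses): CLIP pretraining pulls matched image/text
# embeddings together, while MLLM fine-tuning backpropagates a next-token
# (captioning) cross-entropy through the same vision features.
import torch
import torch.nn.functional as F

B, D, T, V = 8, 64, 16, 1000                         # batch, embed dim, caption length, vocab
img_feats = torch.randn(B, D, requires_grad=True)    # from the vision tower
txt_feats = torch.randn(B, D)                        # from a text encoder

# 1) Contrastive (CLIP-style) objective: symmetric InfoNCE over the batch.
logits = F.normalize(img_feats, dim=-1) @ F.normalize(txt_feats, dim=-1).T / 0.07
targets = torch.arange(B)
contrastive_loss = (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.T, targets)) / 2

# 2) Captioning objective: the LLM's next-token cross-entropy; when the vision
#    tower is unfrozen, this is the signal that now shapes its features.
token_logits = torch.randn(B, T, V, requires_grad=True)  # toy LLM outputs
caption_ids = torch.randint(0, V, (B, T))
captioning_loss = F.cross_entropy(token_logits.reshape(-1, V), caption_ids.reshape(-1))
```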

[Screenshot of the benchmark results from Appendix Table 14]