cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
https://cambrian-mllm.github.io/
Apache License 2.0

More details about Cambrian-1 #57

Closed LoverLost closed 4 months ago

LoverLost commented 4 months ago

Hello!

This is a truly remarkable contribution to the open-source MLLM community! I have a few questions regarding Cambrian:

  1. The paper suggests that unfreezing the vision encoder could potentially enhance performance. However, in the open-source code, I noticed that `unfreeze_mm_vision_tower` is set to false in both the pretrain and finetune shell scripts. Does this mean that the final Cambrian-1 series models freeze this component? Or is there something I may have overlooked? (A snippet for checking this against a released checkpoint follows this list.)

  2. I also noticed in your configuration that you're using `facebook/dinov2-giant-res378`, whereas the paper mentions DINOv2 ViT-L/14@518. Was the former model trained based on the latter? Additionally, I observed in both the paper and the code that you're using `clip-convnext-XXL-multi-stage@1024`, yet the base model appears to be `CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup`. Does this imply that you modified the resolution and continued training the vision encoder?

  3. Thank you for making the high-quality instruction-tuning data available for training MLLMs. I'm curious whether this dataset includes interleaved text-image data.
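For reference, a minimal way to check how a released checkpoint was configured would be to read the flag out of its published config. This sketch assumes the flag is recorded in `config.json` (I haven't verified that) and uses `nyu-visionx/cambrian-8b`, the Hugging Face repo for the 8B model:

```python
import json

from huggingface_hub import hf_hub_download

# Fetch only the config of the released 8B checkpoint and look up the flag.
# Whether `unfreeze_mm_vision_tower` is actually stored in config.json is an
# assumption based on the training shell scripts.
path = hf_hub_download("nyu-visionx/cambrian-8b", "config.json")
with open(path) as f:
    cfg = json.load(f)
print(cfg.get("unfreeze_mm_vision_tower"))  # None if the key is absent
```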

Looking forward to your response!

tsb0601 commented 4 months ago

Hi!

Thanks for your interest in our work! I'll answer your questions below:

  1. Our ablation studies show that unfreezing the vision encoder enhances performance. However, for the Cambrian-1 models, our TPU/TorchXLA infrastructure imposed many non-trivial impediments to training large-scale unfrozen-vision experiments. So yes, the Cambrian-1 series models are trained with a frozen vision tower; we are working to resolve the issue and train unfrozen versions (see the first sketch after this list).

  2. Yes, we modified the vision encoder's default resolution to train at a larger resolution. We found this particularly helpful for the ConvNeXt models, because convnets handle high-resolution inputs more efficiently (see the second sketch after this list).

  3. No, currently the datasets are all single-image instruction-tuning data.
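To make point 1 concrete, here is a minimal sketch of the freeze/unfreeze toggle. It assumes the LLaVA-style convention of keeping the encoders under `vision_tower`-named modules, which Cambrian builds on; the actual training code is more involved than this.

```python
import torch.nn as nn

def set_vision_towers_trainable(model: nn.Module, trainable: bool) -> int:
    # Flip requires_grad on every parameter that lives under a vision tower.
    # The "vision_tower" substring match is a heuristic borrowed from
    # LLaVA-style codebases, not a guarantee about Cambrian's module names.
    touched = 0
    for name, param in model.named_parameters():
        if "vision_tower" in name:
            param.requires_grad_(trainable)
            touched += 1
    return touched  # sanity check: 0 means the heuristic found nothing

# set_vision_towers_trainable(model, trainable=False)  # frozen, as released
# set_vision_towers_trainable(model, trainable=True)   # the ablation setting
```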
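For point 2, a sketch of why the resolution change is cheap for ConvNeXt: the trunk is fully convolutional with global pooling, so the open_clip ConvNeXt-XXLarge weights (natively 256px) accept a 1024px input without positional-embedding surgery. Note this snippet only runs inference at 1024px; the @1024 model additionally continues training at that resolution.

```python
import torch
import open_clip

# Load the public LAION ConvNeXt-XXLarge CLIP checkpoint (pretrained at 256px).
model, _, _ = open_clip.create_model_and_transforms(
    "convnext_xxlarge", pretrained="laion2b_s34b_b82k_augreg_soup"
)
visual = model.visual.eval()

# A fully convolutional trunk tolerates the larger input directly.
with torch.no_grad():
    feats = visual(torch.randn(1, 3, 1024, 1024))
print(feats.shape)  # image embedding computed from the 1024px input
```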