cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
https://cambrian-mllm.github.io/
Apache License 2.0

Question about Cambrian-Alignment dataset #37

Open HYLcool opened 2 months ago

HYLcool commented 2 months ago

Hi, thanks for your excellent work on MLLMs!

I downloaded the Cambrian-Alignment dataset, and I found there might be something wrong with it.

When I checked the sources of the samples in this dataset, I found that the ALLaVA part comes from the instruction split rather than the caption split used in the MGM dataset. The IDs of these samples all have the prefix "allava_laion_inst" or "allava_vflan_inst", whereas in the MGM dataset the prefixes are "allava_laion_cap" and "allava_vflan_cap".
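For reference, a quick way to check which split is included, assuming the released alignment data is a LLaVA-style JSON list whose records carry an `id` field (the file name below is just a placeholder):

```python
import json
from collections import Counter

# Placeholder path; point this at the released Cambrian-Alignment JSON.
with open("alignment.json") as f:
    records = json.load(f)

# Count samples by ID prefix to see whether the ALLaVA entries come from
# the *_inst (instruction) or *_cap (caption) splits.
prefixes = Counter()
for rec in records:
    sample_id = str(rec.get("id", ""))
    for prefix in ("allava_laion_inst", "allava_vflan_inst",
                   "allava_laion_cap", "allava_vflan_cap"):
        if sample_id.startswith(prefix):
            prefixes[prefix] += 1
            break

print(prefixes)
```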

Besides, I found that all samples of the ALLaVA part in Cambrian-Alignment appear again in Cambrian-10M. So I wonder: are there special considerations for using instruction data in the modality-alignment stage, or is something wrong with the Cambrian-Alignment dataset?
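A similar sketch for the overlap check, assuming both releases are JSON lists of records with an `id` field (file names are placeholders; the loading would need adjusting if either file ships in another format such as JSONL):

```python
import json

def load_ids(path):
    # Assumes a JSON list of records that each carry an "id" field.
    with open(path) as f:
        return {str(rec.get("id", "")) for rec in json.load(f)}

# Placeholder file names for the two releases.
alignment_allava_ids = {i for i in load_ids("alignment.json") if i.startswith("allava")}
cambrian_10m_ids = load_ids("Cambrian10M.json")

# If the ALLaVA part of the alignment data reappears in Cambrian-10M,
# this difference should be empty.
missing = alignment_allava_ids - cambrian_10m_ids
print(len(missing), "ALLaVA alignment ids not found in Cambrian-10M")
```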

Looking forward to your reply~

lezhang7 commented 1 month ago

Also, both llava-pretrain and sbu558k exist in the alignment data; I wonder what the difference between them is.

HYLcool commented 1 month ago

Also, both llava-pretrain and sbu558k exist in the alignment data; I wonder what the difference between them is.

According to the Cambrian paper, my understanding is that Cambrian-Alignment consists of 2 sources (MGM, ShareGPT4V) and 5 subsets (allava and sbu558k from MGM/LLaVA, plus coco, llava_pretrain, and sam from ShareGPT4V). The sbu558k subset from LLaVA was originally used in LLaVA, while the llava_pretrain subset from ShareGPT4V was recaptioned by GPT-4V based on sbu558k. So they are different datasets containing the same images but different captions.
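A rough way to see this breakdown, assuming each record carries an `image` field whose path starts with the subset's folder name (both the file name and the path layout are assumptions about the release):

```python
import json
from collections import Counter

# Placeholder path for the released alignment JSON.
with open("alignment.json") as f:
    records = json.load(f)

# Tally samples by the top-level folder of each image path, which should
# roughly correspond to the subsets described above (allava, sbu558k,
# coco, llava_pretrain, sam).
by_folder = Counter(rec["image"].split("/")[0] for rec in records if rec.get("image"))
print(by_folder.most_common())
```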

My question above is that the allava part from MGM is supposed to be the allava_laion/vflan_cap splits, but the released Cambrian-Alignment dataset contains the allava_laion/vflan_inst splits instead. 🤔

lezhang7 commented 1 month ago

Thanks, that makes sense then.

jihanyang commented 1 month ago

Sorry for the late reply! I just came across this issue and found that our released version mistakenly used the instruction-following subset of ALLaVA instead of the caption subset. We will fix this and upload the correct JSON file soon. Thanks for pointing that out!

JianbangZ commented 1 month ago

Sorry for the late reply! I just came across this issue and found that our released version mistakenly used the instruction-following subset of ALLaVA instead of the caption subset. We will fix this and upload the correct JSON file soon. Thanks for pointing that out!

When do you plan to release the GPU training code? Thanks