cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
https://cambrian-mllm.github.io/
Apache License 2.0

Cambrian Data: Sources unclear in the JSONs #73

Open kushalkafle opened 3 months ago

kushalkafle commented 3 months ago

Hi there! Thanks a lot for the great work! This will save many researchers countless hours in data prep.

However, I noticed several issues with how the "source" information is assigned in the existing JSONs. As the paper points out, mixing the data is one of the key challenges, and without accurate source information other users cannot build further on top of the Cambrian data. E.g., if new sources of data become available, we may want to rebalance Cambrian. Would you be willing to update the JSONs with detailed source information?

Here are a few examples, but I fear there might be more.

I (and I'm sure many others) would be very grateful if the true source of each sample could be provided in an updated JSON :)

PS: here is the list of all unique sources that are actually found in the 7M data. You can clearly see that many categories are missing and some have an incorrect number of samples.

['clean_llava_instruct_150k_llavar_20k.json',
 'geo170k.json',
 'orca_math_200k.json',
 'lvis_instruct4v_220k.json',
 'lnqa_302k.json',
 'idefics375k.json',
 'vizwiz_20k.json',
 'qalign_200k.json',
 'wizardlm_143k.json',
 'allava-vflan-200k.json',
 'ai2d_15k.json',
 'sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json',
 'clevr_700k.json',
 'oodvqa_8k.json',
 'mathinstruct_262k.json',
 'code_feedback_66k.json',
 'scienceqa_12k.json',
 'random_3rd_dvqa_2325k.json',
 'laion_gpt4v_11k.json',
 'idk_11k.json',
 'synthdog_500k_modified.json',
 'screenqa_79k.json',
 'orca_994k.json',
 'arxivqa_100k.json',
 'docvqa_39k.json',
 'design2code_0k.json',
 'mathvision_3k.json',
 'chartqa_28k.json',
 'alfworldgpt_45k.json',
 'gpt77k.json',
 'sketchyvqa_8k.json',
 'allava-laion-500k.json',
 'q-instruct_200k.json',
 'filtered_data_engine_161k.json',
 'tallyqa_250k.json',
 'pathvqa_32k.json']
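
For anyone who wants to check the per-source sample counts themselves, here is a minimal sketch of one way to tally them. It is not necessarily how the list above was produced. It assumes each sample is a dict with a top-level `source` key (the key name follows the wording of this issue); the `<missing>` label, the helper names, and the toy records are hypothetical and only there for illustration.

```python
import json
from collections import Counter


def load_records(path):
    """Load samples from a .jsonl file (one JSON object per line) or a .json array."""
    with open(path, encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            return [json.loads(line) for line in f if line.strip()]
        return json.load(f)


def count_sources(records, key="source"):
    """Count samples per source; samples without the key go under '<missing>'."""
    # Assumption: the source is stored as a top-level string field named `key`.
    return Counter(rec.get(key, "<missing>") for rec in records)


# Toy records standing in for the real data (hypothetical values).
toy = [
    {"id": 0, "source": "geo170k.json"},
    {"id": 1, "source": "geo170k.json"},
    {"id": 2, "source": "ai2d_15k.json"},
    {"id": 3},  # no source recorded
]

counts = count_sources(toy)
for name, n in counts.most_common():
    print(f"{n:>8}  {name}")
```

On the real data, `count_sources(load_records(path))` with `path` pointing at the downloaded annotation file would give the counts to compare against the sizes implied by the file names.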