Cambrian Data: Sources unclear in the JSONs

Hi there! Thanks a lot for the great work! This will save many researchers countless hours in data prep.

However, I noticed several issues with how the "source" field is assigned in the existing JSONs. As the paper points out, mixing the data is one of the key challenges. Without accurate source information, other users cannot easily build on top of the Cambrian data. E.g., if new sources of data become available, we may want to rebalance Cambrian. I was wondering if you would be willing to update the JSONs with detailed source information.

Here are a few examples, but I fear there might be more.

Cambrian7M.json contains far fewer unique sources than the paper lists. E.g., refcoco, vqa, etc. never appear in the "source" field of the JSONs. They seem to have been consolidated under some other "source" category, which makes it difficult (if not impossible) to trace them back to their origin. The paper mentions ~68 sources, but the "source" field in the JSON has just 38 unique values.
Some "sources" contain far more data than the paper reports. I suspect this is related to the point above, i.e., they include data from other sources that is either incorrectly named or named after the processing applied rather than its true source. E.g., the source "sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json" has 665058 samples, whereas the actual ShareGPT data is said to be just 40K.
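To illustrate the second point, here is a minimal sketch that pulls the size hints out of a source name. It assumes (my guess, not something the paper or repo confirms) that every `<digits>k` token in a name is a nominal sample count in thousands; `size_tokens` is a hypothetical helper name.

```python
import re


def size_tokens(source_name: str) -> list[int]:
    """Return every '<digits>k' token in a source name as a sample count.

    Assumption: a token such as '665k' means roughly 665,000 samples.
    """
    return [int(n) * 1000 for n in re.findall(r"(\d+)k", source_name)]


name = "sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json"
print(size_tokens(name))
```

Under that assumption, the 665058 samples observed for this source line up with the `mix665k` token rather than with the 40K ShareGPT figure, which is consistent with the name describing a mixture rather than a single source.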

I (and I'm sure many others) would be very grateful if the true source of each sample could be provided in an updated json :)

PS: here is the list of all unique sources that are actually found in the 7M data. You can clearly see that many categories are missing and some have an incorrect number of samples.

['clean_llava_instruct_150k_llavar_20k.json',
 'geo170k.json',
 'orca_math_200k.json',
 'lvis_instruct4v_220k.json',
 'lnqa_302k.json',
 'idefics375k.json',
 'vizwiz_20k.json',
 'qalign_200k.json',
 'wizardlm_143k.json',
 'allava-vflan-200k.json',
 'ai2d_15k.json',
 'sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json',
 'clevr_700k.json',
 'oodvqa_8k.json',
 'mathinstruct_262k.json',
 'code_feedback_66k.json',
 'scienceqa_12k.json',
 'random_3rd_dvqa_2325k.json',
 'laion_gpt4v_11k.json',
 'idk_11k.json',
 'synthdog_500k_modified.json',
 'screenqa_79k.json',
 'orca_994k.json',
 'arxivqa_100k.json',
 'docvqa_39k.json',
 'design2code_0k.json',
 'mathvision_3k.json',
 'chartqa_28k.json',
 'alfworldgpt_45k.json',
 'gpt77k.json',
 'sketchyvqa_8k.json',
 'allava-laion-500k.json',
 'q-instruct_200k.json',
 'filtered_data_engine_161k.json',
 'tallyqa_250k.json',
 'pathvqa_32k.json']
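
For anyone who wants to reproduce this, a minimal sketch of the tally follows. It assumes Cambrian7M.json is a single JSON array of records, each carrying a "source" key; the function names and the "<missing>" placeholder are mine and purely illustrative.

```python
import json
from collections import Counter


def load_records(path):
    """Load the whole JSON file into memory (assumed to be a list of dicts)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_sources(records):
    """Count samples per 'source' value; records without one go to '<missing>'."""
    return Counter(r.get("source", "<missing>") for r in records)


# Usage (the path is an assumption about where the file was downloaded):
# counts = count_sources(load_records("Cambrian7M.json"))
# for source, n in counts.most_common():
#     print(f"{n:>9}  {source}")
```

Note that `json.load` reads the entire file at once, so this needs enough RAM to hold all ~7M records.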
