cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
https://cambrian-mllm.github.io/
Apache License 2.0
1.4k stars, 88 forks

Can you release the 161K science-related data separately? #14

Closed: Weiyun1025 closed this issue 5 days ago

Weiyun1025 commented 5 days ago

Thank you for your awesome work!

I noticed that you have released the 10M instruction tuning dataset. Could you please release the 161K science-related data mentioned in the paper separately? Alternatively, could you provide guidance on how to filter the 161K data out of the larger dataset?

ellisbrown commented 5 days ago

Thanks for your interest @Weiyun1025!

We don't currently plan to release the science-related data separately. However, we will make sure the source of each data point is accessible so that the data can be filtered easily!

We'll follow up when this is done. cc @tsb0601

tsb0601 commented 5 days ago

Hi @Weiyun1025! We actually already have the data engine data separated out. If you want to use the data engine data on its own, check out the JSONL file here: https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/jsons/data_engine_161k.jsonl
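For readers who want to inspect that file once it is downloaded, here is a minimal sketch of loading a JSON Lines file in Python. It assumes only that each non-empty line is one JSON object; the helper name `read_jsonl` is made up for illustration, and no field names are assumed because the record schema is not described in this thread.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Load a JSON Lines file into a list, one parsed object per non-empty line."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Calling `read_jsonl("data_engine_161k.jsonl")` on a local copy and printing `len(records)` and the keys of `records[0]` should be enough to see how many entries there are and which fields they carry.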

The images are split into parts, from https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/data_engine.tar.gz_part1 to https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/data_engine.tar.gz_part13

Please download all the parts, then merge and extract the tars with these scripts:
merge: https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/merge_tars.py
extract: https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/extract.py
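As a rough illustration of the download, merge, and extract steps, here is a sketch in Python. It is not the repo's `merge_tars.py` or `extract.py`, which remain the reference. It assumes the parts are plain byte-wise splits of one `data_engine.tar.gz`, so concatenating them in order restores the archive, and that they are numbered 1 through 13 without gaps. The helper names (`part_names`, `download_parts`, `merge_parts`, `extract_archive`) are made up for illustration. `hf_hub_download` comes from the `huggingface_hub` package and is imported lazily, so the rest of the code works without it.

```python
import shutil
import tarfile
from pathlib import Path

REPO_ID = "nyu-visionx/Cambrian-10M"
NUM_PARTS = 13  # assumption: part1 .. part13 with no gaps, per the links above


def part_names(prefix="data_engine.tar.gz", num_parts=NUM_PARTS):
    """File names of the split archive, in merge order."""
    return [f"{prefix}_part{i}" for i in range(1, num_parts + 1)]


def download_parts(dest_dir):
    """Fetch every part from the dataset repo (requires huggingface_hub)."""
    from huggingface_hub import hf_hub_download  # lazy import

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name,
                        repo_type="dataset", local_dir=dest_dir)
        for name in part_names()
    ]


def merge_parts(part_paths, merged_path):
    """Concatenate the parts in order (assumes a plain byte-wise split)."""
    merged_path = Path(merged_path)
    with merged_path.open("wb") as out:
        for part in part_paths:
            with Path(part).open("rb") as src:
                shutil.copyfileobj(src, out)
    return merged_path


def extract_archive(archive_path, dest_dir):
    """Unpack the merged .tar.gz into dest_dir, creating it if needed."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir
```

A typical run would be `parts = download_parts("downloads")`, then `merge_parts(parts, "data_engine.tar.gz")`, then `extract_archive("data_engine.tar.gz", "data_engine")`. If extraction fails, the byte-wise split assumption is probably wrong, and the provided scripts should be used instead.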