Closed Weiyun1025 closed 5 days ago
Thanks for your interest @Weiyun1025!
We don't currently plan to release the science-related data separately. However, we will make sure the source of each data point is accessible so that the data can be filtered easily!
We'll follow up when this is done. cc @tsb0601
Hi @Weiyun1025! We actually already have the data engine data available separately. If you want to use only the data engine portion of our project, check out the JSONL file here: https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/jsons/data_engine_161k.jsonl
The images are split into multiple parts, from https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/data_engine.tar.gz_part1 to https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/data_engine.tar.gz_part13. Please download all of the parts, then merge them with https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/merge_tars.py and extract the merged archive with https://huggingface.co/datasets/nyu-visionx/Cambrian-10M/blob/main/extract.py
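For anyone who wants to see the download, merge, and extract steps in one place, here is a minimal sketch. It is not the repo's `merge_tars.py` or `extract.py`. It assumes the parts are plain byte-wise chunks of a single `tar.gz` (the usual result of `split`), and the helper names `part_index`, `merge_parts`, `extract_archive`, and `download_parts` are hypothetical. The download helper assumes `huggingface_hub` is installed and that the parts are numbered 1 through 13, as in the links above.

```python
import re
import shutil
import tarfile
from pathlib import Path


def part_index(path):
    """Return the numeric suffix of a '..._partN' file name (hypothetical helper)."""
    match = re.search(r"_part(\d+)$", Path(path).name)
    if match is None:
        raise ValueError(f"not a part file: {path}")
    return int(match.group(1))


def merge_parts(part_paths, merged_path):
    """Concatenate the parts in numeric order (part2 before part10).

    Assumption: each part is a raw byte chunk of one tar.gz archive.
    """
    with open(merged_path, "wb") as out:
        for part in sorted(part_paths, key=part_index):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(merged_path)


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return dest


def download_parts(local_dir, n_parts=13):
    """Download the parts from the Hub (assumes huggingface_hub is installed)."""
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(
            repo_id="nyu-visionx/Cambrian-10M",
            filename=f"data_engine.tar.gz_part{i}",
            repo_type="dataset",
            local_dir=local_dir,
        )
        for i in range(1, n_parts + 1)
    ]
```

Typical use would be `parts = download_parts("cambrian")`, then `merge_parts(parts, "data_engine.tar.gz")`, then `extract_archive("data_engine.tar.gz", "data_engine")`. Sorting by the numeric suffix matters because a plain string sort would place `part10` before `part2` and corrupt the merged archive.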
Thank you for your awesome work!
I notice that you have released the 10M instruction tuning dataset. Could you please release the 161K science-related data mentioned in the paper separately? Alternatively, could you provide guidance on how to filter out the 161K data from the larger dataset?