RunpeiDong / DreamLLM

[ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation
https://dreamllm.github.io/
Apache License 2.0

How to train with laion400m data in Stage1? #14

Closed pkulwj1994 closed 5 months ago

pkulwj1994 commented 6 months ago

Hi Runpei,

Many thanks for your work. I am trying to run the stage-1 training, but I find the LAION-400M data setup a bit confusing. My question is how to use the LAION-400M data for training; could you please give clear instructions? Thank you!

The dataset definition from the source code is shown below. I don't know where to get the `data/resources/laion400m_origin20m_shard_list.json` file.

Source code:

```python
L(WebDatasetInfo)(
    name="laion400m_orig",
    description="The length and width of the image are the original size, but only 20M was downloaded.",
    dataset_type=DatasetType.ImageTextPair,
    cls=UnifiedITPairWebdataset,
    approx_size="20M",
    shard_list_path="data/resources/laion400m_origin20m_shard_list.json",
),
```
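For anyone hitting the same problem: the shard list file does not appear to be shipped with the repo, so it presumably has to be generated from your own downloaded LAION-400M webdataset shards. Below is a minimal sketch, assuming the JSON is simply a list of `.tar` shard paths (the exact schema is not confirmed in this thread) and that the shards were produced by a tool such as img2dataset; `build_shard_list` is a hypothetical helper, not part of DreamLLM.

```python
# Hypothetical helper -- not part of the DreamLLM repo.
# Assumes the shard list JSON is a plain list of webdataset .tar shard paths.
import glob
import json


def build_shard_list(shard_dir: str, output_path: str) -> None:
    """Collect all webdataset .tar shards under shard_dir into a JSON list."""
    shards = sorted(glob.glob(f"{shard_dir}/*.tar"))
    with open(output_path, "w") as f:
        json.dump(shards, f, indent=2)


build_shard_list(
    "/path/to/laion400m/shards",  # wherever your downloaded .tar shards live
    "data/resources/laion400m_origin20m_shard_list.json",
)
```

The resulting JSON would then be referenced via `shard_list_path` in the dataset definition above.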

Best wishes.

RunpeiDong commented 5 months ago

Hi @pkulwj1994,

Please refer to this issue.

ybbwcwaps commented 1 month ago

How should I prepare the "gqa" dataset used in "projects/dreamllm/configs/stage1/vicuna11_7b_llavapretran_comprehension_only.py"? I see "gqa_sft_train_short_filtered.json", and its approx_size is 13532530. How do I prepare this file?

I have downloaded the raw files from gqa_download.