pkulwj1994 closed this issue 5 months ago
Hi @pkulwj1994,
Please refer to this issue
How do I prepare the "gqa" dataset used in "projects/dreamllm/configs/stage1/vicuna11_7b_llavapretran_comprehension_only.py"? I see "gqa_sft_train_short_filtered.json" listed with an approx_size of 13532530. How should this file be prepared?
I have downloaded the raw files from gqa_download.
Hi Runpei,
Thank you very much for your work. I am trying to test the stage-1 training, but I find the Laion400m data setup a little confusing. My question is how to use the Laion400m data for training. Could you please give clear instructions? Thank you!
The original dataset definition is shown below. I don't know where to get the "data/resources/laion400m_origin20m_shard_list.json" file.
Source code:
    L(WebDatasetInfo)(
        name="laion400m_orig",
        description="The length and width of the image are the original size, but only 20M was downloaded.",
        dataset_type=DatasetType.ImageTextPair,
        cls=UnifiedITPairWebdataset,
        approx_size="20M",
        shard_list_path="data/resources/laion400m_origin20m_shard_list.json",
    ),
Best wishes.
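For anyone hitting the same missing shard list file, here is a minimal sketch of how one might be generated from locally downloaded webdataset shards. This is an assumption, not the repository's confirmed method: the thread does not show the schema that UnifiedITPairWebdataset expects, so the sketch assumes a flat JSON list of shard paths. The helper name build_shard_list and the directory "data/laion400m" are hypothetical; check the loader code before relying on this format.

```python
import json
from pathlib import Path
from typing import List


def build_shard_list(shard_dir: str, output_path: str, pattern: str = "*.tar") -> List[str]:
    """Collect shard files under shard_dir and write their paths as a JSON list.

    ASSUMPTION: the shard list is a flat JSON array of path strings.
    The real format expected by the dataset class is not shown in this thread.
    """
    # Search recursively and sort so the shard order is reproducible.
    shards = sorted(str(p) for p in Path(shard_dir).rglob(pattern))
    if not shards:
        raise FileNotFoundError(f"no shards matching {pattern!r} under {shard_dir}")

    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create data/resources/ if needed
    out.write_text(json.dumps(shards, indent=2))
    return shards
```

A possible call, with a hypothetical download directory, would be build_shard_list("data/laion400m", "data/resources/laion400m_origin20m_shard_list.json"). If the loader turns out to expect a different structure (for example, per-shard sample counts), only the json.dumps line needs to change.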