Could you tell me where these datasets can be downloaded? I noticed the paper says: "We filtered the samples in these datasets based on image resolution, aspect ratio, and visual-textual similarity. We randomly place images or text at the forefront, in order to achieve the generation of captions based on images and vice versa."
If possible, would you consider open-sourcing the training data? Thank you very much!
Hello, and thank you for open-sourcing this excellent work! I have a question about the data_dir: entry in SEED/MultiModalLLM/configs/data/caption_torchdata_preprocess.yaml.