Downloading cc3m with some wrong

NExT-GPT / NExT-GPT

Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model

https://next-gpt.github.io/

BSD 3-Clause "New" or "Revised" License

Downloading cc3m with some wrong #45

Open 1190300611 opened 1 year ago

1190300611 commented 1 year ago

When downloading the cc3m dataset, an error is constantly displayed: 'Field "caption" does not exist in table schema'.

After reviewing the img2dataset documentation, I found that a header row has to be added to the TSV file before downloading. sed is a standard Unix tool and does not need to be installed with pip; only img2dataset does:

pip install img2dataset

sed -i '1s/^/caption\turl\n/' Train_GCC-training.tsv

img2dataset --url_list Train_GCC-training.tsv --input_format "tsv" \
  --url_col "url" --caption_col "caption" --output_format webdataset \
  --output_folder cc3m --processes_count 16 --thread_count 64 --image_size 256 \
  --enable_wandb True
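
For readers without GNU sed (for example on macOS or Windows), the header step can be sketched in Python. This is only a minimal sketch: the function name `ensure_header` is hypothetical, and it assumes the TSV is UTF-8 text with the caption in the first column and the URL in the second, as the `caption\turl` header used here implies.

```python
import os
import shutil
import tempfile

# Header row expected by --caption_col "caption" and --url_col "url".
HEADER = "caption\turl\n"


def ensure_header(tsv_path, header=HEADER):
    """Prepend the header row to the TSV unless it is already there.

    Returns True if the file was rewritten, False if it was left alone.
    """
    target_dir = os.path.dirname(os.path.abspath(tsv_path))
    with open(tsv_path, "r", encoding="utf-8", newline="") as src:
        first_line = src.readline()
        if first_line.rstrip("\r\n") == header.rstrip("\n"):
            return False  # header already present, nothing to do
        # Stream into a temporary file so the large TSV is never held in memory.
        tmp_fd, tmp_path = tempfile.mkstemp(dir=target_dir)
        with os.fdopen(tmp_fd, "w", encoding="utf-8", newline="") as dst:
            dst.write(header)
            dst.write(first_line)
            shutil.copyfileobj(src, dst)
    # Swap the rewritten file into place once both handles are closed.
    os.replace(tmp_path, tsv_path)
    return True
```

Calling `ensure_header("Train_GCC-training.tsv")` would stand in for the sed line. Unlike the sed command, running it a second time does not add a duplicate header row.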

1920993165 commented 1 year ago

Yes, but my results do not look like the demo: the output folder does not contain any image files. These are the downloaded files. Can anyone tell me whether this is right?

ChocoWu commented 1 year ago

Hi, this is right. The downloaded images are stored in a tar file. And the *_stats.json file provides information about the download status, including the total number of images, the number successfully downloaded, and the number of failures.
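
To check this locally, one rough way is to list what a shard contains and read the matching stats file. The sketch below uses only the Python standard library. The helper names are hypothetical, and the stats key names (`count`, `successes`, `failed_to_download`, `failed_to_resize`) are an assumption about what img2dataset writes, so missing keys simply come back as `None`.

```python
import json
import os
import tarfile
from collections import defaultdict


def list_samples(tar_path):
    """Group the members of one shard by sample key (file name without extension)."""
    samples = defaultdict(dict)
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            key, ext = os.path.splitext(member.name)
            samples[key][ext] = member.name
    return dict(samples)


def read_stats(stats_path):
    """Return the download counters from a *_stats.json file."""
    with open(stats_path, encoding="utf-8") as f:
        stats = json.load(f)
    # Assumed key names; .get() keeps this from failing if they differ.
    wanted = ("count", "successes", "failed_to_download", "failed_to_resize")
    return {name: stats.get(name) for name in wanted}
```

For a shard such as `cc3m/00000.tar`, `list_samples` should show each image next to its caption file, which confirms that the images are inside the tar rather than missing.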

zxy1123 commented 8 months ago

So can you tell me how to convert the webdataset output to your data format? Thanks.
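
The thread does not say what the target format is, so the following is only a sketch of one possible conversion, not the maintainers' method. It unpacks a shard into an image folder plus a JSON list of image/caption pairs. The function name `unpack_shard` and the output layout (`image_name` and `caption` fields) are hypothetical and would need to be adapted to whatever the NExT-GPT data loader actually expects.

```python
import json
import os
import tarfile

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")


def unpack_shard(tar_path, image_dir, annotation_path):
    """Extract images from one shard and write a JSON list of captions.

    The record layout is a hypothetical example, not a confirmed format.
    """
    os.makedirs(image_dir, exist_ok=True)
    captions, images = {}, {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # basename() keeps extracted files inside image_dir.
            key, ext = os.path.splitext(os.path.basename(member.name))
            data = tar.extractfile(member).read()
            if ext == ".txt":
                captions[key] = data.decode("utf-8").strip()
            elif ext.lower() in IMAGE_EXTENSIONS:
                file_name = key + ext
                with open(os.path.join(image_dir, file_name), "wb") as out:
                    out.write(data)
                images[key] = file_name
    # Keep only samples that have both an image and a caption.
    records = [
        {"image_name": images[key], "caption": captions[key]}
        for key in sorted(images)
        if key in captions
    ]
    with open(annotation_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records
```

Running this once per `.tar` file and concatenating the returned lists would give a single annotation file for the whole download.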

1903812532 commented 1 month ago

In reply to the original post above: Hello, may I ask which version of img2dataset you are using? I am on version 1.0.1 and cannot get the download to work.

ChocoWu commented 1 month ago

I used img2dataset version 1.45.0, and it works fine.

1903812532 commented 1 month ago

Thank you.