dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

Data prepare #2

Closed liuhl-source closed 3 years ago

liuhl-source commented 3 years ago

When I try to train the model, there are some problems with the DataLoader. I get many errors such as 'Error while read file idx 433 in conceptual_caption_val_0 -> cannot identify image file <_io.BytesIO object at 0x7f36766d9bd0>'.
Many images cannot be loaded, and I don't know why. Do you have any suggestions? Or could you share the scripts for downloading the GCC and SBU datasets? Thank you very much! :)

dandelin commented 3 years ago

I guess the image files were corrupted in the first place. It also happened to me, so I made the dataset fall back to another random index whenever such an error occurs. However, considering the size of the dataset (several million images), I remember the number of corrupted files was fairly small.
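The retry-on-corruption idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual ViLT dataset code; `load_image` is a hypothetical loader that raises on corrupted files:

```python
import random

class RobustDataset:
    """Dataset wrapper that retries a random other index when loading fails."""

    def __init__(self, samples, load_image):
        self.samples = samples
        self.load_image = load_image  # hypothetical loader; raises on bad files

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        for _ in range(10):  # bounded number of retries
            try:
                return self.load_image(self.samples[index])
            except Exception:
                # Corrupted file: fall back to a random other index.
                index = random.randint(0, len(self.samples) - 1)
        raise RuntimeError("too many corrupted samples in a row")
```

With only a handful of corrupted files among millions, the random fallback almost always succeeds on the first retry.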

FYI, I used aria2c to download the images, feeding it the image URL list directly: $ aria2c -i uris.txt

fawazsammani commented 3 years ago

This is how I downloaded SBU (it took a long time: several days for the complete dataset of 1M images).

import os
import shutil

import requests

if not os.path.isdir('images'):
    os.mkdir('images')

num_images_to_download = 1_000_000  # SBU has ~1M captioned photos

# Load the URL and caption lists shipped with SBU (one entry per line).
with open('sbu/SBU_captioned_photo_dataset_urls.txt') as f:
    urls = [line.strip() for line in f]

with open('sbu/SBU_captioned_photo_dataset_captions.txt') as f:
    captions = [line.strip() for line in f]

img_id = 0
filtered_captions = []

for url, caption in zip(urls, captions):

    img_id += 1
    if img_id > num_images_to_download:
        break

    try:
        response = requests.get(url, stream=True, timeout=10)
    except requests.RequestException:
        continue  # skip unreachable hosts instead of crashing

    if not response.ok:
        continue  # skip 404s and other HTTP failures

    # SBU photos are JPEGs, so save with a .jpg extension.
    with open('images/' + str(img_id) + '.jpg', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    filtered_captions.append(caption)

# Keep only the captions whose images were successfully downloaded.
with open('filtered_sbu_caps.txt', 'w') as f:
    for c in filtered_captions:
        f.write(c + '\n')