I guess the image files were corrupted in the first place. It also happened to me, so I made the dataset fall back to another random index whenever such an error occurs (a sketch of that fallback is below). However, considering the size of the dataset (several million images), I remember that the number of corrupted files was not that high.
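In case it helps, here is a minimal sketch of that fallback, assuming a PyTorch-style Dataset; the class name and the way samples are stored are placeholders, not the actual code from this repo:

import random
from io import BytesIO

from PIL import Image
from torch.utils.data import Dataset

class RetryOnCorruptDataset(Dataset):
    """Hypothetical dataset that retries a random index when an image is corrupted."""

    def __init__(self, samples):
        # samples: list of (raw_image_bytes, caption) pairs -- placeholder storage
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        for _ in range(10):  # bound the retries so a fully broken dataset still fails loudly
            raw, caption = self.samples[idx]
            try:
                # a corrupted file raises here with "cannot identify image file ..."
                image = Image.open(BytesIO(raw)).convert('RGB')
                return image, caption
            except OSError:
                idx = random.randrange(len(self.samples))
        raise RuntimeError('too many corrupted images in a row')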
FYI, I used aria2c to download the images directly from the image URL list:
$ aria2c -i uris.txt
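At this scale it also helps that aria2c can fetch several URLs in parallel; something like the following should work (the -j value is just an example for the number of concurrent downloads, and -d sets the output directory):

$ aria2c -i uris.txt -j 16 -d images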
This is how I downloaded SBU (it took a lot of time, several days for the complete dataset of ~1M images).
import os
import shutil

import requests

if not os.path.isdir('images'):
    os.mkdir('images')

num_images_to_download = 10**6  # stop after 1M images

# Read the image URLs, one per line.
urls = []
with open('sbu/SBU_captioned_photo_dataset_urls.txt') as f:
    for line in f:
        urls.append(line.strip())

# Read the matching captions (same line order as the URLs).
captions = []
with open('sbu/SBU_captioned_photo_dataset_captions.txt') as f:
    for line in f:
        captions.append(line.strip())

img_id = 0
filtered_captions = []
for url, caption in zip(urls, captions):
    img_id += 1
    try:
        response = requests.get(url, stream=True, timeout=10)
    except requests.RequestException:
        continue  # unreachable host: skip instead of crashing
    if response.status_code != 200:
        continue  # skip any failed request, not only 404s, so error pages are not saved as images
    response.raw.decode_content = True  # decode gzip-compressed bodies before writing
    with open('images/' + str(img_id) + '.png', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    filtered_captions.append(caption)
    if img_id == num_images_to_download:
        break

# Keep only the captions whose images were actually downloaded.
with open('filtered_sbu_caps.txt', 'w') as f:
    for c in filtered_captions:
        f.write(c + '\n')
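Since some downloads can still come back truncated, a quick post-download check with Pillow (just a suggestion, not part of the script above) can flag the broken files before training; if you delete a file, remember to also drop its line from the captions file to keep the two aligned:

import os

from PIL import Image

# Flag every file in images/ that Pillow cannot parse.
bad_files = []
for name in sorted(os.listdir('images')):
    path = os.path.join('images', name)
    try:
        with Image.open(path) as img:
            img.verify()  # integrity check without decoding the full image
    except (OSError, SyntaxError):
        bad_files.append(path)

print(f'{len(bad_files)} corrupted files:')
for path in bad_files:
    print(path)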
When I try to train the model, there are some problems with the DataLoader. I get many errors such as 'Error while read file idx 433 in conceptual_caption_val_0 -> cannot identify image file <_io.BytesIO object at 0x7f36766d9bd0>'.
Many images cannot be loaded, and I don't know why. Do you have any suggestions? Or could you share your scripts for downloading the GCC and SBU datasets? Thank you very much! :)