list_train_files in OpenImageV6

hyunW3 commented 5 months ago

Thank you for the great work!

I would like to know how to download Open Images V6 and how to obtain the "list_train_files.txt" file for the dataset.

I'm currently working on the training phase, following notebooks/PerceptualCompression.ipynb. Since I'm not familiar with the Open Images V6 dataset, I used the fiftyone package to download it, as described on the official Open Images webpage:

import fiftyone

dataset = fiftyone.zoo.load_zoo_dataset("open-images-v6", split="train")

However, training fails with the following error:

Percentage of Linear Layer Parameters: 34.88%
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/PerCo/src/train_sd_perco.py", line 1142, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/PerCo/src/train_sd_perco.py", line 792, in main
[rank0]:     train_dataset = OpenImagesV6(root=cfg_perco.data_dir,
[rank0]:   File "/mnt/PerCo/src/openimages_v6.py", line 1214, in __init__
[rank0]:     self.indices = _load_indices(self.image_list_file, split=split)
[rank0]:   File "/mnt/PerCo/src/openimages_v6.py", line 1157, in _load_indices
[rank0]:     with open(indices_path, "r") as f:
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/PerCo/OpenImageV6/list_train_files.txt'

The fiftyone download does not include list_train_files.txt. The image below shows the dataset hierarchy.

Thanks for your help.

Nikolai10 commented 5 months ago

Hello @hyunW3,

The full dataset must be downloaded manually; see the section "Download Full Dataset With Google Storage Transfer". Note that the TSV files contain the image URLs, so Google Storage Transfer is not a strict requirement. Caution: the full dataset is 18 TB.
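
As a rough sketch of reading those TSV files without Storage Transfer: the snippet below assumes the Storage Transfer URL-list layout (a `TsvHttpData-1.0` header line, then one row per image with tab-separated URL, size in bytes, and base64 MD5). `ImageEntry` and `read_url_list` are hypothetical names and the sample rows are made up, so please check the layout against a real file before relying on it.

```python
from typing import Iterator, NamedTuple


class ImageEntry(NamedTuple):
    """One row of the URL list (hypothetical helper type)."""
    url: str
    size_bytes: int
    md5_base64: str


def read_url_list(tsv_text: str) -> Iterator[ImageEntry]:
    """Yield one entry per image row, skipping the header and blank lines.

    Assumed row layout: <url> TAB <size in bytes> TAB <base64 md5>.
    """
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        # The header line ("TsvHttpData-1.0") has a single field and is skipped here.
        if len(fields) < 3 or not fields[0].startswith("http"):
            continue
        yield ImageEntry(fields[0], int(fields[1]), fields[2])


# Made-up sample rows, only to illustrate the assumed layout.
sample = (
    "TsvHttpData-1.0\n"
    "https://example.com/train/0001.jpg\t1024\tq1w2e3==\n"
    "https://example.com/train/0002.jpg\t2048\ta9b8c7==\n"
)
entries = list(read_url_list(sample))
```

Each `entry.url` can then be fetched with whatever download tool is at hand, and `entry.size_bytes` gives a cheap way to estimate the total volume beforehand.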

From there, please familiarize yourself with the Open Images V6 data loader:

"Since Open Images puts all data into a single folder, it is expected that the user has already created a text file _list_trainfiles.txt with all the images of the split prior to instantiating this class (otherwise an os.walk takes forever)."

Note that fiftyone only provides a subset (1.7M):

"Open Images V6 is a dataset of ~9 million images, roughly 2 million of which are annotated and available via this zoo dataset.", see ref. A similar subset is supported in our tutorial; simply replace

--dataset_name="clic" \ with --dataset_name="open_images_v4" \.

Hope this helps, Nikolai

hyunW3 commented 5 months ago

Thank you for your kind and rapid reply! However, 18 TB is too large for me... I'm looking forward to your pre-trained model :)
