NVlabs / NVAE

The Official PyTorch Implementation of "NVAE: A Deep Hierarchical Variational Autoencoder" (NeurIPS 2020 spotlight paper)
https://arxiv.org/abs/2007.03898

Query: FFHQ Pre-Processing #39

Open KomputerMaster64 opened 2 years ago

KomputerMaster64 commented 2 years ago

Thank you sir for sharing the scripts for dataset preparation. I am trying to implement the DDGAN model on the FFHQ 256x256 dataset. I used the FFHQ 256x256 resized dataset from Kaggle, since the FFHQ 1024x1024 dataset is about 90 GB, which exceeds the limits of my resources.


The Kaggle dataset ships as an archive.zip file containing a directory "resized", which holds the 70k .jpg files.


The file structure is as follows:

archive.zip
└── resized
    └── (70k images)


I am using Google Drive and Colab notebooks for the implementation, with CODE_DIR = "/content/drive/MyDrive/Repositories/NVAE" and DATA_DIR = "/content/drive/MyDrive/Repositories/NVAE/dataset_nvae". When I run the command !python create_ffhq_lmdb.py --ffhq_img_path=$DATA_DIR/ffhq/resized/ --ffhq_lmdb_path=$DATA_DIR/ffhq/ffhq-lmdb --split=train, I get the following error message:

Traceback (most recent call last):
  File "create_ffhq_lmdb.py", line 70, in <module>
    main(args.split, args.ffhq_img_path, args.ffhq_lmdb_path)
  File "create_ffhq_lmdb.py", line 46, in main
    im = Image.open(img_path)
  File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2843, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/55962.png'
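
Before re-running the script, it may help to confirm which filename pattern the extracted images actually use. Below is a minimal check, not part of the repository; the data_dir value simply mirrors the DATA_DIR used above:

import glob, os

data_dir = '/content/drive/MyDrive/Repositories/NVAE/dataset_nvae'  # same value as DATA_DIR above
img_dir = os.path.join(data_dir, 'ffhq', 'resized')

# Count files per extension so we know what create_ffhq_lmdb.py should look for.
print('directory exists:', os.path.isdir(img_dir))
print('png files:', len(glob.glob(os.path.join(img_dir, '*.png'))))
print('jpg files:', len(glob.glob(os.path.join(img_dir, '*.jpg'))))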
KomputerMaster64 commented 2 years ago

I altered line 45 from img_path = os.path.join(ffhq_img_path, '%05d.png' % i) to img_path = os.path.join(ffhq_img_path, '%05d.jpg' % i), since the Kaggle FFHQ 256x256 resized dataset contains .jpg image files.
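
For reference, a slightly more defensive version of that lookup could fall back to .jpg only when the .png name is absent; this is just a sketch of the loop body around lines 45-46, with the surrounding code assumed from the traceback above:

# Hypothetical variant of the per-image lookup in create_ffhq_lmdb.py (not the official code).
img_path = os.path.join(ffhq_img_path, '%05d.png' % i)
if not os.path.exists(img_path):
    # Kaggle's resized dump ships .jpg files, so fall back to that extension.
    img_path = os.path.join(ffhq_img_path, '%05d.jpg' % i)
im = Image.open(img_path)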

With this change, the command !python create_ffhq_lmdb.py --ffhq_img_path=$DATA_DIR/ffhq/resized/ --ffhq_lmdb_path=$DATA_DIR/ffhq/ffhq-lmdb --split=train produces the following output:

100
200
300
400
500
600
700
800
900
1000
1100
1200
...
KomputerMaster64 commented 2 years ago

I cross-checked the files that were unzipped. There should be 70k files, but after repeated unzipping operations I am only able to extract about 50k or 52k images, even though the output of the cell shows that the last file unzipped was 69999.jpg.
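
A quick way to see exactly which indices failed to extract is to scan the target directory for the expected filenames; this is only a sketch, assuming the 00000.jpg to 69999.jpg naming shown in the unzip log below:

import os

img_dir = '/content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized'

# Collect every index in 0..69999 whose .jpg file is missing after extraction.
missing = [i for i in range(70000)
           if not os.path.exists(os.path.join(img_dir, '%05d.jpg' % i))]
print('missing files:', len(missing))
print('first few missing indices:', missing[:10])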



Google Colab notebook and Google Drive were used for the implementation.
Command used: !unzip images1024x1024.zip -d $DATA_DIR/ffhq/
Last few lines of the output of the cell:

  inflating: /content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/69990.jpg  
  inflating: /content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/69991.jpg  
  inflating: /content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/69992.jpg  
  inflating: /content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/69993.jpg  
  inflating: /content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/69994.jpg  
  inflating: /content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/69995.jpg  
  inflating: /content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/69996.jpg  
  inflating: /content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/69997.jpg  
  inflating: /content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/69998.jpg  
  inflating: /content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/69999.jpg 


State of the Google Drive after the operation (screenshot attached).

KomputerMaster64 commented 2 years ago

After executing the command !python create_ffhq_lmdb.py --ffhq_img_path=$DATA_DIR/ffhq/resized/ --ffhq_lmdb_path=$DATA_DIR/ffhq/ffhq-lmdb --split=train, I get the following output, which suggests that the training set has been converted into the LMDB dataset:

48600
48700
48800
48900
49000
...
62800
62900
63000
added 63000 items to the LMDB dataset.


However, about two minutes later, the output above changes to the following:

48600
48700
48800
48900
49000
49100
...
    main(args.split, args.ffhq_img_path, args.ffhq_lmdb_path)
  File "create_ffhq_lmdb.py", line 55, in main
    print('added %d items to the LMDB dataset.' % count)
lmdb.Error: mdb_txn_commit: Disk quota exceeded



This behaviour is not observed for the validation set. Could you please guide me on how to resolve this?
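
For context, mdb_txn_commit: Disk quota exceeded means LMDB could not commit the write transaction because the target filesystem (here the mounted Google Drive) ran out of quota, not because the conversion script itself failed. One possible workaround, sketched below under the assumption that the Colab local disk has enough free space, is to build the LMDB locally and copy the finished database to Drive; the paths and the copy step are illustrative and not part of the repository scripts:

import shutil
import subprocess

local_lmdb = '/content/ffhq-lmdb'  # Colab local disk, assumed to have enough free space
drive_lmdb = '/content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/ffhq-lmdb'
img_path = '/content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/'

# Build the training LMDB on the local disk instead of writing directly to Drive.
subprocess.run(['python', 'create_ffhq_lmdb.py',
                '--ffhq_img_path=' + img_path,
                '--ffhq_lmdb_path=' + local_lmdb,
                '--split=train'], check=True)

# Copy the finished LMDB directory over to Drive once the conversion succeeds.
shutil.copytree(local_lmdb, drive_lmdb)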