jeffjiang1204 opened this issue 2 months ago
Hi, I'm trying to reproduce the results but don't seem to be able to get the same training and testing sets as you. I followed the other issue #3 and the provided code/celebA_to_torch.py but got something that's a bit off:
Specifically, I downloaded img_align_celeba from the website and used the provided list_eval_partition.txt to separate the training and testing sets. There are 202,599 images, and the file lists 0-162,770 as train, 162,771-182,637 as val, and 182,638-202,599 as test. I passed the paths of the train and test folders to code/celebA_to_torch.py and got a training set that's not the same size as yours (I noticed that your notebook has a training checkpoint of size torch.Size([198650, 1, 80, 80])). Mine ends up somewhat smaller (~162,769) and probably also in the wrong order (as a result I can't reproduce the curve for the train input/output PSNR). Is there a different cutoff you used for the train/test split of the CelebA images? Or are there missteps in the procedure I described above? (Alternatively, is there a link to download the checkpoints you used?)
I also noticed that code/celebA_to_torch.py uses s=.125 for the load_CelebA_dataset function, but I have to use s=.5 to get 80x80 images (the original images from img_align_celeba are 178×218; with 0.125 I seem to get 20x20 images). I just want to double-check that this is expected and not a sign that I downloaded the wrong dataset.
Thanks for the clarifications!
Sorry for the delayed response. We didn't use the train/test partition from the CelebA dataset. We combined all the images and then partitioned the data after removing repeated images. To do this, take images [0:N] for train and [-N:] for test, where N is the set size. But you're right that the ordering differs from machine to machine. I uploaded the preprocessed images (cropped to 160x160, downsampled to 80x80) here. Although it's named "train", it's the entire dataset; just use the partitioning above. This is my personal Gdrive, so hopefully it works, but let me know if there are any issues.

As for s=.125, that's the default value for the high-quality CelebA dataset (CelebA-HQ). I cropped those images from 512x512 to 320x320 and then downsampled to 40x40 for the smaller network (the BF_CNN architecture, with a 40x40 receptive field). The results for that one are in the appendix. That dataset is smaller (around 30K images), so we couldn't use it to train the UNet with the larger receptive field and still get generalization.
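For anyone following along, that partitioning amounts to something like the sketch below (the file name and set sizes are placeholders; use the values from the notebook):

```python
import torch

# Load the preprocessed tensor shared above (file name assumed from the thread).
data = torch.load('train80x80_no_repeats.pt')  # e.g. [num_images, 1, 80, 80]

N_train = 198_650  # train-set size matching the checkpoint shape quoted above
N_test = 1_000     # hypothetical test-set size; use the notebook's value
train = data[:N_train]   # first N_train images
test = data[-N_test:]    # last N_test images
```

Note also that s is applied after the center crop, not to the raw image: s=.5 on the 160x160 crop of the aligned images gives 80x80, while s=.125 on the 320x320 crop of the 512x512 CelebA-HQ images gives 320 * 0.125 = 40, i.e. 40x40.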
Hi,
I also tried running the pre-processing code on all 202,599 images, but after using the remove-repeats function I ended up with only 196,018 images of size 80x80, instead of the 198,650 in your checkpoint. Could you clarify why this discrepancy occurs, or whether there's a step I might have missed? Did the remove-repeats function use 20x20 downsampled images (as written in the code)?
If we remove the following lines from the function quality_metric_func, we get 198,748 images (i.e. the repeats are found on the 80x80 images):

```python
# pool = torch.nn.AvgPool2d(int(dataset.shape[2]/nn_dim))
# dataset_down = pool(dataset)
```
Thanks!
Hi - I believe I downsampled the images to 20x20 before computing the similarity, due to a memory issue. In effect, that creates a stricter threshold, because downsampling gets rid of details which might differ between images. I need to go back and dig in to find out why we get different numbers of images, but in the meantime here is the data we used after pre-processing. It's possible I only used downsampling for the 160x160 images, but I need to double-check.
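To make the pooling effect concrete, here is a rough, hypothetical sketch of this style of repeat removal (not the repo's actual quality_metric_func; the distance metric and threshold are assumptions). Calling it with use_pooling=False corresponds to removing the two pooling lines quoted above, i.e. comparing at full 80x80 resolution:

```python
import torch

def remove_repeats(dataset, nn_dim=20, threshold=1e-3, use_pooling=True):
    """Drop near-duplicate images via pairwise distances on (optionally pooled) pixels."""
    if use_pooling:
        # Average-pool, e.g. 80x80 -> 20x20, before comparing. This saves memory,
        # and since fine details that might differ are blurred away first, it
        # effectively makes the duplicate test stricter.
        pool = torch.nn.AvgPool2d(int(dataset.shape[2] / nn_dim))
        compare = pool(dataset)
    else:
        compare = dataset
    flat = compare.flatten(1)        # [N, D]
    # NOTE: for ~200K images the N x N matrix below must be computed in chunks.
    dists = torch.cdist(flat, flat)  # [N, N] pairwise Euclidean distances
    # Mark an image as a repeat if some earlier image lies within the threshold.
    is_dup = (dists < threshold).triu(diagonal=1).any(dim=0)
    return dataset[~is_dup]
```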
I would like to jump in with a related problem. I noticed that in Demo_UNet_CelebA80x80.ipynb, in the Load datasets section (cell No. 5), the split is determined by the order in which the images were loaded (since `train = data[0:N]` and `test = data[-N:]`). Since the images were loaded in celebA_to_torch.py:L17 > dataloader_func.py:load_CelebA_dataset by simply using `os.listdir`, the order of the loaded images is completely arbitrary and depends on the computer's filesystem and the way it stores the files. This means it is impossible to reproduce the train-test splits used for the trained checkpoints, and we would need to train the models from scratch, which takes a lot of compute.
Could you possibly provide the train80x80_no_repeats.pt files?
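In the meantime, one workaround for new preprocessing runs (my suggestion, not code from the repo) is to sort the directory listing so the load order no longer depends on the filesystem:

```python
import os

image_dir = 'img_align_celeba'  # hypothetical path to the aligned images

# os.listdir returns file names in arbitrary, filesystem-dependent order;
# sorting them makes the downstream data[0:N] / data[-N:] split reproducible.
fnames = sorted(os.listdir(image_dir))
print(fnames[:3])  # e.g. ['000001.jpg', '000002.jpg', '000003.jpg']
```

This only helps future runs, though; it can't recover the split behind the existing checkpoints, hence the request for the preprocessed files.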
You're right. Order is arbitrary. Here is the link to the data.