reproducibility of FFHQ-70K and 140k from Figure 7 of paper

georgestein commented 1 year ago

Using the provided models I am unable to reproduce the FID reported in Figure 7 of the paper.

For example, the paper reports FID=4.30 for this model: https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/paper-fig7c-training-set-sweeps/ffhq70k-paper256-ada.pkl

But I get FID=5.3. I find similar increases for the other models provided.

I notice there are a number of similar issues opened in the past for the same suite of models (mainly for FFHQ-1k). Were these ever resolved? If so, what was the solution?

nurpax commented 9 months ago

Hi @georgestein, thanks for reporting this and sorry for not responding earlier. I saw your comment in the paper "Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models" and I'll quote it here:

We used the following checkpoint https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/paper-fig7c-training-set-sweeps/ffhq140k-paper256-ada.pkl. Reported FID=3.81, ours=5.30 was significantly larger. The settings we used to generate the images were consistent with the codebase and with those used in the paper.

Thanks for precisely describing how you computed FID for StyleGAN2-ADA in your paper!

We did a bit of data archaeology and root-caused the difference in FID reporting to how the FFHQ 256x256 dataset was obtained. Back when StyleGAN2-ADA was originally developed on TensorFlow, we were using a TFRecords version of FFHQ where lower resolutions such as 256x256 are obtained through box filtering. This is the original dataset that was used to compute the FID 3.81 score you saw in the paper.

When we ported to PyTorch, we got rid of the progressive LOD format and reformatted our datasets into zip files. Downscaling with Lanczos is a sensible default, and that's what dataset_tool.py does. However, to reproduce the original paper's FID scores, one should use a box filtered (dataset_tool.py --resize-filter=box) version of FFHQ 256x256.
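To illustrate why the two filters produce different pixels, here is a minimal pure-Python sketch that downsamples a 1D step edge by a factor of 4 (the same ratio as 1024 to 256). It is not the code used by dataset_tool.py or Pillow; the function names and the edge handling (truncate and renormalize) are assumptions made for illustration only.

```python
import math

def box_downsample(row, factor):
    """Average each run of `factor` neighbouring samples (a box filter)."""
    return [sum(row[i:i + factor]) / factor for i in range(0, len(row), factor)]

def lanczos_kernel(x, a=3):
    """Lanczos windowed-sinc kernel with `a` lobes on each side."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def lanczos_downsample(row, factor, a=3):
    """Downsample with a Lanczos kernel stretched by `factor`.

    Samples outside the row are skipped and the weights renormalized
    (an assumption of this sketch, not a statement about any library).
    """
    out = []
    support = a * factor
    for j in range(len(row) // factor):
        center = (j + 0.5) * factor  # output sample centre in input coordinates
        lo = max(int(math.floor(center - support)), 0)
        hi = min(int(math.ceil(center + support)), len(row))
        acc = norm = 0.0
        for i in range(lo, hi):
            w = lanczos_kernel((i + 0.5 - center) / factor, a)
            acc += w * row[i]
            norm += w
        out.append(acc / norm)
    return out

step = [0.0] * 16 + [1.0] * 16           # a sharp edge
box = box_downsample(step, 4)            # plain block averages
lanczos = lanczos_downsample(step, 4)    # negative lobes cause ringing near the edge

print("box:    ", [round(v, 3) for v in box])
print("lanczos:", [round(v, 3) for v in lanczos])
```

The box result stays within the input range, while the Lanczos result undershoots and overshoots around the edge. The two downscaled datasets therefore have different pixel statistics, which is consistent with them not being interchangeable as FID references.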

The README shows the wrong steps for creating FFHQ 256x256 and this is clearly a bug. I'll edit the README to show the correct dataset_tool.py command line for resizing FFHQ to 256x256 resolution.

I also went ahead and tested FID computation with the paper-fig7c-training-set-sweeps/ffhq140k-paper256-ada.pkl model. I am able to obtain FID 3.81 when using the box filtered version of FFHQ 256x256. Here are my repro steps:

python calc_metrics.py --metrics=fid50k_full --data path/to/datasets/ffhq-256x256-box.zip --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/paper-fig7c-training-set-sweeps/ffhq140k-paper256-ada.pkl

# outputs:
# {"results": {"fid50k_full": 3.808835723180682}, "metric": "fid50k_full", "total_time": 454.76893281936646, "total_time_str": "7m 35s", "num_gpus": 1, "snapshot_pkl": "https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/paper-fig7c-training-set-sweeps/ffhq140k-paper256-ada.pkl", "timestamp": 1695376997.1809232}
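For scripted comparisons, the JSON line printed by calc_metrics.py can be parsed directly. The sketch below uses a shortened copy of the output above (timing, snapshot, and timestamp fields omitted); the 0.005 tolerance is an assumption that corresponds to agreement at two decimal places.

```python
import json

# Shortened copy of the calc_metrics.py output line shown above.
line = '{"results": {"fid50k_full": 3.808835723180682}, "metric": "fid50k_full", "num_gpus": 1}'

PAPER_FID = 3.81    # value reported in the paper for ffhq140k-paper256-ada
TOLERANCE = 0.005   # assumed: agreement after rounding to two decimals

result = json.loads(line)
fid = result["results"][result["metric"]]
matches_paper = abs(fid - PAPER_FID) < TOLERANCE

print(f"{result['metric']} = {fid:.2f}, matches paper: {matches_paper}")
```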

I also tried computing FID using a more recent version of our FID evaluator:

# generate images for fid computation
python gen_images.py --outdir=ffhq140k-paper256-ada-trunc1.0_50k --trunc=1 --seeds=0-49999 --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/paper-fig7c-training-set-sweeps/ffhq140k-paper256-ada.pkl

# compute FID
python fid.py ref --data path/to/ffhq-256x256-box.zip --dest out/ffhq-256x256-box.npz
python fid.py calc --images path/to/ffhq140k-paper256-ada-trunc1.0_50k --ref out/ffhq-256x256-box.npz
# latter outputs:
# Calculating FID...
# 3.88971

The FID difference between Lanczos and box filtered FFHQ 256x256 is very noticeable. I get roughly FID 4.5-4.7 if I repeat the above steps using the Lanczos filtered FFHQ 256x256.

georgestein commented 9 months ago

Hi Janne, thank you so much for your detailed reply and for solving the issue! Thank you also for your follow-ups over email, where you were unable to reproduce a result in our paper. I will answer both here for future reference and transparency.

As you show above, the difference in FID that I was seeing when using stylegan2-ada is simply due to the original paper using box filtering, while I (and the README) were using Lanczos.

When using ffhq70k-paper256-ada.pkl, FID=4.3 using box filtering and FID=5.3 using Lanczos. This was the difference I saw in the initial issue I raised.

When using ffhq140k-paper256-ada.pkl, FID=3.8 using box filtering.

So, for anyone else reading: there are no reproducibility issues with StyleGAN2-ADA FID values.

Over email, you were then unable to reproduce the FID_DINOv2=514.78 that we report in our paper when using our repository with the ffhq140k-paper256-ada.pkl model that I documented in Appendix A.3.

The reason is that I documented the wrong model in our Appendix; we actually used ffhq70k-paper256-ada.pkl for all analyses. When troubleshooting the above discrepancy between your FID values and mine (which we now know is due to different filters when downsizing FFHQ 1024^2), I ran a test of all your models to see whether it was perhaps just an odd one out that did not match. As a result, I referenced the wrong model checkpoint when writing up the appendix.

When using ffhq140k-paper256-ada.pkl you found: {'run00': {'fd': 464.2486266662104, 'fd_infinity_value': 460.768857127296, 'kd_value': 1.397603187841658, 'kd_variance': 0.029347820003614335}}

When using the same model I reproduce your results: {'run00': {'fd': 465.02293054300986, 'kd_value': 1.402422503499429, 'kd_variance': 0.030370566612737994, 'precision': 0.6202, 'recall': 0.1062, 'density': 0.38166000000000005, 'coverage': 0.4248}}

When using ffhq70k-paper256-ada.pkl I reproduce the results in our paper: {'run00': {'fd': 515.6768787154372, 'kd_value': 1.640400001954906, 'kd_variance': 0.03784080787651467, 'precision': 0.6066, 'recall': 0.059, 'density': 0.37302, 'coverage': 0.395}}
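As a rough sanity check on the numbers above, the following sketch compares the 'fd' values by relative difference. The helper name and the 0.5% threshold are assumptions for illustration; the thread does not state what run-to-run variation to expect.

```python
def relative_diff(a, b):
    """Relative difference of `b` with respect to `a`."""
    return abs(a - b) / abs(a)

# 'fd' values quoted in this thread.
fd_140k_email = 464.2486266662104    # ffhq140k checkpoint, result sent over email
fd_140k_rerun = 465.02293054300986   # ffhq140k checkpoint, rerun
fd_paper = 514.78                    # FID_DINOv2 reported in the paper
fd_70k_rerun = 515.6768787154372     # ffhq70k checkpoint, rerun

THRESHOLD = 0.005  # assumed 0.5% tolerance for "reproduced"

same_140k = relative_diff(fd_140k_email, fd_140k_rerun) < THRESHOLD
same_70k = relative_diff(fd_paper, fd_70k_rerun) < THRESHOLD
checkpoint_gap = relative_diff(fd_140k_rerun, fd_70k_rerun)

print(f"140k runs agree: {same_140k}")
print(f"70k rerun matches paper: {same_70k}")
print(f"gap between checkpoints: {checkpoint_gap:.1%}")
```

Under this threshold, the repeated runs agree with each other, while the gap between the two checkpoints is far larger, which fits the explanation that the paper's value came from the 70k checkpoint.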

I apologize for the inconvenience this caused and thank you for your work on the issue. I will fix this error in our Appendix for the camera-ready version of the paper, and also submit a revision to the arXiv version. These versions will 1) add the explicit resizing details on FFHQ, similar to how we did with ImageNet, 2) reference the ffhq70k-paper256-ada.pkl checkpoint instead of 140k, and 3) fix/address/remove the comment on StyleGAN2-ADA reproducibility issues.

NVlabs / stylegan2-ada-pytorch

reproducibility of FFHQ-70K and 140k from Figure 7 of paper #283