chail / patch-forensics

Investigating patches for fake image classification

Reproduce results #5

Closed yanling-lai closed 3 years ago

yanling-lai commented 3 years ago

Hello!

Your work is incredible! Thank you very much for providing the code.

I am trying to reproduce block 2 and block 3 in table 12 in the paper. I downloaded the code and created the dataset without modifying anything, then ran training to completion (no early interruption) with the following commands (as provided in "01_train_gan_xception_patches_samplesonly.sh"):

  python train.py --gpu_ids 0 --seed 0 --loadSize 299 --fineSize 299 \
          --name gp1d-gan-samplesonly --save_epoch_freq 200 \
          --real_im_path dataset/faces/celebahq/real-tfr-1024-resized128 \
          --fake_im_path dataset/faces/celebahq/pgan-pretrained-128-png \
          --suffix seed{seed}_{which_model_netD}_{lr_policy}_p{patience} \
          --which_model_netD xception_block2 --model patch_discriminator \
          --patience 20 --lr_policy constant --max_epochs 1000 \
          --no_serial_batches

  python train.py --gpu_ids 0 --seed 0 --loadSize 299 --fineSize 299 \
          --name gp1d-gan-samplesonly --save_epoch_freq 200 \
          --real_im_path dataset/faces/celebahq/real-tfr-1024-resized128 \
          --fake_im_path dataset/faces/celebahq/pgan-pretrained-128-png \
          --suffix seed{seed}_{which_model_netD}_{lr_policy}_p{patience} \
          --which_model_netD xception_block3 --model patch_discriminator \
          --patience 10 --lr_policy constant --max_epochs 1000 \
          --no_serial_batches

However, for the evaluation on the test dataset, the results differ considerably for GLOW and GMM: for GLOW my result is much higher, but for GMM it does not reach the best reported results. I am wondering whether there is any step I missed before training. Thank you very much. (P.S. I used the provided pre-trained model to evaluate the same test dataset, and it produced the same results as in the paper.)

The following table shows the results I reproduced for block 2 ("rp" means reproduced; "vn" means the nth run with exactly the same settings). [screenshot: 2021-05-04 10-35-04]

The following table is for block 3. [screenshot: 2021-05-04 10-59-57]

yanling-lai commented 3 years ago

Sorry, I think the tables I provided were not precise enough, since I did not state the training and testing datasets. For the training dataset, I used CelebA-HQ and Progressive GAN images produced with the provided code ("00_data_processing_export_tfrecord_to_img.sh", "00_data_processing_sample_celebahq_models.sh").

For all of the tables, "rp" means reproduced, "vn" means the nth run with exactly the same settings, and "pre-train" means using the pre-trained model downloaded from the link provided in the README.

The following table shows the test results for block 2 on the dataset downloaded from the link provided in the README. [screenshot: 2021-05-04 17-56-35]

The following table shows the test results for block 2 on the dataset produced with the provided code. [screenshot: 2021-05-04 18-01-47]

The following table shows the test results for block 3 on the dataset downloaded from the link provided in the README. [screenshot: 2021-05-04 18-27-26]

The following table shows the test results for block 3 on the dataset produced with the provided code. [screenshot: 2021-05-04 18-28-26]

Thank you very much.

chail commented 3 years ago

Thanks for the thorough experiments!

To debug, I wonder if the checkpoints are converging in similar places? My checkpoints saved at epoch 543 for the model named gp1d-gan-samplesonly_seed0_xception_block2_constant_p20 and epoch 261 for the model named gp1d-gan-samplesonly_seed0_xception_block3_constant_p10.
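If it helps to compare, here is a small stdlib sketch for pulling the saved epoch numbers out of a checkpoints directory. The `<N>_net_D.pth` file-name pattern is an assumption based on `--save_epoch_freq`, and `saved_epochs` is a hypothetical helper; adjust the regex to whatever your run actually wrote.

```python
import os
import re

def saved_epochs(ckpt_dir):
    """Return sorted epoch numbers from checkpoint files named like '200_net_D.pth'.

    The naming pattern is an assumption; adapt the regex to your run's output.
    """
    epochs = []
    for fname in os.listdir(ckpt_dir):
        m = re.match(r"(\d+)_net_D\.pth$", fname)
        if m:
            epochs.append(int(m.group(1)))
    return sorted(epochs)
```

Comparing these lists across your runs would show quickly whether the two setups are stopping in wildly different places.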

For dataset preparation, here's a link to the train directories I used for the tfrecord celebahq images and the pgan samples: https://www.dropbox.com/s/s933qpxa5xzremo/dataset_celebahq_pgan.zip?dl=0

As a side note, I found the preprocessing pretty tricky, since I don't want the classifier to simply learn the differences between the real and fake image preprocessing pipelines rather than actual image content. In my experiments I did pretty aggressive downsizing to 128px plus an additional resize step prior to classification, but less aggressive preprocessing might be sufficient.
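To make the downsizing step concrete, here is a minimal pure-Python sketch of area (box) downsampling on a single-channel image, the same effect you would get from a box-filter resize in an imaging library. `box_downsample` is a hypothetical helper for illustration, not the repo's actual preprocessing code.

```python
def box_downsample(img, factor):
    """Downsample a 2D image by averaging non-overlapping factor x factor blocks.

    img: 2D list of floats whose dimensions are divisible by `factor`.
    Area averaging like this discards high-frequency detail, which is the point:
    both real and fake images pass through the same lossy step.
    """
    h, w = len(img), len(img[0])
    out = []
    for i in range(0, h, factor):
        row = []
        for j in range(0, w, factor):
            block = [img[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out
```

Applying the same downsample (and any subsequent resize) identically to both classes keeps the classifier from keying on preprocessing fingerprints instead of content.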

yanling-lai commented 3 years ago

Hello! Sorry for the late reply!

Thank you very much for the training dataset! I trained the models five more times using your training data; the convergence points vary a lot, but the results are similar to yours! We are guessing that small differences in the random vectors (for the GAN) generated on different machines cause the small differences between our images.
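One quick way to check that guess would be to quantify the pixel gap between samples generated on the two machines. A minimal sketch, assuming each image has been loaded as a 2D list of floats (`max_abs_diff` is a hypothetical helper, not part of the repo):

```python
def max_abs_diff(img_a, img_b):
    """Largest per-pixel absolute difference between two same-sized 2D images."""
    return max(
        abs(a - b)
        for row_a, row_b in zip(img_a, img_b)
        for a, b in zip(row_a, row_b)
    )
```

If this comes out tiny (a few levels out of 255) for samples drawn from the same seed on both machines, floating-point or driver differences are a plausible explanation; a large gap would point at the random vectors themselves differing.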

[screenshot: 2021-06-02 9:31 PM]

And thank you for your side note!