YingdaXia / SynthCP

Official code base for the ECCV oral paper "Synthesize then Compare: Detecting Failures and Anomalies for Semantic Segmentation"
MIT License

Reproducibility of results #2

Closed DarioFontanel closed 3 years ago

DarioFontanel commented 3 years ago

Hi @YingdaXia,

I have run the code you provided for Anomaly Segmentation with the default parameters, following the instructions specified in the readme. I obtained the following results: AUPR: 6.8, AUROC: 86.2, FPR: 28.2.

Among these, the AUPR is lower than the result reported in the paper, which is 9.3.

Could you help me understand what is missing?

Thank you,

Dario Fontanel

YingdaXia commented 3 years ago

Hi Dario,

This is an interesting result. When tuning our experiments, we always obtained a good AUPR but not always a good FPR, since a simple feature-space distance tends to produce false positives.

To diagnose the results, could you answer the following questions about the details? There are several components in the overall framework.

  1. Are you using the provided GAN, or did you train one yourself?
  2. The parameter t in the paper is hardcoded at line 149 of eval_ood_rec.py. You can experiment with this hyperparameter and report your AUPR under t = (-1, 0, 0.8, 0.9, 0.99, 0.999, 1.0).
  3. We are uploading our segmentation model, which was used for all results reported in the paper. I will notify you when it is ready for use.
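For reference when sweeping t, the three metrics in question can be computed from per-pixel anomaly scores as in the sketch below. This is a generic implementation of the standard metric definitions, not the repository's own evaluation code in eval_ood_rec.py; anomalies are treated as the positive class.

```python
# Minimal, self-contained sketch of AUPR, AUROC, and FPR@95%TPR from
# 1-D anomaly scores and binary labels (1 = anomaly). Assumes no ties
# in the scores; ties would need average ranks for an exact AUROC.
import numpy as np

def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Rank-based AUROC (Mann-Whitney U statistic)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return float((ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2)
                 / (n_pos * n_neg))

def aupr(scores: np.ndarray, labels: np.ndarray) -> float:
    """Average precision, sweeping the threshold over sorted scores."""
    sorted_labels = labels[np.argsort(-scores)]
    tp = np.cumsum(sorted_labels)
    precision = tp / np.arange(1, len(sorted_labels) + 1)
    return float((precision * sorted_labels).sum() / sorted_labels.sum())

def fpr_at_95_tpr(scores: np.ndarray, labels: np.ndarray) -> float:
    """FPR at the first threshold whose TPR reaches 95%."""
    sorted_labels = labels[np.argsort(-scores)]
    tpr = np.cumsum(sorted_labels) / sorted_labels.sum()
    fpr = np.cumsum(1 - sorted_labels) / (1 - sorted_labels).sum()
    idx = int(np.searchsorted(tpr, 0.95))
    return float(fpr[min(idx, len(fpr) - 1)])
```

Running all t values through the same metric code rules out metric-implementation differences as a source of the AUPR gap.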
YingdaXia commented 3 years ago

We have uploaded our segmentation model. Please re-download the checkpoints and find it under checkpoints/caos-segmentation/. This model, together with the GAN model (already provided), reproduces all results reported in our paper.

DarioFontanel commented 3 years ago

Hi @YingdaXia,

First of all, thank you very much for your help, and please excuse the late reply; all the evaluations took some time.

These are the results I got:

| t | AUPR | AUROC | FPR95 |
|-------|------|-------|------|
| -1 | 7.0 | 89.0 | 25.0 |
| 0 | 7.0 | 89.0 | 25.0 |
| 0.8 | 7.2 | 88.8 | 25.0 |
| 0.9 | 7.3 | 88.7 | 25.0 |
| 0.99 | 7.4 | 88.2 | 25.0 |
| 0.999 | 6.8 | 86.2 | 28.2 |
| 1 | 5.3 | 72.8 | 61.5 |

Regarding the first question, I used the provided GAN.

With the checkpoints you uploaded, on the other hand, the results match the ones in the paper exactly. Could you please tell me which parameters you used to obtain these weights? Or has anything changed with respect to the previously used training procedure?

Thank you very much again,

Dario

YingdaXia commented 3 years ago

Hi Dario,

Thanks for providing the results. We can conclude that (1) the problem is not in training the GAN, and (2) the problem is not in the inference code. It seems the only difference is how the segmentation model was trained.

Actually, regarding the segmentation model, we didn't modify any settings from the original one provided in the CAOS benchmark. The odd thing is that they set a seed and we copied the provided configuration directly, so the resulting model should be identical across runs. In any case, we will definitely look further into this issue.
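On the seeding point: a common way to make PyTorch training runs repeatable is to seed every RNG and force deterministic cuDNN behavior, along the lines of the sketch below. Note that even with all of this set, some CUDA kernels remain nondeterministic, which could explain checkpoints differing across machines or library versions.

```python
# Sketch of full RNG seeding for repeatable training runs.
# The torch import is guarded so the sketch also runs torch-free.
import random
import numpy as np

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

# Two draws under the same seed should match exactly.
seed_everything(0)
first = np.random.rand(3).tolist()
seed_everything(0)
second = np.random.rand(3).tolist()
```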

Btw, we list the environment of our runs in the anomaly segmentation experiments for future reference:

- conda version: 4.8.2
- python version: 3.7.3
- pytorch version: 1.2.0
- numpy version: 1.16.4
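A small script like the following (my own sketch, not part of the repo) prints the same version information, so anyone reproducing the results can compare their environment against the one above:

```python
# Print interpreter and key package versions to compare environments.
import platform
import importlib

lines = ["python: " + platform.python_version()]
for pkg in ("torch", "numpy"):
    try:
        mod = importlib.import_module(pkg)
        lines.append(pkg + ": " + mod.__version__)
    except ImportError:
        lines.append(pkg + ": not installed")

report = "\n".join(lines)
print(report)
```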

Best, Yingda