MadryLab / backgrounds_challenge


Test results reported on IN-9 seem to be wrong! #5

Closed aliasgharkhani closed 2 years ago

aliasgharkhani commented 3 years ago

Hi there,

First of all, thank you for your clean code. I cloned your repo along with the test data available in the releases. After running your evaluation code with an ImageNet pre-trained ResNet-50 on 'mixed_rand' and 'mixed_same', I got 84.32% and 90.99%, respectively. However, the README reports 78.9% and 86.2% on 'mixed_rand' and 'mixed_same', respectively, for an ImageNet pre-trained ResNet-50. Are the reported results correct? Is the published test dataset the same one you used to produce the reported results?

Thank you

kaixiao commented 3 years ago

Hi,

Could you please let me know what command you ran to get those final results?

aliasgharkhani commented 3 years ago

I ran `python in9_eval.py --eval-dataset 'mixed_same' --data-path '/PATH/TO/TESTDATA'`

kaixiao commented 3 years ago

Hi,

That seems right. All of the numbers we show in the README come from running the code in this repo on the dataset in the release.

My best guess at the moment is that the timm package, which we load our pre-trained models from (see: https://github.com/MadryLab/backgrounds_challenge/blob/46d224bb02a296681eddbae44a49da9abb5ba038/tools/model_utils.py#L45), has since been updated with better models (see this recent work: https://arxiv.org/abs/2110.00476), and those better models also perform correspondingly better on mixed_rand and mixed_same.
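For anyone trying to reproduce the README numbers, below is a minimal sketch (not from this repo; it assumes `timm` and `torch` are installed) for checking which timm version and pretrained ResNet-50 weights get picked up locally, roughly mirroring how `tools/model_utils.py` loads the model:

```python
# Minimal sketch (assumption: timm and torch are installed) to check which
# timm version and pretrained ResNet-50 weights are loaded locally, since a
# newer timm release may ship stronger ImageNet weights than the ones used
# when the README numbers were generated.
import timm
import torch

print("timm version:", timm.__version__)

# Load an ImageNet pre-trained ResNet-50 via timm, as the repo does.
model = timm.create_model("resnet50", pretrained=True)
model.eval()

# Quick sanity checks: parameter count and a forward pass on a dummy input.
n_params = sum(p.numel() for p in model.parameters())
print("parameters:", n_params)

with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))
print("output shape:", tuple(out.shape))  # expected: (1, 1000)
```

If the gap really does come from newer pretrained weights, pinning timm to the release that was current when the README numbers were produced should bring the results back in line.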

Thank you for pointing this out! This actually further illustrates the point made in our paper: models that perform better on the original dataset are also better at classifying images even in the presence of confusing backgrounds.