Hi, thanks for the submission! We will add it as soon as possible.
Hi @ruitian12! A small update: I have seen that you may be using a model from timm. Can you please confirm that the architectures you are using are indeed deit_{small,base}_patch16_224 in timm? If this is the case, then I will add the model as soon as #100 (which adds better support for timm models) is merged.
Hi @dedeswim. Thanks for reaching out. The models we use are identical to the architectures in timm, and our checkpoints can be loaded directly onto deit_{small,base}_patch16_224. So it would be feasible and simple to add our models once the timm support is merged.
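For reference, loading one of our checkpoints onto the timm architecture amounts to something like the following sketch (the checkpoint path is a placeholder, and the exact state-dict layout may differ):

```python
import timm
import torch

# Build the plain timm architecture, without pretrained weights
model = timm.create_model('deit_small_patch16_224', pretrained=False)

# Load our checkpoint onto the architecture (the path is a placeholder)
state_dict = torch.load('deit_small_checkpoint.pth', map_location='cpu')
model.load_state_dict(state_dict)
model.eval()
```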
That's great, thanks for letting me know!
Hi, sorry for the late reply. I have uploaded the corresponding results for ImageNet-C and ImageNet-3DCC.
Hi! Sorry for my late reply, but I've been very busy recently. Thanks for the update, I will finalize the addition of your model ASAP!
Hi @ruitian12, I am sorry it's taking so long, but I realized there are some changes to make before integrating new corruption models. Once #111 is merged, we will be able to merge the branch with your model.
Thanks for your patience!
Nvm, I manually computed the mCE values, so I can merge your PR now. I am adding it with PR #112; if everything looks good to you, I will merge it.
Hi @ruitian12! Thanks for the update. I noticed that there is a significant difference between these and the results you originally posted, which is something we didn't observe with other models on the leaderboard. Did anything change between the two evaluations?
Hi @dedeswim, thanks for your careful checks! Since evaluation on the full datasets is not feasible on a single GPU, we changed the evaluation code: the full-dataset results on IN-3DCC are based on our own multi-GPU implementation, while the 5000-sample results on IN-C are based on the open-source code. Note that the preprocessing for IN-C evaluation is originally based on 'Crop224', which is also adopted in our full-dataset evaluation and differs from the 'Res256Crop224' used in robustbench.
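For clarity, the two preprocessing pipelines correspond roughly to the following torchvision transforms (a sketch based on their usual definitions, not code taken from either repository):

```python
import torchvision.transforms as T

# 'Res256Crop224': resize the short side to 256, then center-crop to 224
res256_crop224 = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])

# 'Crop224': center-crop to 224 directly, without the initial resize
crop224 = T.Compose([T.CenterCrop(224), T.ToTensor()])
```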
Thanks for your reply @ruitian12! Were the first results computed by using Res256Crop224, and these later ones on the full dataset with Crop224? If this is correct, would you mind reporting the results on 5000 images by using Crop224, please? We would prefer to add these to the leaderboard, to keep it consistent with the other entries (computed on 5000 samples), but we also want to report the best possible result for your entry (which seems to be given by Crop224, instead of Res256Crop224).
Thanks!
FYI, you can change the preprocessing method by passing preprocessing='Crop224' to the benchmark function.
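A minimal sketch of such a call (the model-loading step, paths, and device are assumptions; the model name follows the entry used later in this thread):

```python
import torch
from robustbench import benchmark
from robustbench.utils import load_model

# Load the model from the model zoo (assumes the entry has been added)
model = load_model(model_name='Tian2022Deeper_DeiT-S',
                   dataset='imagenet',
                   threat_model='corruptions')

clean_acc, robust_acc = benchmark(model,
                                  n_examples=5000,
                                  dataset='imagenet',
                                  threat_model='corruptions',
                                  preprocessing='Crop224',  # instead of the default Res256Crop224
                                  data_dir='/path/to/imagenet',
                                  batch_size=256,
                                  device=torch.device('cuda'))
```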
Hi @dedeswim, thanks for your careful checks and kind suggestions! After going over the evaluation code, I figured out that the misalignment is caused by failing to apply normalization when testing on 5000 samples with robustbench. I will update our results within the next two days, after rectifying the normalization and preprocessing. Many thanks!
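For reference, the kind of fix involved is sketched below (not our actual code; the backbone and the statistics here are the standard DeiT/ImageNet assumptions), since robustbench expects models to apply their own input normalization:

```python
import timm
import torch
import torch.nn as nn

class NormalizedModel(nn.Module):
    """Applies input normalization inside forward(), as robustbench expects."""

    def __init__(self, model, mean, std):
        super().__init__()
        self.model = model
        self.register_buffer('mean', torch.tensor(mean).view(1, 3, 1, 1))
        self.register_buffer('std', torch.tensor(std).view(1, 3, 1, 1))

    def forward(self, x):
        return self.model((x - self.mean) / self.std)

# The backbone stands in for the DeiT model loaded from our checkpoint
backbone = timm.create_model('deit_small_patch16_224', pretrained=False)
model = NormalizedModel(backbone,
                        mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
```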
Hi @dedeswim, sorry for the late update. I have uploaded the corrected results on IN-C and IN-3DCC, for both 5000 samples and the full datasets. In particular, I noticed that there are performance gaps between different Pillow versions; the uploaded results are evaluated under Pillow==8.2.0. Thanks again for your careful suggestions!
Hi @dedeswim, sorry to bother you about the merge. I have already updated the results as mentioned above. Feel free to let me know if there are any additional concerns. Thanks!
Hi @ruitian12,
I've updated the leaderboard https://robustbench.github.io/#div_imagenet_corruptions_heading with your models. The numbers are close to your latest ones but not exactly the same (~0.4% difference). This doesn't change the ranking, though (your models are clearly top-1 and top-2, by a large margin).
I've opened a PR (based on Edoardo's earlier PR) that adds better support for 2DCC and 3DCC evaluation. I used the following scripts to evaluate your models:
python -m robustbench.eval --n_ex=5000 --dataset=imagenet --threat_model=corruptions_3d --model_name=Tian2022Deeper_DeiT-S --data_dir=/tmldata1/andriush/imagenet --corruptions_data_dir=/tmldata1/andriush/data/3DCommonCorruptions/ --batch_size=256 --to_disk=True
python -m robustbench.eval --n_ex=5000 --dataset=imagenet --threat_model=corruptions_3d --model_name=Tian2022Deeper_DeiT-B --data_dir=/tmldata1/andriush/imagenet --corruptions_data_dir=/tmldata1/andriush/data/3DCommonCorruptions/ --batch_size=256 --to_disk=True
And I used Pillow 9.4.0, so maybe that contributed to the difference. Let us know what you think about these evaluation results.
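In case it helps when reproducing the numbers, the installed Pillow version can be checked with this standard one-liner:
python -c "import PIL; print(PIL.__version__)"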
Best, Maksym
Hi @max-andr, the evaluation results are sound: we also saw a slight drop in performance with an upgraded Pillow version in previous experiments. Thanks for your great efforts!
Best, Rui
Perfect! Then I'll close this issue.