DeLightCMU / RSC

This is the official implementation of "Self-Challenging Improves Cross-Domain Generalization" (ECCV 2020).
BSD 2-Clause "Simplified" License

Non-Reproducible Results #12

Closed: SirRob1997 closed this issue 3 years ago

SirRob1997 commented 3 years ago

I've been following the repository for a while, and it seems other people have observed something similar. The results presented in the paper seem not to be reproducible, even with the code you provide. Also, the method has changed heavily in the implementation over the last few weeks, at one point using downsampled middle layers, which isn't described in the paper. Very recently you also introduced `self.pecent = 3.0 / 10 + (epoch / interval) * 2.0 / 10` as a scheduling factor.
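For reference, the quoted scheduling factor is just a linear ramp. A minimal standalone sketch (assumptions: the names follow the snippet above, and `interval` is the total epoch budget, which the repository does not state explicitly):

```python
# Minimal sketch of the quoted schedule (assumption: `interval` is the
# total number of training epochs; names follow the snippet above).
def rsc_percent(epoch: int, interval: int) -> float:
    # Linear ramp from 30% at epoch 0 toward 50% at epoch == interval.
    return 3.0 / 10 + (epoch / interval) * 2.0 / 10

print(rsc_percent(0, 30))   # 0.3
print(rsc_percent(30, 30))  # 0.5
```

So the fraction of features the method challenges starts at 30% and grows by up to 20 percentage points over training.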

I've run the currently provided code, producing the following results. Funnily enough, the sketch domain seems to be the only one not underperforming. I've tested the performance over the last few weeks with the different iterations you had, but always observed similar underperformance; I just never bothered to create an issue since a few were already open for the code base at the time, e.g. #2 and #5.

Commit: 10540a86b5dea957355591ec66da85d959b94657
Details: environment as specified, default hyperparameters, running one script at a time, 5 runs, official PACS splits
Backbone: ResNet-18

PHOTO: Paper Result - 95.99%

Difference to observed maximum is -1.62%; difference to observed minimum is -2.76%.
Sample mean/std deviation for best validation performance: 93.73 +/- 0.4

Best val 0.958482, corresponding test 0.932335 - best test: 0.943713, best epoch: 17
Best val 0.95255, corresponding test 0.934132 - best test: 0.943114, best epoch: 17
Best val 0.946619, corresponding test 0.943713 - best test: 0.943713, best epoch: 3
Best val 0.950178, corresponding test 0.937126 - best test: 0.943114, best epoch: 18
Best val 0.953737, corresponding test 0.939521 - best test: 0.947904, best epoch: 19
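The summary line can be recomputed from the five "corresponding test" values above (assumption: the quoted std is the population std over the 5 runs, which matches the quoted 0.4):

```python
# Recomputing the PHOTO summary from the five "corresponding test" accuracies
# listed above (assumption: population std over runs, not sample std).
import statistics

photo_test = [0.932335, 0.934132, 0.943713, 0.937126, 0.939521]
mean = statistics.mean(photo_test) * 100
std = statistics.pstdev(photo_test) * 100
print(f"{mean:.2f} +/- {std:.1f}")  # 93.74 +/- 0.4
```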

ART: Paper Result - 83.43%


Difference to observed maximum is -1.75%; difference to observed minimum is -4.04%.
Sample mean/std deviation for best validation performance: 80.41 +/- 1.1

Best val 0.967742, corresponding test 0.816895 - best test: 0.817383, best epoch: 18
Best val 0.962779, corresponding test 0.794434 - best test: 0.807617, best epoch: 13
Best val 0.961538, corresponding test 0.800781 - best test: 0.81543, best epoch: 19
Best val 0.967742, corresponding test 0.814941 - best test: 0.825684, best epoch: 19
Best val 0.961538, corresponding test 0.793945 - best test: 0.805664, best epoch: 16

CARTOON: Paper Result - 80.31%

Difference to observed maximum is -1.47%; difference to observed minimum is -3.37%.
Sample mean/std deviation for best validation performance: 77.53 +/- 0.9

Best val 0.965251, corresponding test 0.781143 - best test: 0.808874, best epoch: 19
Best val 0.96139, corresponding test 0.773891 - best test: 0.799488, best epoch: 18
Best val 0.956242, corresponding test 0.765785 - best test: 0.789676, best epoch: 18
Best val 0.96139, corresponding test 0.788396 - best test: 0.799915, best epoch: 19
Best val 0.96139, corresponding test 0.767491 - best test: 0.800341, best epoch: 19

SKETCH: Paper Result - 80.85%

Difference to observed maximum is +1.05%; difference to observed minimum is -1.67%.
Sample mean/std deviation for best validation performance: 80.79 +/- 1.0

Best val 0.957792, corresponding test 0.811402 - best test: 0.811402, best epoch: 16
Best val 0.962662, corresponding test 0.805548 - best test: 0.808094, best epoch: 18
Best val 0.965909, corresponding test 0.791805 - best test: 0.799949, best epoch: 18
Best val 0.967532, corresponding test 0.819038 - best test: 0.820056, best epoch: 18
Best val 0.956169, corresponding test 0.811911 - best test: 0.82082, best epoch: 17
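A back-of-the-envelope check (stdlib only, using the per-domain means/stds quoted above and n = 5 runs; a sketch, not a formal test) suggests the photo/art/cartoon gaps are many standard errors of the mean away from the paper's numbers, while sketch is within run-to-run noise:

```python
# Rough significance sketch: how many standard errors of the mean separate
# each paper result from the observed mean? (values quoted above, 5 runs each)
import math

runs = 5
domains = {               # paper %, observed mean %, observed std %
    "photo":   (95.99, 93.73, 0.4),
    "art":     (83.43, 80.41, 1.1),
    "cartoon": (80.31, 77.53, 0.9),
    "sketch":  (80.85, 80.79, 1.0),
}
for name, (paper, mean, std) in domains.items():
    t = (paper - mean) / (std / math.sqrt(runs))  # gap in std. errors of mean
    print(f"{name:8s} gap {paper - mean:5.2f} pts, ~{t:4.1f} standard errors")
# photo/art/cartoon come out well above 5 standard errors; sketch near 0.
```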

BurningFr commented 3 years ago

In the paper https://arxiv.org/abs/2010.05785 they also report their reproduction results, which are likewise far from the proposed numbers. (screenshot of their results attached)

Justinhzy commented 3 years ago

Hi, since most DG datasets are relatively small compared with ImageNet, the results may fluctuate across environments. This also happens in other DG repositories. I am trying to address this by introducing a scheduling factor.

I will upload my models and logs so you can download them for testing; they are much better than your reported results, and a little better than my reported results.