brain-research / realistic-ssl-evaluation

Open source release of the evaluation benchmark suite described in "Realistic Evaluation of Deep Semi-Supervised Learning Algorithms"
Apache License 2.0

Results in the paper for the low-data regime #9

Closed TheRevanchist closed 5 years ago

TheRevanchist commented 5 years ago

Hi guys,

I've been trying to replicate your results. While training the network with 500 labeled data points, I observed some odd behavior.

[screenshot from 2018-11-19 09-26-34: TensorBoard accuracy curve]

As you can see in TensorBoard, the network's accuracy increases quickly, peaks at around 70k iterations, and then collapses to 0.1, i.e. chance level. The results shown are for the Pi-model, but I saw the same behavior with Mean Teacher.

My questions are:

1) Is this the expected behavior (did you also see this phenomenon)?
2) If yes, how do you get the numbers in the paper? I assume you look for the highest peak on the validation set and then evaluate that checkpoint on the test set. Am I right in this assumption? (I also saw that one of the saved checkpoints is the one that gives the best result, hence my assumption.)
3) Why do you run evaluation on the test set so many times, instead of just once with the checkpoint that performs best on the validation set?

Cheers!

craffel commented 5 years ago

Is this the expected behavior (did you also see this phenomenon)?

Yes. Note that the hyperparameters were tuned for maximal validation accuracy, not for the validation accuracy specifically at 500k iterations. Note also that the models were tuned on SVHN with 1000 labeled examples and CIFAR-10 with 4000 labeled examples, and were not re-tuned for any of the other experiments.

I assume you look for the highest peak on the validation set and then evaluate that checkpoint on the test set.

That's right.

Why do you run evaluation on the test set so many times, instead of just once with the checkpoint that performs best on the validation set?

I'm not sure what you mean here. Are you asking why we evaluate the model on the test set throughout training? This way, when we want the test accuracy at the point of highest validation accuracy, we can grab it directly from the .events files without re-running eval. You'll find this is fairly common practice. We emphasize that we never tuned anything based on test accuracy; we only looked at the test accuracy at the step corresponding to the highest validation accuracy.
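
For concreteness, here is a minimal sketch of that lookup over the .events files, assuming the validation and test accuracies were logged as scalar summaries. The scalar tag names below are placeholders, not necessarily the tags this repo actually writes, so check your own run before using it:

```python
# Sketch: recover "test accuracy at the step of best validation accuracy"
# directly from TensorBoard event files, without re-running eval.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def test_acc_at_best_valid(events_dir,
                           valid_tag="eval/valid_accuracy",  # assumed tag name
                           test_tag="eval/test_accuracy"):   # assumed tag name
    acc = EventAccumulator(events_dir)
    acc.Reload()

    # Map step -> scalar value for both curves.
    valid = {e.step: e.value for e in acc.Scalars(valid_tag)}
    test = {e.step: e.value for e in acc.Scalars(test_tag)}

    # Step with the highest validation accuracy (ties broken by earliest step).
    best_step = max(sorted(valid), key=lambda s: valid[s])

    # Test accuracy logged at (or nearest to) that step.
    nearest_test_step = min(test, key=lambda s: abs(s - best_step))
    return best_step, valid[best_step], test[nearest_test_step]

step, best_valid, test_at_best = test_acc_at_best_valid("path/to/events_dir")
print(f"step={step} valid={best_valid:.4f} test={test_at_best:.4f}")
```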