Add seeding to dataset creation, remove label maps

brain-research / realistic-ssl-evaluation

Open source release of the evaluation benchmark suite described in "Realistic Evaluation of Deep Semi-Supervised Learning Algorithms"

Apache License 2.0

460 stars 98 forks source link

Add seeding to dataset creation, remove label maps #21

Closed craffel closed 5 years ago

craffel commented 5 years ago

Resolves #17 and #18. It was impossible to use the existing label maps because the datasets needed to be downloaded, which necessitated creating new label maps. Instead, we should just make the dataset and label map creation processes deterministic by manually setting seeds. This involved changing the README to include label map creation, so I also fixed #18 which specified ImageNet 32x32 creation in the wrong order.

avital commented 5 years ago

Code LGTM. Did you test that both build_tfrecords.py and build_label_maps.py now generate the same results when the seed is fixed?

avital commented 5 years ago

Do you happen to know if Python and numpy guarantee reliability when upgrading to new versions?

avital commented 5 years ago

Looks like https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.RandomState.html has explicit guarantees which seem stronger than calling into random directly. I don't feel too strongly about this but could be worth switching to that...

avital commented 5 years ago

Specifically, it's possible that a library that we import calls random unknowingly to us, or change when/whether they call random in a version upgrade. Using an explicit RandomState would relieve us of that (tiny) risk of non-determinism when upgrading dependencies.

craffel commented 5 years ago

Did you test that both build_tfrecords.py and build_label_maps.py now generate the same results when the seed is fixed?

Yes.

Do you happen to know if Python and numpy guarantee reliability when upgrading to new versions?

Not sure.

Specifically, it's possible that a library that we import calls random unknowingly to us, or change when/whether they call random in a version upgrade. Using an explicit RandomState would relieve us of that (tiny) risk of non-determinism when upgrading dependencies.

Ok, will make this change.

craffel commented 5 years ago

Ok, pushed a commit to switch to np.random.RandomState. The semantics in build_label_maps.py changed because of switching from random.sample to np.random.choice, PTAthoroughL.