Neuroglycerin / neukrill-net-tools

Tools coded as part of the NDSB competition.
MIT License

DensePNGDataset class has to have options to split dataset #57

Closed gngdb closed 9 years ago

gngdb commented 9 years ago

Need to be able to split the dataset into training, validation and test sets, but obviously it has to be split in a stratified way. A three-way stratified split is required.

I can't help but think it's weird that this isn't done internally, splitting on each epoch instead of using the same split every single time. Seems like it's not a great way to do cross-validation.

You come across this before @matt-graham ?

gngdb commented 9 years ago

This could be annoying: the two different splits occur in the initialisation of two different objects from the same DensePNGDataset class, and neither knows which datapoints the other selected for its split. So we have to split in a way that's deterministic, and then have each object select its own portion.

matt-graham commented 9 years ago

The standard for pylearn2 is to just have a held-out set which you can then use with monitoring to check for overfitting on each epoch and terminate, adjust the LR schedule, etc. If you were to use a different split for each epoch then you would effectively be using the whole dataset to train the model, except that each epoch would be over some random subset of the whole set. Early stopping (or adjustment of the LR schedule) using the validation set would make less sense in this situation, as you would have partially fit to the validation set as well. Sander Dieleman, who won the Galaxy Zoo competition using convolutional networks, apparently used a 10% held-out set for validation.

It looks like there was some attempt to include a cross validation feature but the changes don't seem to have been merged in to the main codebase.

gngdb commented 9 years ago

Going to have to use a random seed to make sure splits are deterministically the same between different initialisations. Also, this kind of sucks because it means the code has to load the dataset three times and then throw away different parts of it. Don't see any other way to do it, though, while still playing ball with the YAML (and not using that would be even harder).
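The seeded-split idea could look roughly like this (a minimal sketch, not code from this repo; the function name and default fractions are made up). Sorting before shuffling and seeding a local RNG means every initialisation reproduces exactly the same partition, regardless of global random state:

```python
import random

def deterministic_split(items, fractions=(0.8, 0.1, 0.1), seed=42):
    """Partition items into train/validation/test deterministically.

    Every call with the same items and seed yields the same split,
    so separate dataset objects can each pick out their own portion.
    """
    items = sorted(items)       # canonical order before shuffling
    rng = random.Random(seed)   # local RNG, independent of global state
    rng.shuffle(items)
    n = len(items)
    n_train = int(fractions[0] * n)
    n_valid = int(fractions[1] * n)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])
```

Each of the three dataset initialisations would call this with the same seed and keep only its own slice, which is what forces the triple load the comment complains about.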

gngdb commented 9 years ago

Actually, I'm being stupid, I should just do a stratified split of the image paths before loading. That's faster and easier to keep track of.
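Splitting the paths rather than the loaded images might look something like this (a sketch under assumptions: `stratified_path_split` and the `label_of` callback are hypothetical names, not this repo's API). Grouping paths by class first and splitting each group separately is what keeps the split stratified:

```python
import random
from collections import defaultdict

def stratified_path_split(paths, label_of, fractions=(0.8, 0.1, 0.1), seed=42):
    """Three-way stratified split of image paths before loading.

    Paths are grouped by class label, each group is shuffled with a
    seeded RNG, and each group contributes proportionally to every
    split, so per-class proportions stay roughly equal throughout.
    """
    by_class = defaultdict(list)
    for p in paths:
        by_class[label_of(p)].append(p)

    rng = random.Random(seed)
    train, valid, test = [], [], []
    for label in sorted(by_class):          # fixed class order for determinism
        group = sorted(by_class[label])
        rng.shuffle(group)
        n = len(group)
        n_train = int(fractions[0] * n)
        n_valid = int(fractions[1] * n)
        train.extend(group[:n_train])
        valid.extend(group[n_train:n_train + n_valid])
        test.extend(group[n_train + n_valid:])
    return train, valid, test
```

Since only path strings are shuffled, this is cheap, and the three resulting lists can each be handed to a separate dataset object so nothing has to be loaded and thrown away.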

gngdb commented 9 years ago

That's what I did, and it seems to work. Haven't fully checked that it is stratified and that there is no overlap, but I don't see where it could go wrong.