chaosparrot / parrot.py

Computer interaction using audio and speech recognition
MIT License

Clean up data loading and randomization; add scheduler #29

Closed by pokey 1 year ago

pokey commented 1 year ago
chaosparrot commented 1 year ago

I initially placed a different validation set / train set on each net to make sure each net would see different data (thus performing better as an ensemble than if they all saw the same data). Otherwise the only variation they would have after many epochs would be the starting position (their random weight initialization), right?

My question would be: since 3 nets using different train / validation sets can use more data for their training, wouldn't they perform better on novel data? Given a test set which none of the models have seen, wouldn't 3 ensembled models with different training data perform better than an ensemble where each model saw exactly the same data?

(From a combined ensemble validation score I can definitely see the benefits of a single train / validation split, because with an ensemble where the validation set has been seen by some of the models there's obvious data pollution and the results will be skewed.)
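Roughly this kind of thing, as a sketch (names like `split_for_net` and `NUM_NETS` are illustrative, not the actual parrot.py data loading code):

```python
import random

NUM_NETS = 3
VALIDATION_RATIO = 0.2

def split_for_net(samples, seed, validation_ratio=VALIDATION_RATIO):
    # Shuffle a copy with a per-net seed so each member gets its own reproducible split
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - validation_ratio))
    return shuffled[:cutoff], shuffled[cutoff:]

# Stand-in data: the letters A..Z, as in the split examples further down the thread
samples = [chr(ord("A") + i) for i in range(26)]
for net_index in range(NUM_NETS):
    train_set, validation_set = split_for_net(samples, seed=net_index)
    print(f"net {net_index}: train={len(train_set)} samples, validation={validation_set}")
```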

pokey commented 1 year ago

Huh, interesting idea. Makes me a bit nervous, e.g. we'd want to make sure we know which data seed each ensemble member was using in case we want to resume from a checkpoint. But I guess it could work?

@ym-han any thoughts? Is this something you've seen before? Reminds me of k-fold cross validation tbh, tho not exactly the same

ym-han commented 1 year ago

I haven't looked at the code so I can't be sure, but it sounds like this could be bagging (or something similar). There's some discussion and a link to some references on the sklearn bagging classifier page (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html), which I'm going to quote:

A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

This algorithm encompasses several works from the literature. When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting [1]. If samples are drawn with replacement, then the method is known as Bagging [2]. When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces [3]. Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches [4].

Breiman 1996 "Bagging Predictors", the chapter on bagging in Richard Berk's Statistical Learning from a Regression Perspective, and the section on bagging in Elements of Statistical Learning seem useful if you want to read up more.
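For illustration only, here's a minimal bagging example with scikit-learn's BaggingClassifier on a toy dataset; this is not parrot.py's training code, just the "fit base estimators on random subsets, then aggregate" idea from the quote:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Toy classification data, split into a training set and a held-out test set
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 3 base estimators (decision trees by default) is fit on a random
# 80% subset of the training samples drawn with replacement; predictions are
# then aggregated across the ensemble.
bagging = BaggingClassifier(n_estimators=3, max_samples=0.8, bootstrap=True,
                            random_state=0)
bagging.fit(X_train, y_train)
print("held-out accuracy:", bagging.score(X_test, y_test))
```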

pokey commented 1 year ago

Yeah, that sounds right. But I presume all of those methods assume you're sampling the training data with a fixed held-out validation set, whereas here the validation set of one ensemble member is used as training data for another.

chaosparrot commented 1 year ago

Did some reading on KFold and cross validation, and Pokey is right in the sense that there is a held-out set kept separate from the validation sets for each model. I.e. given a data set of A, B, C ... Z and two models, these are the current splits (note the overlap):

| Model | Training | Validation |
| --- | --- | --- |
| Model A | A, B, C, D, E, F, G, H, I, O, P, Q, R | J, K, Y, L, M, N, Z |
| Model B | N, O, P, Q, R, S, T, U, V, F, G, H, I | W, X, Y, Z, D, E, L |
| Ensemble | - | J, K, L, M, N, O, W, X, Y, Z, D, E |

Whereas, in my opinion, the best split would be:

| Model | Training | Validation |
| --- | --- | --- |
| Model A | A, B, C, D, E, F, G, H, I, O, P, Q, R, S, T | J, K, L, M, N, O |
| Model B | N, O, P, Q, R, S, T, U, J, K, L, M, G, H, I | A, B, C, D, E, F |
| Ensemble | - | V, W, X, Y, Z |

Here we keep 10 percent of the total data set held out to test the ensemble on, use the remaining 90 percent for training and validation, and give each model a validation set that does not overlap with the others'.

The remaining issue is that we still need to persist the random seed so the split can be reproduced for checkpointing, but I think this would give the best model results without any data pollution in the validation sets.
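A minimal sketch of that split scheme (the helper name `make_splits` is assumed, not existing parrot.py code): hold out a slice for the ensemble test set, then give each model a non-overlapping validation fold from the remainder, all keyed to a persisted seed.

```python
import random

def make_splits(samples, num_models, seed, holdout_ratio=0.1):
    # Persist `seed` alongside checkpoints so the exact same split can be rebuilt
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)

    holdout_size = int(len(shuffled) * holdout_ratio)
    ensemble_test = shuffled[:holdout_size]   # never seen by any model
    remainder = shuffled[holdout_size:]

    # Give each model one non-overlapping validation fold; everything else is training
    fold_size = len(remainder) // num_models
    splits = []
    for i in range(num_models):
        validation = remainder[i * fold_size:(i + 1) * fold_size]
        training = [s for s in remainder if s not in validation]
        splits.append((training, validation))
    return splits, ensemble_test

samples = [chr(ord("A") + i) for i in range(26)]
splits, ensemble_test = make_splits(samples, num_models=2, seed=42)
for i, (training, validation) in enumerate(splits):
    print(f"model {i}: validation={validation}")
print("ensemble test set:", ensemble_test)
```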

ym-han commented 1 year ago

This table is helpful. Yes, I agree that it's important that whatever data is in the test set for the ensemble has not been used to either train or tune the hyperparameters of any of the models in the ensemble.

pokey commented 1 year ago

hmm I'd be tempted to either

Happy to hash this one out on Discord tho

chaosparrot commented 1 year ago

The second option seems fine for now; we can revisit the held-out data set at a later date.

pokey commented 1 year ago

Ok, I'll close this one for now; at some point it's probably worth cleaning up the split code and pulling the LR schedule stuff out of here, but this PR would need a lot of tweaking to get there. I pointed to this PR from the relevant issues.