data61 / landshark

Large-scale spatial inference with Tensorflow.
Apache License 2.0

Add option for training fold "blocks" to avoid over-fitting #6

Open dtpc opened 5 years ago

dtpc commented 5 years ago

Models can overfit when training samples are spatially adjacent.

A way to mitigate this is to select a pixel block size when extracting training folds, such that pixels in the same local block are assigned to the same fold.

During cross-validation/model selection, this encourages the model to predict well outside the areas local to the training data.
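A minimal sketch of the block-based fold assignment idea (not the actual landshark implementation; the function name, arguments and NumPy-based grouping are assumptions): pixel coordinates are bucketed into blocks of `block_size` pixels, and whole blocks, rather than individual points, are assigned to folds.

```python
import numpy as np

def assign_block_folds(coords, block_size, nfolds, seed=0):
    """Assign each training point to a fold based on its spatial block.

    Points whose coordinates fall into the same block_size x block_size
    block always receive the same fold, so spatially adjacent samples are
    never split across train and test folds.
    """
    # Map each coordinate to an integer block index
    blocks = np.floor_divide(coords, block_size).astype(np.int64)
    # Collapse the per-point block indices into one id per distinct block
    _, block_ids = np.unique(blocks, axis=0, return_inverse=True)
    block_ids = block_ids.ravel()
    # Randomly assign each distinct block (not each point) to a fold;
    # this is why fold sizes can end up unequal
    rng = np.random.RandomState(seed)
    block_folds = rng.randint(0, nfolds, size=block_ids.max() + 1)
    return block_folds[block_ids]
```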

dtpc commented 5 years ago

I've implemented this here: https://github.com/dtpc/landshark/tree/feature/6-fold-blocks

It does not account for the distribution of training points over the area, so it can result in folds of unequal size.

Another approach I think would be useful is grouping based on some other training point property (e.g. https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data). Implementing this would require some more structural changes to the code, though. Currently the target HDF5 file only contains y and coord data.
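For reference, scikit-learn's `GroupKFold` illustrates the grouped splitting behaviour described above (the toy data here is made up; in landshark the group labels would have to come from an extra per-point property stored alongside the targets):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical example: X are covariates, y are targets, and `groups` is a
# per-point label (e.g. survey ID or geological domain).
X = np.random.randn(20, 3)
y = np.random.randn(20)
groups = np.repeat(np.arange(5), 4)  # five groups of four points each

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # All points sharing a group land entirely in train or entirely in test
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```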

dsteinberg commented 5 years ago

Oh yeah? Do you mean that when we select data randomly for our train/test folds, we can get an underestimate of the true error if our test points are often close to the training points?

Or by doing this are we testing if our model generalizes well away from the training data?

dtpc commented 5 years ago

The latter, although I think "away from the training data" may not be that far in some cases.

Typically the training data is heavily biased: sparse overall but often locally dense. I think this can lead to learning very localised models, especially if the targets are highly correlated spatially. In the extreme case where neighbouring pixels (and target values) are more or less identical, the model could potentially just learn to reproduce its inputs (this is even more of an issue if we have training points located within the same pixel). Such a model would be accurate, but probably not very useful for generating a predictive map.

So, I think there is a need for different ways of splitting train/test data to encourage a more general model during model selection.

dsteinberg commented 5 years ago

Yeah, agreed - a few more splitting methods would be useful. This problem in general is very hard, though: it's really hard to know how a model will behave "away" from the training data. The exception is maybe a Gaussian process with a prior distribution over kernel parameters - these sorts of models "revert" to their prior away from the data, and you can specify that prior (Gaussian processes where we "learn" the prior don't necessarily have this behaviour). There are also models where you can learn what your training data looks like, and detect when you are querying the model with data unlike it.
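A toy illustration of that reversion behaviour using scikit-learn's `GaussianProcessRegressor` (not landshark code; the kernel and data are made up): far from the training points the predictive mean returns to the prior mean and the predictive variance returns to the prior variance.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Fit a GP to a small cluster of 1D points
X_train = np.linspace(0.0, 1.0, 10).reshape(-1, 1)
y_train = np.sin(4.0 * X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)
gp.fit(X_train, y_train)

near_mean, near_std = gp.predict(np.array([[0.5]]), return_std=True)
far_mean, far_std = gp.predict(np.array([[10.0]]), return_std=True)
print(near_mean, near_std)  # close to the data: confident, follows the signal
print(far_mean, far_std)    # far from the data: mean near 0, std near the prior std
```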

dtpc commented 5 years ago

Yes, this is definitely not intended as a solution for covariate shift. I guess it's just about providing more flexibility around model selection/evaluation.