JuliaAI / MLJBase.jl

Core functionality for the MLJ machine learning framework
MIT License

Where to implement spatial resampling methods #989

Open tiemvanderdeure opened 1 month ago

tiemvanderdeure commented 1 month ago

In my field (ecology/species distribution modelling), spatial resampling is very common. I've written some spatial ResamplingStrategy types, such as spatial cross-validation, and am deciding where to share that code. I'm considering either:

The problem with the last option is that right now it's not really possible to pass additional information about the data (such as the point locations) to machine. I'm hacking around this in SDM.jl by calling train_test_pairs directly.

I would like to hear what others think about this.

ablaom commented 1 month ago

Thanks @tiemvanderdeure for posing this interesting question. I'm trying to understand the required interface points better but have not done spatial resampling before. A ResamplingStrategy can have parameters. Is there a reason the "point location" cannot be one of these? Or are you saying it is needed by fit (in which case it is a hyperparameter??). Could you say a little more on this point?

tiemvanderdeure commented 1 month ago

It wouldn't be needed by fit, only by evaluate!.

In my field, observations might be locations where a species was/wasn't found. One then extracts information about these points, like climate, land use, distance to a road, etc, and fits a model based on these. The spatial resampling is used to make sure the model learned something about the species and not just the random spatial patterns.

So every row in X would have a point location, and a spatial resampling strategy would use these locations, e.g. to construct a grid and cross-validate grid cells instead of observations.
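As a rough illustration of the grid idea, here is a minimal sketch in plain Julia with no MLJBase dependency. The function name `spatial_block_pairs`, the `cellsize` argument, and the toy coordinates are all hypothetical, chosen only for this example. Each point is assigned to a square grid cell, and every occupied cell is held out once as the test set.

```julia
# Hypothetical helper: leave-one-grid-cell-out splitting.
# Returns one (train, test) pair of row indices per occupied grid cell.
function spatial_block_pairs(xs::AbstractVector{<:Real},
                             ys::AbstractVector{<:Real},
                             cellsize::Real)
    length(xs) == length(ys) ||
        throw(ArgumentError("xs and ys must have equal length"))
    # Integer (column, row) index of the grid cell containing each point.
    cells = [(floor(Int, x / cellsize), floor(Int, y / cellsize))
             for (x, y) in zip(xs, ys)]
    rows = 1:length(cells)
    folds = Tuple{Vector{Int},Vector{Int}}[]
    # Sort the occupied cells so the fold order is deterministic.
    for cell in sort(unique(cells))
        test = [i for i in rows if cells[i] == cell]
        train = [i for i in rows if cells[i] != cell]
        push!(folds, (train, test))
    end
    return folds
end

# Toy coordinates (made up for illustration), with 1x1 grid cells.
xs = [0.5, 1.5, 0.2, 1.8, 0.9]
ys = [0.5, 0.4, 0.1, 1.7, 0.3]
folds = spatial_block_pairs(xs, ys, 1.0)
```

Points 1, 3 and 5 share a cell here, so they are always held out together rather than being split between train and test. As far as I know, evaluate! also accepts an explicit vector of (train, test) row-index pairs as its resampling argument, so pairs built this way could be passed in directly, at the cost of computing them by hand for each dataset.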

If the points were a parameter of the ResamplingStrategy, that strategy instance could only be used with one particular X and y, which defeats the purpose a little bit.
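To make the limitation concrete, here is a sketch of what storing the locations in the strategy might look like. It is plain Julia: the abstract type below is a local stand-in for MLJBase.ResamplingStrategy so the snippet runs without MLJBase, and `SpatialGridCV` and its fields are hypothetical names. The method only mirrors the shape of the `train_test_pairs(strategy, rows)` method that a custom strategy implements.

```julia
# Local stand-in for MLJBase.ResamplingStrategy (assumption: lets this run
# without MLJBase installed).
abstract type ResamplingStrategy end

# Hypothetical strategy that carries the point locations as fields.
struct SpatialGridCV <: ResamplingStrategy
    xs::Vector{Float64}
    ys::Vector{Float64}
    cellsize::Float64
end

# Same shape as a `train_test_pairs(strategy, rows)` method in MLJBase.
function train_test_pairs(s::SpatialGridCV, rows)
    maximum(rows) <= length(s.xs) ||
        throw(ArgumentError("strategy holds locations for only $(length(s.xs)) rows"))
    # Grid cell of the point stored at row i.
    cell(i) = (floor(Int, s.xs[i] / s.cellsize), floor(Int, s.ys[i] / s.cellsize))
    # One fold per occupied cell: that cell is the test set, the rest is train.
    return [([i for i in rows if cell(i) != c], [i for i in rows if cell(i) == c])
            for c in unique(cell.(rows))]
end

# The strategy is built from the locations of one specific 4-row dataset.
strategy = SpatialGridCV([0.5, 1.5, 0.2, 1.8], [0.5, 0.4, 0.1, 1.7], 1.0)
folds = train_test_pairs(strategy, 1:4)
```

Calling `train_test_pairs(strategy, 1:6)` on a different, six-row dataset throws here, because the stored locations only describe the four rows the strategy was built for. A new strategy instance would be needed for every dataset.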

But the more I think about it, the more I realize that this might require quite a lot of changes to the interface to work.