fatiando / verde

Processing and gridding spatial data, machine-learning style
https://www.fatiando.org/verde
BSD 3-Clause "New" or "Revised" License

Deal with class imbalance in blocked cross-validation #262

Open AndrewAnnex opened 4 years ago

AndrewAnnex commented 4 years ago

Description of the desired feature

I am using GemPy to produce geologic models of multiple geologic layers simultaneously. In Verde, points only ever seem to belong to one surface and one class to predict, but in GemPy I of course have multiple layers. Additionally, there needs to be a way to guarantee that every class is present in the training dataset; otherwise the model will not be able to predict that class. That functionality already exists in scikit-learn's StratifiedKFold, but of course the blocked portion is not there.
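The stratification guarantee mentioned above can be sketched with scikit-learn's StratifiedKFold alone (no spatial blocking); all data here are dummy placeholders, not GemPy output. Because each fold preserves the class proportions, no class can end up entirely in the test split as long as every class has at least `n_splits` samples:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
classes = np.repeat(np.arange(22), 14)        # 22 classes, 14 samples each (dummy data)
coords = rng.uniform(size=(classes.size, 2))  # dummy easting/northing

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(coords, classes):
    # Every class is represented in the training set of every fold.
    assert set(classes[train_idx]) == set(range(22))
```

Note that StratifiedKFold ignores the coordinates entirely, which is exactly the gap this issue is about.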

example image of issue: [image: map view of the train/test split (left) and per-class histogram (right)]

In the map view on the left, the red dots are the test data and the blue dots are the training data. On the right, the test data is orange. There are 22 classes, but it is clear that around classes 14/15 the full sample of those classes is present only in the test dataset.

Are you willing to help implement and maintain this feature? Yes/No

Yes and no. I can dig into the code to see how difficult this is, but I think I would need a deep understanding of the referenced paper, and the changes needed to make this happen would diverge from that implementation enough to require an entirely new function. I also have my own ideas for how to make this work that I could try out and contribute back, but they won't be peer reviewed.


AndrewAnnex commented 4 years ago

Note: as a double check, I made the orange bars in the histogram transparent and fixed the bins to be consistent, and it is indeed still an issue.

leouieda commented 4 years ago

@AndrewAnnex thanks for posting this :+1: Let me see if I understand your problem. What are you trying to predict exactly? Is it the spatial distribution of 2 or more classes/categories? If so, then #261 and #268 will be of interest to you. We can't do it properly just yet since Verde only does regression-type models, but #268 would solve this.

The fold imbalance part is another issue. I'm not entirely sure how StratifiedKFold works. It might be a bit tricky to make it work in a blocked version but if you have any ideas I'm more than happy to take a look. It's fine if it's not published. If you come up with something interesting and want to publish it you would have the bonus of already having peer reviewed code to go with it :slightly_smiling_face:

@jessepisel this seems like it's something you might be interested in (or know how to proceed).

leouieda commented 4 years ago

Also, check out #254, which adds the BlockKFold class. I imagine a BlockStratifiedKFold would look similar in many aspects.

AndrewAnnex commented 4 years ago

@leouieda I am trying to predict the elevation of a given surface layer: essentially, given an x, y, z position, what stratigraphic surface is present at that position. This is broadly similar to producing a 3D spline interpolation of a surface for a single layer. Since I have multiple layers, I currently use the GemPy project, as it is one of the few open-source geomodeling packages available. Looking at #268, it is essentially a model XYZ -> C, where C is the target to be predicted and there are N possible categories in C. Otherwise, the first case in #261 is basically what I am doing.

As I understand it, BlockKFold works by defining spatially disjoint boxes so that when you split the data for a fold into test and train sets, you guarantee no mixing, along with some criterion that keeps the test blocks spatially distributed so they are not all in one corner or another. For a stratified block fold, I would imagine the blocks would need to be balanced so that there is an equal proportion of each class either in the test/train sets as a whole or within each spatial block (which seems harder).

My idea, although it is just a hunch at the moment, would be to use space-filling curves (like a Hilbert curve) to provide a one-dimensional index that could essentially produce another categorical or ordinal column through which the data could be spatially stratified; a conventional multi-label stratification could then be performed using built-in methods in sklearn. Space-filling curves can be tuned to create a desired number of uniform "blocks" (at order n it is a quadtree-like structure), and there are a few curves to choose between with different properties.
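A rough sketch of the Hilbert-curve idea above (function names and parameters are hypothetical): snap each point onto a 2**p by 2**p grid, compute its Hilbert-curve distance with the classic iterative algorithm, and coarsen that distance into contiguous 1-D "blocks" that could serve as the extra stratification column:

```python
import numpy as np

def hilbert_index(p, x, y):
    """Map integer grid coordinates (x, y) in [0, 2**p) to the 1-D
    Hilbert-curve distance (classic iterative xy-to-d algorithm)."""
    d = 0
    s = 2 ** (p - 1)
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_blocks(easting, northing, p=4, n_blocks=8):
    """Assign each point a block label by coarsening its Hilbert index
    into n_blocks contiguous segments of the curve."""
    side = 2 ** p
    # Normalize coordinates onto the integer grid [0, side - 1].
    gx = np.clip(((easting - easting.min()) / np.ptp(easting) * side).astype(int), 0, side - 1)
    gy = np.clip(((northing - northing.min()) / np.ptp(northing) * side).astype(int), 0, side - 1)
    d = np.array([hilbert_index(p, x, y) for x, y in zip(gx, gy)])
    return d * n_blocks // (side * side)

rng = np.random.default_rng(0)
east = rng.uniform(0, 100, 500)
north = rng.uniform(0, 100, 500)
blocks = hilbert_blocks(east, north)  # labels 0..7, roughly spatially contiguous
```

The resulting `blocks` column could then be combined with the class labels for stratification; because the Hilbert curve preserves locality, points in the same block tend to be spatial neighbors.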

I think it could also work, given enough data points, to first perform the block K-fold and then, for each fold, subsample the test/train data to equalize the counts of each class. But it depends on what BlockKFold is really doing, as it sounds like it tries to equalize the counts of data for each block?
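The "split first, then rebalance" idea above could be sketched as a simple post-split undersampler; the helper name, the random data, and the minimum-class strategy are all illustrative assumptions, not anything Verde provides:

```python
import numpy as np

def undersample_to_min_class(indices, classes, rng):
    """Keep an equal number of samples per class within `indices`
    (the count of the rarest class), dropping the surplus at random."""
    labels = classes[indices]
    present = np.unique(labels)
    n_keep = min(np.sum(labels == c) for c in present)
    kept = [rng.choice(indices[labels == c], size=n_keep, replace=False)
            for c in present]
    return np.concatenate(kept)

rng = np.random.default_rng(42)
classes = rng.integers(0, 5, size=200)
train_idx = np.arange(200)  # stand-in for a training fold from a blocked K-fold
balanced = undersample_to_min_class(train_idx, classes, rng)
counts = np.bincount(classes[balanced])
# Every class now appears with the same count in the balanced training set.
assert len({c for c in counts if c > 0}) == 1
```

Undersampling like this throws data away, so it only makes sense when each block leaves every class with enough samples to spare.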

There is also the imbalanced-learn (imblearn) package, which implements a number of undersampling techniques as well as oversampling methods like SMOTE, but those methods rely on some form of interpolation or sampling with replacement that I think is undesirable for my use case.