fatiando / verde

Processing and gridding spatial data, machine-learning style
https://www.fatiando.org/verde
BSD 3-Clause "New" or "Revised" License
599 stars 72 forks

Gridding of XYC categorical data #261

Open mycarta opened 4 years ago

mycarta commented 4 years ago

Description of the desired feature

Being able to grid XYC categorical data (where C is the categorical feature) would be very useful. I can think of at least two use cases:

  1. LABELED DATA

     1.1. Gridding of categorical geological data: for example, a well performance classification that is not a production number (high, medium, low), a fracture intensity, or another rock quality classification. Cross-validation would be important. For well performance, it would be nice to have the option of block cross-validation, since wells are often drilled in clusters where the reservoir is relatively uniform within a cluster (intra-area) but not necessarily homogeneous among clusters (inter-area).

     1.2. Gridding of geological facies. This is often done in the context of 3D geocellular modelling, but having a 2D implementation in Python would be great, with options for both cross-validation and weights (if facies probabilities are available).

  2. UNLABELED DATA

     I am thinking here of numerical categories, such as the output of clustering done with a Gaussian Mixture Model. Data would be in XYCP format, where P is the probability output, and it would be great to be able to grid it using the probability as a weight. In this case cross-validation would not be possible because there is no label to use as ground truth.

Are you willing to help implement and maintain this feature? Yes/No

No. In the sense that I would not be available for coding; but I would definitely be interested and available as a tester.

leouieda commented 4 years ago

@mycarta that's an interesting use case. This might be a bit challenging because we're then getting into spatial prediction of things that aren't well represented by a surface under a load. So it's likely that the best predictors wouldn't be the coordinates of the points. Instead, you'd likely want to use other features. This is related to #188 by @fmaussion. I understand the use case better now and might be able to form ideas on a possible implementation.

So what we would need is a way to wrap a scikit-learn estimator into a Verde gridder. This shouldn't be too hard. The assumption would be that the feature matrix is a column stack of the given "coordinates". See #268. I think that could be a general solution for this.
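A minimal sketch of what such a wrapper could look like, assuming only NumPy and scikit-learn. The class name `EstimatorGridder` and the synthetic XYC data are hypothetical and are not part of Verde's API; the sketch only mimics the `fit(coordinates, data, weights)` / `predict(coordinates)` calling convention, with the feature matrix built as a column stack of the coordinates.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


class EstimatorGridder:
    """Hypothetical wrapper (not a Verde class) around a scikit-learn estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    @staticmethod
    def _features(coordinates):
        # Feature matrix: one row per point, one column per coordinate.
        return np.column_stack([np.asarray(c).ravel() for c in coordinates])

    def fit(self, coordinates, data, weights=None):
        features = self._features(coordinates)
        labels = np.asarray(data).ravel()
        if weights is None:
            self.estimator.fit(features, labels)
        else:
            # Assumption: the wrapped estimator's fit accepts sample_weight.
            self.estimator.fit(
                features, labels, sample_weight=np.asarray(weights).ravel()
            )
        return self

    def predict(self, coordinates):
        # Return predictions with the same shape as the input coordinates.
        shape = np.asarray(coordinates[0]).shape
        return self.estimator.predict(self._features(coordinates)).reshape(shape)


# Synthetic XYC data: two clusters of wells with a categorical label each.
easting = np.array([0.0, 1.0, 0.5, 9.0, 10.0, 9.5])
northing = np.array([0.0, 0.5, 1.0, 9.0, 9.5, 10.0])
category = np.array(["low", "low", "low", "high", "high", "high"])

gridder = EstimatorGridder(KNeighborsClassifier(n_neighbors=1))
gridder.fit((easting, northing), category)

# Predict the category on a regular 5 x 5 grid.
grid_east, grid_north = np.meshgrid(np.linspace(0, 10, 5), np.linspace(0, 10, 5))
grid = gridder.predict((grid_east, grid_north))
```

The nearest-neighbour classifier is only a stand-in; any estimator with `fit` and `predict` could be passed in. The `weights` path would only work with estimators whose `fit` accepts `sample_weight`, which `KNeighborsClassifier` does not.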

Having the estimator wrapped by a gridder would allow use of any of our cross-validation tools.
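As a rough illustration of the block cross-validation asked for above, here is a sketch that uses scikit-learn's `GroupKFold` as a stand-in for a blocked splitter. The synthetic well clusters, the 5-unit block size, and the way block IDs are computed are all assumptions made for the example, not Verde's implementation.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Four synthetic well clusters, each with a single (hypothetical) class.
centers = np.array([[2.0, 2.0], [2.0, 8.0], [8.0, 2.0], [8.0, 8.0]])
cluster_class = np.array([0, 1, 1, 0])
points = np.vstack([c + rng.uniform(-1, 1, size=(10, 2)) for c in centers])
labels = np.repeat(cluster_class, 10)

# Assign every point to a square spatial block of side block_size.
block_size = 5.0
block_col = np.floor(points[:, 0] / block_size).astype(int)
block_row = np.floor(points[:, 1] / block_size).astype(int)
block_ids = block_row * 1000 + block_col  # unique integer per block

# GroupKFold keeps all points of a block in the same fold, so each
# score measures prediction for a block the model has never seen.
cv = GroupKFold(n_splits=4)
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), points, labels, groups=block_ids, cv=cv
)
```

Because whole blocks are held out, the scores reflect how well the model transfers between clusters (inter-area) rather than how well it interpolates within one (intra-area).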