GeoStat-Framework / PyKrige

Kriging Toolkit for Python
https://pykrige.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[RFC] Data processing pipelines #56

Open rth opened 7 years ago

rth commented 7 years ago

In line with the code reorganization in issue #33, I was wondering what's your opinion on data pipelines. I could be wrong, but it looks like the current OrdinaryKriging etc. could be separated into several independent steps.

One could then potentially create, for instance, AnisotropyTransformer, CoordinateTransformer, KrigingEstimator classes (or some other names) and get the results by constructing a pipeline with sklearn.pipeline.Pipeline or sklearn.pipeline.make_pipeline. The advantage of this being that the different transformers could be composed and reused independently.
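
To make that concrete, here is a minimal sketch of what composing such a pipeline could look like (the building-block classes and their parameters are hypothetical here, assuming they follow the sklearn fit/transform convention):

    import numpy as np
    from sklearn.pipeline import make_pipeline

    X_obs = np.random.rand(50, 2)   # observation coordinates
    z_obs = np.random.rand(50)      # observed values

    # hypothetical building blocks, each following the sklearn
    # fit/transform (fit/predict for the final step) convention
    pipe = make_pipeline(
        CoordinateTransformer(coordinates_type='euclidean'),
        AnisotropyTransformer(scaling=2.0, angle=30.0),
        KrigingEstimator(variogram_model='spherical'),
    )
    pipe.fit(X_obs, z_obs)
    z_pred = pipe.predict(np.random.rand(10, 2))  # krige onto new points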

I'm not sure if that would be useful. Among other things, it could depend on how many more options we end up adding. For instance, the UniversalKriging class currently has 16 input parameters, which is already quite a bit. If we end up adding, say, a local variogram search radius, a kriging search radius, and a few others, splitting the processing into several steps might be a way to simplify the interface for users.

Just a thought... What do you think?

bsmurphy commented 7 years ago

I think this is a really good idea, especially as it can cut down a lot on redundancies in the code. You're right, all of the kriging classes rely on a few similar building blocks. Abstracting and consolidating the various building blocks would make code maintenance and future enhancements much easier.

In line with what you're suggesting, I was actually thinking that, regarding issue #52, it might be best to define a separate method or set of methods to calculate the statistics if the user feels the urge to do so. And now that I think about it, having an abstract kriging base class would be good, too: for example, OK is just a special case of UK, and they share a lot of things, so breaking things into distinct building blocks would help there as well. So I like this idea from that angle, too.
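
For what it's worth, here's a very rough sketch of what such a base class could look like (the names and the method split are placeholders, not a worked design):

    from abc import ABC, abstractmethod

    class BaseKriging(ABC):
        """Shared machinery: coordinates, variogram fitting, statistics."""

        def fit(self, X, z):
            # common steps shared by all kriging variants
            self._fit_variogram(X, z)
            return self

        @abstractmethod
        def _build_kriging_matrix(self, d):
            """Assemble the kriging system from the distance matrix d."""

        def _fit_variogram(self, X, z):
            ...  # common variogram estimation

    class UniversalKriging(BaseKriging):
        def _build_kriging_matrix(self, d):
            ...  # variogram terms plus drift/trend terms

    class OrdinaryKriging(UniversalKriging):
        # OK is a special case of UK (no drift terms), so it can
        # reuse nearly everything from the UK implementation
        ...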

I'm not very familiar with sklearn.pipeline, so I'll have to look at it a little more to get a better feel for how it works. In the meantime, though, what are you thinking for the steps towards this kind of refactoring?

Also, +1 for cleaning up the interface... all the kwargs are starting to seriously clutter things up...

mjziebarth commented 7 years ago

Seems like a good idea! Just to clear things up for me, as I haven't managed to look into the UK code yet: do the variogram statistics there rely solely on distances, as in the case of OK?

Then it should be possible to separate out the coordinate transformations without problems. Also, geographic coordinates would then be available in UK as well.

bsmurphy commented 7 years ago

Currently, the statistics do rely purely on the variogram/distances, as in the case of OK. That's not technically correct though, so it's an area for improvement...
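
To illustrate the "distances only" point: an empirical semivariogram can be estimated from pairwise distances and value differences alone, along these lines (a generic sketch, not PyKrige's actual implementation):

    import numpy as np
    from scipy.spatial.distance import pdist

    def empirical_semivariogram(X, z, bin_edges):
        """Bin 0.5 * (z_i - z_j)**2 by pairwise distance."""
        d = pdist(X)  # condensed vector of pairwise distances
        sv = 0.5 * pdist(z[:, None], metric='sqeuclidean')  # 0.5*(z_i - z_j)**2
        idx = np.digitize(d, bin_edges)
        # mean semivariance per distance bin (empty bins give nan)
        return np.array([sv[idx == k].mean() for k in range(1, len(bin_edges))])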

bsmurphy commented 7 years ago

Here are some more thoughts...

I do like the sklearn.pipeline idea, as it allows for a great deal of flexibility. However, I also think we should support a "simple" interface like we have currently (sort of). This could just be a wrapper around a pre-set pipeline in the background -- e.g., we could set up a 'normal' ordinary kriging pipeline without anisotropy in a convenience class/method. That way, a simple estimation routine is readily accessible for users who don't want to go through the trouble of setting up the full pipeline (or even thinking about setting it up). It seems to me that having a few quick-and-dirty methods/classes like this would be good.
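
Something along these lines, perhaps, as a rough sketch (the names are hypothetical; the wrapper just wires up a default pipeline):

    from sklearn.pipeline import make_pipeline

    class SimpleOrdinaryKriging(object):
        """Convenience wrapper around a pre-set pipeline (no anisotropy)."""

        def __init__(self, variogram_model='linear'):
            # hypothetical building blocks from the pipeline proposal
            self._pipe = make_pipeline(
                CoordinateTransformer(),
                KrigingEstimator(variogram_model=variogram_model),
            )

        def fit(self, X, z):
            self._pipe.fit(X, z)
            return self

        def predict(self, X):
            return self._pipe.predict(X)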

As I've said here and elsewhere, there are a lot of improvements that could be made to the code. I think isolating individual building blocks in this manner would make implementing those improvements much, much easier (and probably make contributing easier?). So maybe here's what I'd suggest, or envision as a start...

What do you guys think? I can start on an abstract Kriging class and the variogram stuff, as I need those for CPK...

rth commented 7 years ago

@bsmurphy I agree with this. We could keep global wrapper methods (the real issue is the long and repeated docstrings there, but if we can find a way to generate those from the individual blocks it would be quite simple to do).
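
One way to generate those could be to stitch the docstrings of the building blocks together at import time, roughly like this (just a sketch of the idea, with the hypothetical block classes from this proposal):

    def _combine_docstrings(*components):
        """Concatenate the docstrings of the individual building blocks."""
        return '\n'.join(c.__doc__ for c in components if c.__doc__)

    class OrdinaryKriging(object):
        """Convenience wrapper around the pre-set pipeline."""

    # the wrapper's parameter documentation comes from the blocks it wraps
    OrdinaryKriging.__doc__ += '\n' + _combine_docstrings(
        CoordinateTransformer, AnisotropyTransformer, KrigingEstimator)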

For the ND refactoring, I was thinking that we might be able to mostly keep backward compatibility, but honestly it might be simpler to make a v1.4 release now and say that in v2.0 we are going to break backward compatibility with the new API (also because we don't have much in the way of development resources for all of it). We could always make a v1.4.1 bug-fix release later. So, for instance, the approach could be something like the following:

  1. Make the online documentation happen and release v1.4 (this would involve creating a new git tag 1.4.0 and a new branch 1.4.X where we could apply bug fixes later if needed).
  2. Write down the specification for the v2.0 public API, say in a markdown file in doc/api_2.0_specs.md, using pull requests. We could spend some time in PRs there discussing the best names, public methods, etc. This would also leave some time for anyone unhappy with the breaking changes in v2.0 to let us know. For instance, the prototype for the anisotropy transformer (to be compatible with pipelines) could look as follows:

    class AnisotropyTransformer(object):
        """Anisotropy transformer.

        Parameters
        ----------
        scaling : float, optional
            Scalar stretching value to take into account anisotropy.
            Default is 1 (effectively no stretching). Scaling is applied
            in the y-direction in the rotated data frame (i.e., after
            adjusting for `angle`, if `angle` is not 0). This parameter
            has no effect if coordinates_type is set to 'geographic'.
        angle : float, optional
            CCW angle (in degrees) by which to rotate the coordinate
            system in order to take into account anisotropy. Default is
            0 (no rotation). Note that the coordinate system is rotated.
            This parameter has no effect if coordinates_type is set to
            'geographic'.

        Attributes
        ----------
        scaling_ : float
            See above.
        angle_ : float
            See above.
        """

        def __init__(self, scaling=1, angle=0):
            self.scaling = scaling
            self.angle = angle

        def fit(self, X, y=None):
            """Fit the anisotropy transformer on input data.

            Parameters
            ----------
            X : ndarray [n_samples, n_dim]
                input data
            """
            self.scaling_ = self.scaling
            self.angle_ = self.angle
            return self  # returning self keeps the class pipeline-compatible

        def transform(self, X, y=None):
            """
            etc.
            """
            # (see the sketch after this list for what this could do)

     This would also make us think about how not-yet-developed features (search radius, local variogram, etc.) could fit into this API.

  3. Define how we transition from the current state to the new API.
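
For illustration, in the 2D case the transform step would presumably amount to the rotation-then-stretch described in the docstring above. A rough sketch (the function name is made up, and the sign convention would need to match PyKrige's existing anisotropy adjustment):

    import numpy as np

    def _adjust_for_anisotropy_2d(X, scaling=1.0, angle=0.0):
        """Rotate 2D points into the CCW-rotated frame, then stretch y."""
        theta = np.deg2rad(angle)
        # rotating the coordinate system CCW by `angle` means applying
        # the inverse (CW) rotation to the point coordinates
        rot = np.array([[np.cos(theta), np.sin(theta)],
                        [-np.sin(theta), np.cos(theta)]])
        Xr = np.asarray(X, dtype=float) @ rot.T
        Xr[:, 1] *= scaling  # stretch in the rotated y-direction
        return Xr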

What do you think?