GeoStat-Framework / PyKrige

Kriging Toolkit for Python
https://pykrige.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[RFC] Data processing pipelines #56

Open rth opened 7 years ago

rth commented 7 years ago

In line with the code reorganization in issue #33, I was wondering what's your opinion on data pipelines. I could be wrong, but it looks like the current OrdinaryKriging etc. could be separated into several independent steps.

One could then potentially create, for instance, AnisotropyTransformer, CoordinateTransformer, KrigingEstimator classes (or some other names) and get the results by constructing a pipeline with sklearn.pipeline.Pipeline or sklearn.pipeline.make_pipeline. The advantage of this being that the different transformers could be composed and reused independently.
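
To make that concrete, here is a minimal sketch of what composing such a pipeline could look like (the building-block classes and their parameters are hypothetical here, assuming they follow the sklearn fit/transform convention):

    import numpy as np
    from sklearn.pipeline import make_pipeline

    X_obs = np.random.rand(50, 2)   # observation coordinates
    z_obs = np.random.rand(50)      # observed values

    # hypothetical building blocks, each following the sklearn
    # fit/transform (fit/predict for the final step) convention
    pipe = make_pipeline(
        CoordinateTransformer(coordinates_type='euclidean'),
        AnisotropyTransformer(scaling=2.0, angle=30.0),
        KrigingEstimator(variogram_model='spherical'),
    )
    pipe.fit(X_obs, z_obs)
    z_pred = pipe.predict(np.random.rand(10, 2))  # krige onto new points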

I'm not sure if that would be useful. Among other things, it could depend on how many more options we end up adding. For instance, the UniversalKriging class currently has 16 input parameters, which is already quite a bit. If we end up adding, say, a local variogram search radius, a kriging search radius, and a few others, splitting the processing into several steps might be a way to simplify the interface for users.

Just a thought... What do you think?

bsmurphy commented 7 years ago

I think this is a really good idea, especially as it can cut down a lot on redundancies in the code. You're right, all of the kriging classes rely on a few similar building blocks. Abstracting and consolidating the various building blocks would make code maintenance and future enhancements much easier.

In line with what you're suggesting, I was actually thinking that, regarding issue #52, it might be best to define a separate method or set of methods to calculate the statistics if the user feels the urge to do so. And now that I think about it, having an abstract kriging base class would be good, too: for example, OK is just a special case of UK, and they share a lot of things, so breaking things into distinct building blocks would help there as well. So I like this idea from that angle, too.
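
For what it's worth, here's a very rough sketch of what such a base class could look like (the names and the method split are placeholders, not a worked design):

    from abc import ABC, abstractmethod

    class BaseKriging(ABC):
        """Shared machinery: coordinates, variogram fitting, statistics."""

        def fit(self, X, z):
            # common steps shared by all kriging variants
            self._fit_variogram(X, z)
            return self

        @abstractmethod
        def _build_kriging_matrix(self, d):
            """Assemble the kriging system from the distance matrix d."""

        def _fit_variogram(self, X, z):
            ...  # common variogram estimation

    class UniversalKriging(BaseKriging):
        def _build_kriging_matrix(self, d):
            ...  # variogram terms plus drift/trend terms

    class OrdinaryKriging(UniversalKriging):
        # OK is a special case of UK (no drift terms), so it can
        # reuse nearly everything from the UK implementation
        ...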

I'm not very familiar with sklearn.pipeline, so I'll have to look at it a little more to get a better feel for how it works. In the meantime, though, what are you thinking for the steps towards this kind of refactoring?

Also, +1 for cleaning up the interface... all the kwargs are starting to seriously clutter things up...

mjziebarth commented 7 years ago

Seems like a good idea! Just to clear things up for me, as I haven't managed to look into the UK code yet: do the variogram statistics there rely solely on distances, as in the case of OK?

Then it should be possible to separate out the coordinate transformations without problems. Also, geographic coordinates would then be available in UK as well.

bsmurphy commented 7 years ago

Currently, the statistics do rely purely on the variogram/distances, as in the case of OK. That's not technically correct though, so it's an area for improvement...
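
To illustrate the "distances only" point: an empirical semivariogram can be estimated from pairwise distances and value differences alone, along these lines (a generic sketch, not PyKrige's actual implementation):

    import numpy as np
    from scipy.spatial.distance import pdist

    def empirical_semivariogram(X, z, bin_edges):
        """Bin 0.5 * (z_i - z_j)**2 by pairwise distance."""
        d = pdist(X)  # condensed vector of pairwise distances
        sv = 0.5 * pdist(z[:, None], metric='sqeuclidean')  # 0.5*(z_i - z_j)**2
        idx = np.digitize(d, bin_edges)
        # mean semivariance per distance bin (empty bins give nan)
        return np.array([sv[idx == k].mean() for k in range(1, len(bin_edges))])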

bsmurphy commented 7 years ago

Here are some more thoughts...

I do like the sklearn.pipeline idea, as it allows for a great deal of flexibility. However, I also think we should support a "simple" interface like we have currently (sort of). This could just be a wrapper around a pre-set pipeline in the background -- e.g., we could set up a 'normal' ordinary kriging pipeline without anisotropy in a convenience class/method. That way, a simple estimation routine is readily accessible for users who don't want to go through the trouble of setting up the full pipeline (or even thinking about setting it up). It seems to me that having a few quick-and-dirty methods/classes like this would be good.
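
Something along these lines, perhaps, as a rough sketch (the names are hypothetical; the wrapper just wires up a default pipeline):

    from sklearn.pipeline import make_pipeline

    class SimpleOrdinaryKriging(object):
        """Convenience wrapper around a pre-set pipeline (no anisotropy)."""

        def __init__(self, variogram_model='linear'):
            # hypothetical building blocks from the pipeline proposal
            self._pipe = make_pipeline(
                CoordinateTransformer(),
                KrigingEstimator(variogram_model=variogram_model),
            )

        def fit(self, X, z):
            self._pipe.fit(X, z)
            return self

        def predict(self, X):
            return self._pipe.predict(X)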

As I've said here and elsewhere, there are a lot of improvements that could be made to the code. I think isolating individual building blocks in this manner would make implementing those improvements much, much easier (and probably make contributing easier?). So maybe here's what I'd suggest, or envision as a start...

What do you guys think? I can start on an abstract Kriging class and the variogram stuff, as I need those for CPK...

rth commented 7 years ago

@bsmurphy I agree with this. We could keep global wrapper methods (the real issue is the long and repeated docstrings there, but if we can find a way to generate those from the individual blocks it would be quite simple to do).
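
One way to generate those could be to stitch the docstrings of the building blocks together at import time, roughly like this (just a sketch of the idea, with the hypothetical block classes from this proposal):

    def _combine_docstrings(*components):
        """Concatenate the docstrings of the individual building blocks."""
        return '\n'.join(c.__doc__ for c in components if c.__doc__)

    class OrdinaryKriging(object):
        """Convenience wrapper around the pre-set pipeline."""

    # the wrapper's parameter documentation comes from the blocks it wraps
    OrdinaryKriging.__doc__ += '\n' + _combine_docstrings(
        CoordinateTransformer, AnisotropyTransformer, KrigingEstimator)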

For the ND refactoring, I was thinking that we might be able to mostly keep backward compatibility, but honestly it might be simpler to make a v1.4 release now and say that in v2.0 we are going to break backward compatibility with the new API (also because we don't have much in the way of development resources for all of it). We could always make a v1.4.1 bug-fix release later. So, for instance, the approach could be something like the following:

  1. Make the online documentation happen and release v1.4 (this would involve creating a new git tag 1.4.0 and a new branch 1.4.X where we could apply bug fixes later if needed).
  2. Write down the specification for the v2.0 public API, say in a markdown file in doc/api_2.0_specs.md, using pull requests. We could spend some time in PRs there discussing the best names, public methods, etc. This would also leave some time for anyone unhappy with the breaking changes in v2.0 to let us know. For instance, the prototype for the anisotropy transformer (to be compatible with pipelines) could look as follows:

    class AnisotropyTransformer(object):
        """Anisotropy transformer.

        Parameters
        ----------
        scaling : float, optional
            Scalar stretching value to take into account anisotropy.
            Default is 1 (effectively no stretching). Scaling is applied
            in the y-direction in the rotated data frame (i.e., after
            adjusting for `angle`, if `angle` is not 0). This parameter
            has no effect if coordinates_type is set to 'geographic'.
        angle : float, optional
            CCW angle (in degrees) by which to rotate the coordinate
            system in order to take into account anisotropy. Default is
            0 (no rotation). Note that the coordinate system is rotated.
            This parameter has no effect if coordinates_type is set to
            'geographic'.

        Attributes
        ----------
        scaling_ : float
            See above.
        angle_ : float
            See above.
        """

        def __init__(self, scaling=1, angle=0):
            self.scaling = scaling
            self.angle = angle

        def fit(self, X, y=None):
            """Fit the anisotropy transformer on input data.

            Parameters
            ----------
            X : ndarray [n_samples, n_dim]
                input data
            """
            self.scaling_ = self.scaling
            self.angle_ = self.angle
            return self  # returning self keeps the class pipeline-compatible

        def transform(self, X, y=None):
            """
            etc.
            """
            # (see the sketch after this list for what this could do)

     This would also make us think about how not-yet-developed features (search radius, local variogram, etc.) could fit into this API.

  3. Define how we transition from the current state to the new API.
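
For illustration, in the 2D case the transform step would presumably amount to the rotation-then-stretch described in the docstring above. A rough sketch (the function name is made up, and the sign convention would need to match PyKrige's existing anisotropy adjustment):

    import numpy as np

    def _adjust_for_anisotropy_2d(X, scaling=1.0, angle=0.0):
        """Rotate 2D points into the CCW-rotated frame, then stretch y."""
        theta = np.deg2rad(angle)
        # rotating the coordinate system CCW by `angle` means applying
        # the inverse (CW) rotation to the point coordinates
        rot = np.array([[np.cos(theta), np.sin(theta)],
                        [-np.sin(theta), np.cos(theta)]])
        Xr = np.asarray(X, dtype=float) @ rot.T
        Xr[:, 1] *= scaling  # stretch in the rotated y-direction
        return Xr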

What do you think?