Epic: elm/earthio refactoring long term plan

PeterDSteinberg commented 7 years ago

Edit June 5, 2017: This issue was originally just about elm.sample_util's refactor, but the issue has become a long term planning epic for elm/earthio - See the bulleted plan comment further down the page

I am curious what the best approach is with the elm.sample_util subpackage:

Considerations:

Separation of concerns: elm is for ML, earthio is for geographic file readers and geographic data structures, like xarray Datasets / DataArrays and reshaping them
Some of elm.sample_util relates to general ML tools like PCA transforms or normalizers from sklearn.preprocessing and other parts of elm.sample_util are related to satellite / Earth science-specific operations, e.g. band normed differences like NDVI, some simple plotting helpers, and 3-D time series or N-D cube reshaping and feature extraction

PeterDSteinberg commented 7 years ago

The move itself can be delayed, but it would be good to plan ahead.

PeterDSteinberg commented 7 years ago

@dharhas made a comment on earthio issue 16 about a data catalogue that made me realize how to resolve some of the questions above regarding elm.sample_util in a way that is useful for NASA, ERDC, and others.

A few points about current situation:

Much of elm.sample_util is useful outside of ML, but the current challenge is that both elm and earthio need the step_mixin.py base class. We want the packages to be optionally coupled, and I'd rather not introduce a 3rd package to solve the which-package-imports-which-package problem, nor rely heavily on within-function imports.
The data catalogue idea of earthio, referenced above, mentions needs for:
- Filters on the data downloaded
- Standardization of downloaded collections of data into xarray.Dataset objects
- Tracking metadata and units (but the comments on that issue discuss how difficult this can be in an automated fashion due to heterogeneity in how/where the metadata and units are reported)
I realized that in order to conveniently use the elm.sample_util filtering logic, it would be ideal to also have the elm.pipeline.Pipeline structure usable outside of ML too, such as chaining together filters before visualization with datashader or zonal statistics summary.
I have also felt that the object oriented approach in elm.pipeline.Pipeline.fit_ensemble and elm.pipeline.Pipeline.fit_ea is currently contrived and should likely be converted to calling the standalone functions fit_ensemble, fit_ea, and predict_many, passing in a Pipeline object or Sequence of them (for the near term) or perhaps a scikit-learn model later. Here's why: currently the Pipeline in elm has a .ensemble attribute that is a list of Pipeline objects. It works okay, but is confusing to inspect for the new user, i.e. do I look at the repr of pipe or the repr of each member of pipe.ensemble and would/should I ever end up with something like pipe.ensemble[0].ensemble[0].ensemble ?
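
A rough sketch of the functional style described in the last point, for illustration only. The Pipeline class here is a hypothetical stand-in, not the real elm.pipeline.Pipeline, and the fit_ensemble signature is an assumption rather than elm's current API. The point is that the ensemble is just a flat list returned by a function, so no Pipeline carries a nested .ensemble attribute.

```python
from typing import Any, Callable, List, Sequence


class Pipeline:
    """Hypothetical stand-in for elm.pipeline.Pipeline: an ordered list of steps."""

    def __init__(self, steps: Sequence[Callable[[Any], Any]]):
        self.steps = list(steps)
        self.fitted = False

    def fit(self, X: Any) -> "Pipeline":
        # A real Pipeline would fit each step; here each step is simply applied
        # in order and the instance records that fit was called.
        for step in self.steps:
            X = step(X)
        self.fitted = True
        return self


def fit_ensemble(pipelines: Sequence[Pipeline], X: Any) -> List[Pipeline]:
    """Fit every member and return a flat list; no member owns an ensemble."""
    return [pipe.fit(X) for pipe in pipelines]


def double(values):
    # Trivial placeholder step.
    return [2 * v for v in values]


members = [Pipeline([double]), Pipeline([double, double])]
fitted = fit_ensemble(members, [1, 2, 3])
```

With this shape there is only one thing to inspect, the returned list, and something like pipe.ensemble[0].ensemble[0] cannot arise.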

Proposed plan to address the concerns above

The bullets below show an approximate chronological sequencing of the work needed in elm/earthio to meet NASA/ERDC requirements in ML, data downloading, and preprocessing. Note there is a fair amount of work in the viz/UI side of NASA/ERDC funded projects that is not in the plan below, but my thought is that when we need to modify datashader, geoviews, holoviews, etc, the devs on those viz/UI projects can look at this rough elm/earthio plan and adjust viz/UI planning as needed (or vice versa). Also for NASA, the UI work is later in Phase II and it makes more sense to plan that in detail after progress in other areas in 2017 on NASA goals and 2017 ERDC UI work.

[ ] Make a new subpackage(s) in earthio that handles most of the capabilities described in earthio issue 16 - data catalogue (see the next bullet point regarding filters discussed in that earthio issue)
[ ] Move contents of elm.sample_util to earthio in a new subpackage called earthio.filters. We may run into some minor conflicts, such as:
- [ ] elm.sample_util.transform covers PCA, manifold learning, and other sklearn.decomposition transforms and we may want to deal with this by:
  - Importing elm within functions of earthio.filters.transform so that elm is not a required package, OR
  - Seeing if we can get the earthio.filters.transform to just use scikit-learn directly without an elm import (this option is likely preferable to the option above, if feasible)
[ ] Move elm/pipeline/pipeline.py to a new subpackage called earthio.pipeline, then
- [ ] Remove or deprecate temporarily the fit_ensemble, fit_ea, and predict_many methods of the Pipeline base class
- [ ] Write some tests, examples, and docs that ensure we can run a Pipeline of operations in non-ML contexts, such as viz or a statistical analysis:
  - [ ] A Pipeline that uses a custom user-given function
  - [ ] A Pipeline that has steps requiring specific named xarray.DataArray objects to be in the object(s) passed between filter steps ("filter" is a better name than "sample_util"), such as a step requiring an elevation model and a precipitation raster.
  - [ ] A Pipeline using user-given simple functions to address the lack of standardization of units and metadata. As described on earthio issue 16, units/metadata are not really standardized, so the end user will in many cases need to write his/her own custom function to do units conversions or perhaps pull coordinate system info from a non-standard location. By building on xarray.Dataset and DataArrays, we can keep all the units/metadata we can find (and are doing this to some extent now), but the end user may need to do something custom with that metadata.
[ ] Change some of the tests and docs in elm to focus on calling fit_ensemble, fit_ea, and predict_many as functions rather than methods of Pipeline. Alternatively, if anyone has opinions on whether we should use a pattern like pipe.fit_ensemble(*args, **kwargs) vs fit_ensemble(pipeline_instances, *args, **kwargs), I think that would be a good discussion. One option is that earthio leaves placeholder ABC methods for these on Pipeline, but they can't actually be called unless you have elm installed. The other option is that we just stick with a functional approach to fitting/prediction. This bullet point of work should be done at pretty much the same time as the bullet point above regarding Pipeline's move to earthio.
[ ] After the elm/pipeline/pipeline.py move, the name elm.pipeline will not make as much sense. Consider simplifying elm.model_util and the other modules in elm.pipeline in some way and/or renaming. When we are at this point, we should critically review the related dask-searchcv and make issues/PRs that move most of that approach over to elm. Here are some things in dask-searchcv I would like to get into elm:
- [ ] Deadline August 26, 2017: dask-searchcv has better organized the model fitting scores and other fitting info, such as grid scoring data structures consistent with scikit-learn conventions. Carrying that over to elm is part of NASA Phase II milestone 2 Improved Tools for Ensemble Fitting and Prediction. That milestone includes cross validation, hierarchical modeling, vote count ensemble averaging, and other multi-model ML options (some capabilities not yet in elm nor dask-searchcv)
- [ ] Deadline August 26, 2017: fit_ensemble, fit_ea, and predict_many should be able to take a multi-model approach to a scikit-learn estimator by itself, as is done now in dask-searchcv, I think (now fit_* and predict_many in elm take only Pipeline instances that wrap scikit-learn estimators placed in the final step of a Pipeline).
- [ ] dask-searchcv handles data structures other than xarray.Dataset (this is something I want to carry over to elm but we can defer that part of the transition until later - see the late Phase II milestone 6: Data Structure Flexibility) (Deadline August 25, 2018)
[ ] Un-deprecate the elm-main CLI so that the ML can be driven by a yaml spec. Perhaps elm-main adds some yaml interpretation options on top of the yaml interpretation spec of earthio, where earthio yaml spec covers data download, storage, transformations, and other capabilities described in earthio issue 16 and the elm added parts of the spec cover ensemble, evolutionary algorithm, prediction, statistical feature selection, etc.
[ ] Deadline October 26, 2017: Add some modules to earthio.filters to address the NASA milestone 3: Phase II - Zonal Statistics, Filters, and Change Detection.
[ ] Deadline December 27, 2017: Make issues/PRs for NASA milestone 6 Improved Support for Spectral Clustering / Embedding and Manifold Learning (by this point elm.pipeline and elm.model_selection may have new names and/or package structures).
[ ] Deadline August 25, 2018: Generalize elm to use numpy arrays, pandas dataframes, or dask arrays. This generalization could extend into earthio, but is lower priority there than in elm. The point of generalizing elm is to attract the non-Earth science ML crowd (anyone currently using scikit-learn). This bullet point falls under the Data Structure Flexibility Phase II milestone 6.
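
To make the non-ML Pipeline bullets above more concrete, here is a minimal sketch of filter steps that declare which named variables they need, plus a user-written units fix. Everything in it is hypothetical: Step, run_pipeline, and the variable names are made up for illustration, and a plain dict of lists stands in for an xarray.Dataset so the example has no dependencies. It is not the planned earthio.pipeline API.

```python
from typing import Callable, Dict, List, Sequence

# A plain dict of name -> list of floats stands in for an xarray.Dataset here.
Data = Dict[str, List[float]]


class Step:
    """Hypothetical filter step: a function plus the variable names it needs."""

    def __init__(self, func: Callable[[Data], Data], requires: Sequence[str] = ()):
        self.func = func
        self.requires = tuple(requires)

    def __call__(self, data: Data) -> Data:
        # Fail early, naming the step, if a required variable is absent.
        missing = [name for name in self.requires if name not in data]
        if missing:
            raise KeyError(f"step {self.func.__name__} is missing inputs: {missing}")
        return self.func(data)


def run_pipeline(steps: Sequence[Step], data: Data) -> Data:
    # Pass the output of each step to the next one.
    for step in steps:
        data = step(data)
    return data


def inches_to_mm(data: Data) -> Data:
    # User-written units fix: assume precipitation arrived in inches.
    out = dict(data)
    out["precip"] = [v * 25.4 for v in data["precip"]]
    return out


def wet_highlands(data: Data) -> Data:
    # Flag cells above 1000 m that received more than 20 mm of precipitation.
    out = dict(data)
    out["wet_highland"] = [
        float(e > 1000.0 and p > 20.0)
        for e, p in zip(data["elevation"], data["precip"])
    ]
    return out


steps = [
    Step(inches_to_mm, requires=["precip"]),
    Step(wet_highlands, requires=["elevation", "precip"]),
]
result = run_pipeline(steps, {"elevation": [500.0, 1500.0], "precip": [1.0, 1.0]})
```

Declaring requirements on the step means a missing elevation model is reported before the step runs, instead of surfacing as an error deep inside the user's function.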

dharhas commented 7 years ago

@PeterDSteinberg implementing a pipeline type approach for the filters and being able to chain filters and reuse created chains on new datasets has been a medium term goal of the work ERDC has been doing. I've periodically evaluated what is available in the python world but everything I found was very heavyweight or very domain specific. With the move to xarray as a base data structure the idea of an earthio.pipeline sounds very promising, +100 :)

PeterDSteinberg commented 7 years ago

Cross-post reminder: whatever we do with earthio.pipeline.Pipeline we need to think about how best to use dask parallelism. #143 was related to parallelism. I closed #143 because we have a plan to put the Pipeline in earthio.pipeline, a new subpackage, and can address how best to break up the Pipeline steps in a dask graph when implementing it.

gbrener commented 7 years ago

Orthogonal to where the changes end up residing in the codebase, we might consider using xarray's Dataset.pipe() feature in our implementation. It seems to be geared toward chaining operations in a pipeline fashion.

PeterDSteinberg commented 7 years ago

@gbrener - I agree Dataset.pipe is something we want to be using here. When Pipeline moves to earthio here are a few things to consider:

How do we take advantage of .pipe?
Should we rethink the current Pipeline pattern of how it passes data between steps, e.g. having a single xarray.Dataset passed between steps with special names for y and sample_weight? I say this because it would make interactive usage more natural, as in when running a fit_transform method by itself, and it may help us in taking advantage of the planning and testing of .pipe. Currently an elm.pipeline.Pipeline step may pass data to the next step in one of the following forms:
- X - where X is an xarray.Dataset or earthio.ElmStore (ElmStore is soon to be deprecated as part of this epic and our/others' work in xarray)
- (X, y) - where X is as described above and y is a 1-D numpy array
- (X, y, sample_weight) - where X and y are as mentioned above and sample_weight is a 1-D numpy array
- Note we also have Phase II plans to generalize Pipeline for other data structures, e.g. taking advantage of dask-searchcv's distributed ML ideas with numpy arrays. If we run into cases where data structure generalization (supporting dask.array, dask.dataframe, numpy.array, or pandas.DataFrame) limits our hardening of the current approach based on xarray, then I think we should discuss further here and/or with NASA and overall we should prefer xarray support.
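
One possible shape for the single-Dataset idea, as a sketch only. It uses xarray's existing Dataset.pipe and Dataset.assign; the step functions add_ndvi and scale, the band names, and the choice of "y" and "sample_weight" as reserved variable names are assumptions for illustration, not a settled convention.

```python
import numpy as np
import xarray as xr


def add_ndvi(ds: xr.Dataset) -> xr.Dataset:
    # Normalized difference of the near-infrared and red bands.
    return ds.assign(ndvi=(ds["nir"] - ds["red"]) / (ds["nir"] + ds["red"]))


def scale(ds: xr.Dataset, name: str, factor: float) -> xr.Dataset:
    # Multiply one named variable by a constant, leaving the rest untouched.
    return ds.assign({name: ds[name] * factor})


ds = xr.Dataset(
    {
        "red": ("sample", np.array([0.1, 0.2, 0.3])),
        "nir": ("sample", np.array([0.5, 0.6, 0.3])),
        # Hypothetical reserved names for the target and the weights.
        "y": ("sample", np.array([1.0, 1.0, 0.0])),
        "sample_weight": ("sample", np.array([1.0, 0.5, 1.0])),
    }
)

# Each step takes and returns a Dataset, so y and sample_weight travel along
# without the (X, y, sample_weight) tuple forms.
out = ds.pipe(add_ndvi).pipe(scale, "ndvi", 100.0)
```

Because every step has the same Dataset-in, Dataset-out signature, any step can also be called by itself during interactive use.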

ContinuumIO / elm

Epic: elm/earthio refactoring long term plan #149
