Open PeterDSteinberg opened 7 years ago
The move itself can be delayed, but would be good to plan ahead.
@dharhas made a comment on earthio issue 16 that relates to the data catalogue, and it made me realize how to resolve some of the questions above regarding elm.sample_util in a way that is useful for NASA, ERDC, and others.
elm.sample_util is useful outside of ML, but the current challenge is that elm needs the step_mixin.py base class and so does earthio. We want the packages to be optionally coupled, and I'd rather not introduce a third package to solve the which-package-imports-which-package problem, nor use within-function imports excessively.
earthio issue 16, referenced above, mentions needs for:
- Reading collections of data into xarray.Dataset objects
- Beyond elm.sample_util filtering logic, it would be ideal to also have the elm.pipeline.Pipeline structure usable outside of ML too, such as chaining together filters before visualization with datashader or a zonal statistics summary.

elm.pipeline.Pipeline.fit_ensemble and elm.pipeline.Pipeline.fit_ea are currently contrived and should likely be converted to calls to the standalone functions fit_ensemble, fit_ea, and predict_many, passing in a Pipeline object or a Sequence of them (for the near term) or perhaps a scikit-learn model later. Here's why: currently the Pipeline in elm has a .ensemble attribute that is a list of Pipeline objects. It works okay, but it is confusing for a new user to inspect: do I look at the repr of pipe or the repr of each member of pipe.ensemble, and would/should I ever end up with something like pipe.ensemble[0].ensemble[0].ensemble?

The bullets below show an approximate chronological sequencing of the work needed in elm/earthio to meet NASA/ERDC requirements in ML, data downloading, and preprocessing. Note there is a fair amount of work on the viz/UI side of NASA/ERDC-funded projects that is not in the plan below, but my thought is that when we need to modify datashader, geoviews, holoviews, etc., the devs on those viz/UI projects can look at this rough elm/earthio plan and adjust viz/UI planning as needed (or vice versa). Also, for NASA, the UI work is later in Phase II, and it makes more sense to plan that in detail after progress in other areas in 2017 on NASA goals and 2017 ERDC UI work.
[ ] Make a new subpackage(s) in earthio that handles most of the capabilities described in earthio issue 16 - data catalogue (see the next bullet point regarding filters discussed in that earthio issue)
[ ] Move contents of elm.sample_util to earthio in a new subpackage called earthio.filters. We may run into some minor conflicts, such as:
  - elm.sample_util.transform covers PCA, manifold learning, and other sklearn.decomposition transforms and we may want to deal with this by:
    - Importing elm within functions of earthio.filters.transform so that elm is not a required package, OR
    - Changing earthio.filters.transform to just use scikit-learn directly without an elm import (this option is likely preferable to the option above, if feasible)
[ ] Move elm/pipeline/pipeline.py to a new subpackage called earthio.pipeline, then:
  - The fit_ensemble, fit_ea, and predict_many methods of the Pipeline base class
  - A Pipeline of operations in non-ML contexts, such as viz or a statistical analysis:
    - A Pipeline that uses a custom user-given function
    - A Pipeline that has steps requiring specific named xarray.DataArray objects to be in the object(s) passed between filter steps ("filter" is a better name than "sample_util"), such as a step requiring an elevation model and a precipitation raster.
    - A Pipeline using user-given simple functions to address the lack of standardization of units and metadata. As described on earthio issue 16, units/metadata are not really standardized, so the end user will in many cases need to write his/her own custom function to do units conversions or perhaps pull coordinate system info from a non-standard location. By building on xarray.Dataset and DataArrays, we can keep all the units/metadata we can find (and are doing this to some extent now), but the end user may need to do something custom with that metadata.
[ ] Change some of the tests and docs in elm to focus on calling fit_ensemble, fit_ea, and predict_many as functions rather than methods of Pipeline. Alternatively, if anyone has opinions on whether we should use a pattern like pipe.fit_ensemble(*args, **kwargs) vs fit_ensemble(pipeline_instances, *args, **kwargs), I think that would be a good discussion. One option is that earthio leaves placeholder ABC methods for these methods on Pipeline, but they can't actually be called unless you have elm installed. The other option is that we just stick with a functional approach to fitting/prediction. This bullet point of work should be done at pretty much the same time as the bullet point above regarding Pipeline's move to earthio.
[ ] After the elm/pipeline/pipeline.py move, the name elm.pipeline will not make as much sense. Consider simplifying elm.model_util and the other modules in elm.pipeline in some way and/or renaming. When we are at this point, we should critically review the related dask-searchcv and make issues/PRs that move most of that approach over to elm. Here are some things in dask-searchcv I would like to get into elm:
  - dask-searchcv has better organized the model fitting scores and other fitting info, such as grid scoring data structures consistent with scikit-learn conventions. Carrying that over to elm is part of NASA Phase II milestone 2 Improved Tools for Ensemble Fitting and Prediction. That milestone includes cross validation, hierarchical modeling, vote count ensemble averaging, and other multi-model ML options (some capabilities not yet in elm nor dask-searchcv)
  - fit_ensemble, fit_ea, and predict_many should be able to take a multi-model approach to a scikit-learn estimator by itself, as it is done now in dask-searchcv, I think (now fit_* and predict_many in elm take only Pipeline instances that wrap scikit-learn estimators placed in the final step of a Pipeline).
  - dask-searchcv handles data structures other than xarray.Dataset (this is something I want to carry over to elm but we can defer that part of the transition until later - see the late Phase II milestone 6: Data Structure Flexibility) (Deadline August 25, 2018)
[ ] Un-deprecate the elm-main CLI so that the ML can be driven by a yaml spec. Perhaps elm-main adds some yaml interpretation options on top of the yaml interpretation spec of earthio, where the earthio yaml spec covers data download, storage, transformations, and other capabilities described in earthio issue 16 and the elm added parts of the spec cover ensemble, evolutionary algorithm, prediction, statistical feature selection, etc.
[ ] Deadline October 26, 2017: Add some modules to earthio.filters to address the NASA milestone 3: Phase II - Zonal Statistics, Filters, and Change Detection.
[ ] Deadline December 27, 2017: Make issues/PRs for NASA milestone 6 Improved Support for Spectral Clustering / Embedding and Manifold Learning (by this point elm.pipeline and elm.model_selection may have new names and/or package structures).
[ ] Deadline August 25, 2018: Generalize elm to use numpy arrays, pandas dataframes, or dask arrays. This generalization could extend into earthio, but is lower priority there than in elm. The point of generalizing elm is to attract the non-Earth science ML crowd (anyone currently using scikit-learn). This bullet point falls under the Data Structure Flexibility Phase II milestone 6.
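To make the method-versus-function question in the plan above concrete, here is a minimal sketch of the functional pattern, fit_ensemble(pipeline_instances, *args, **kwargs). Everything in it is an assumption for illustration: MeanModel is a hypothetical stand-in for a Pipeline or scikit-learn estimator, and these fit_ensemble / predict_many signatures are not the ones currently in elm.

```python
from typing import Any, List, Sequence


class MeanModel:
    """Hypothetical stand-in for a Pipeline or scikit-learn estimator."""

    def __init__(self, offset: float = 0.0) -> None:
        self.offset = offset
        self.mean_ = None  # set by fit()

    def fit(self, X: Sequence[float], y: Sequence[float]) -> "MeanModel":
        # "Training" here just memorizes the mean of y plus an offset.
        self.mean_ = sum(y) / len(y) + self.offset
        return self

    def predict(self, X: Sequence[float]) -> List[float]:
        # Predict the memorized value for every sample in X.
        return [self.mean_ for _ in X]


def fit_ensemble(models: Sequence[Any], X, y) -> List[Any]:
    """Fit every ensemble member and return them as a flat list.

    The ensemble is just a sequence, so there is no nested
    `.ensemble` attribute to inspect.
    """
    return [model.fit(X, y) for model in models]


def predict_many(models: Sequence[Any], X) -> List[List[float]]:
    """Return one list of predictions per fitted ensemble member."""
    return [model.predict(X) for model in models]


ensemble = fit_ensemble([MeanModel(0.0), MeanModel(1.0)], [1, 2, 3], [2.0, 4.0, 6.0])
predictions = predict_many(ensemble, [10, 20])
```

With this shape, anything exposing fit and predict can be an ensemble member, which is one way the "perhaps a scikit-learn model later" idea could fit in without changing the function signatures.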
@PeterDSteinberg implementing a pipeline type approach for the filters and being able to chain filters and reuse created chains on new datasets has been a medium term goal of the work ERDC has been doing. I've periodically evaluated what is available in the python world but everything I found was very heavyweight or very domain specific. With the move to xarray as a base data structure the idea of an earthio.pipeline sounds very promising, +100 :)
Cross-post reminder: whatever we do with earthio.pipeline.Pipeline, we need to think about how best to use dask parallelism. #143 was related to parallelism. I closed #143 because we have a plan to put the Pipeline in earthio.pipeline, a new subpackage, and can address how best to break up the Pipeline steps in a dask graph when implementing it.
Orthogonal to where the changes end up residing in the codebase, we might consider using xarray's Dataset.pipe() feature in our implementation. It seems to be geared toward chaining operations in a pipeline fashion.
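A minimal sketch of the chaining style that pipe enables. To stay self-contained it uses a hypothetical stand-in class, Bands, whose pipe method follows the same func(self, *args, **kwargs) convention as xarray's Dataset.pipe; the filter functions and band names are assumptions as well.

```python
from typing import Callable, Dict, List


class Bands:
    """Hypothetical stand-in for an xarray.Dataset: named 1-D bands."""

    def __init__(self, data: Dict[str, List[float]]) -> None:
        self.data = data

    def pipe(self, func: Callable, *args, **kwargs):
        # Same convention as Dataset.pipe: call func with self first.
        return func(self, *args, **kwargs)


def scale(ds: Bands, factor: float) -> Bands:
    """Multiply every value in every band by `factor`."""
    return Bands({k: [v * factor for v in vals] for k, vals in ds.data.items()})


def select(ds: Bands, names: List[str]) -> Bands:
    """Keep only the named bands."""
    return Bands({k: ds.data[k] for k in names})


raw = Bands({"red": [100.0, 200.0], "nir": [300.0, 400.0], "qa": [0.0, 1.0]})
# Filters chain left to right, and the same chain can be reused on new data.
result = raw.pipe(select, ["red", "nir"]).pipe(scale, factor=0.5)
```

Because each filter is a plain function that takes the dataset as its first argument, the same functions could be called directly, chained with pipe, or wrapped as Pipeline steps.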
@gbrener - I agree Dataset.pipe is something we want to be using here. When Pipeline moves to earthio here are a few things to consider:
- .pipe?
- The Pipeline pattern of how it passes data between steps, e.g. having a single xarray.Dataset passed between steps with special names for y and sample_weight? I say this because it would make interactive usage more natural, as in when running a fit_transform method by itself, and it may help us in taking advantage of the planning and testing of .pipe. Currently an elm.pipeline.Pipeline step may pass data to the next step in one of the following forms:
  - X - where X is an xarray.Dataset or earthio.ElmStore (ElmStore is soon to be deprecated as part of this epic and our/others' work in xarray)
  - (X, y) - where X is as described above and y is a 1-D numpy array
  - (X, y, sample_weight) - where X and y are as mentioned above and sample_weight is a 1-D numpy array
- Pipeline for other data structures, e.g. taking advantage of dask-searchcv's distributed ML ideas with numpy arrays. If we run into cases where data structure generalization (supporting dask.array, dask.dataframe, numpy.array, or pandas.DataFrame) limits our hardening of the current approach based on xarray, then I think we should discuss further here and/or with NASA, and overall we should prefer xarray support.
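The three forms a step may pass along (X, (X, y), or (X, y, sample_weight)) could be normalized at step boundaries with a small helper. This is only a sketch: as_xy_weight is a hypothetical name, not an existing elm function, and it assumes X itself is never a tuple.

```python
from typing import Any, Optional, Tuple


def as_xy_weight(step_output: Any) -> Tuple[Any, Optional[Any], Optional[Any]]:
    """Normalize X, (X, y), or (X, y, sample_weight) to a 3-tuple.

    Assumes X itself is never a tuple; a missing y or sample_weight
    comes back as None.
    """
    if not isinstance(step_output, tuple):
        # Bare X with no targets or weights.
        return step_output, None, None
    if len(step_output) == 2:
        X, y = step_output
        return X, y, None
    if len(step_output) == 3:
        X, y, sample_weight = step_output
        return X, y, sample_weight
    raise ValueError("expected X, (X, y), or (X, y, sample_weight)")
```

A helper like this keeps each step free to return the shortest form while downstream code always unpacks three values. Passing a single xarray.Dataset with reserved names for y and sample_weight, as suggested above, would remove the need for it.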
Edit June 5, 2017: This issue was originally just about elm.sample_util's refactor, but the issue has become a long term planning epic for elm/earthio - See the bulleted plan comment further down the page

I am curious what the best approach is with the elm.sample_util subpackage:

Considerations:
- elm is for ML, earthio is for geographic file readers and geographic data structures, like xarray Datasets / DataArrays and reshaping them
- Some of elm.sample_util relates to general ML tools like PCA transforms or normalizers from sklearn.preprocessing, and other parts of elm.sample_util are related to satellite / Earth science-specific operations, e.g. band normed differences like NDVI, some simple plotting helpers, and 3-D time series or N-D cube reshaping and feature extraction