ContinuumIO / elm

Phase I & part of Phase II of NASA SBIR - Parallel Machine Learning on Satellite Data
http://ensemble-learning-models.readthedocs.io

Epic: elm/earthio refactoring long term plan #149

Open PeterDSteinberg opened 7 years ago

PeterDSteinberg commented 7 years ago

Edit (June 5, 2017): This issue was originally only about refactoring elm.sample_util, but it has become a long-term planning epic for elm/earthio. See the bulleted plan in the comment further down the page.

I am curious what the best approach is with the elm.sample_util subpackage:

Considerations:

PeterDSteinberg commented 7 years ago

The move itself can be delayed, but it would be good to plan ahead.

PeterDSteinberg commented 7 years ago

@dharhas made a comment on earthio issue 16 about the data catalogue that made me realize how to resolve some of the questions above regarding elm.sample_util in a way that is useful for NASA, ERDC, and others.

A few points about the current situation:

Proposed plan to address the concerns above

The bullets below show an approximate chronological sequencing of the work needed in elm/earthio to meet NASA/ERDC requirements in ML, data downloading, and preprocessing. Note that the NASA/ERDC-funded projects also include a fair amount of viz/UI work that is not in the plan below. My thought is that when we need to modify datashader, geoviews, holoviews, etc., the devs on those viz/UI projects can look at this rough elm/earthio plan and adjust viz/UI planning as needed (or vice versa). Also, for NASA the UI work comes later in Phase II, so it makes more sense to plan it in detail after the 2017 progress on other NASA goals and on the 2017 ERDC UI work.

dharhas commented 7 years ago

@PeterDSteinberg Implementing a pipeline-type approach for the filters, with the ability to chain filters and reuse the resulting chains on new datasets, has been a medium-term goal of the work ERDC has been doing. I've periodically evaluated what is available in the Python world, but everything I found was either very heavyweight or very domain-specific. With the move to xarray as a base data structure, the idea of an earthio.pipeline sounds very promising, +100 :)
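To make the "chain filters and reuse the chain on new datasets" idea concrete, here is a minimal plain-Python sketch. It is only an illustration: `FilterChain`, `clip_negative`, and `scale` are hypothetical names, not part of earthio, elm, or any ERDC code, and lists of floats stand in for real datasets.

```python
from functools import reduce
from typing import Callable, List, Sequence

# A filter is any function that takes a dataset and returns a new one.
# Here a "dataset" is just a list of floats (an assumption for illustration).
Filter = Callable[[List[float]], List[float]]


class FilterChain:
    """Hypothetical reusable chain of filters (not an earthio API)."""

    def __init__(self, filters: Sequence[Filter]):
        self.filters = list(filters)

    def then(self, f: Filter) -> "FilterChain":
        # Return a new chain so that existing chains stay reusable.
        return FilterChain(self.filters + [f])

    def __call__(self, data: List[float]) -> List[float]:
        # Apply each filter in order, feeding each output to the next.
        return reduce(lambda acc, f: f(acc), self.filters, data)


def clip_negative(data: List[float]) -> List[float]:
    # Replace negative values with zero.
    return [max(x, 0.0) for x in data]


def scale(factor: float) -> Filter:
    # Build a filter that multiplies every value by `factor`.
    def _scale(data: List[float]) -> List[float]:
        return [x * factor for x in data]
    return _scale


# Build the chain once, then reuse it on two different datasets.
chain = FilterChain([clip_negative]).then(scale(2.0))
first = chain([-1.0, 0.5, 3.0])
second = chain([4.0, -2.0])
```

The same `chain` object is applied to both inputs without being rebuilt, which is the reuse property described above.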

PeterDSteinberg commented 7 years ago

Cross-post reminder: whatever we do with earthio.pipeline.Pipeline, we need to think about how best to use dask parallelism. #143 was related to parallelism; I closed it because we have a plan to put the Pipeline in earthio.pipeline, a new subpackage, and can address how best to break up the Pipeline steps in a dask graph when implementing it.

gbrener commented 7 years ago

Orthogonal to where the changes end up residing in the codebase, we might consider using xarray's Dataset.pipe() feature in our implementation. It seems to be geared toward chaining operations in a pipeline fashion.
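As a rough sketch of what that chaining looks like: `Dataset.pipe(func, *args, **kwargs)` calls `func(ds, *args, **kwargs)` and returns the result, so steps can be strung together. The step functions `mask_below` and `rescale`, the variable name `band_1`, and the toy values are made up for illustration and are not part of elm or earthio.

```python
import numpy as np
import xarray as xr


def mask_below(ds, threshold):
    # Hypothetical step: replace values below the threshold with NaN.
    return ds.where(ds >= threshold)


def rescale(ds, factor=1.0):
    # Hypothetical step: multiply every data variable by a constant.
    return ds * factor


# Toy 2x2 dataset standing in for a satellite band.
ds = xr.Dataset(
    {"band_1": (("y", "x"), np.array([[1.0, 5.0], [9.0, 3.0]]))}
)

# Each .pipe() call passes the dataset as the first argument of the step.
result = ds.pipe(mask_below, 3.0).pipe(rescale, factor=2.0)
```

Because each step is an ordinary function of a Dataset, the same steps could later be wrapped by whatever Pipeline class ends up in earthio.pipeline.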

PeterDSteinberg commented 7 years ago

@gbrener - I agree Dataset.pipe is something we want to be using here. When Pipeline moves to earthio, here are a few things to consider: