ContinuumIO / elm

Phase I & part of Phase II of NASA SBIR - Parallel Machine Learning on Satellite Data
http://ensemble-learning-models.readthedocs.io

Custom estimators / sklearn / dask and data structure flexibility checklist #201

Closed. PeterDSteinberg closed this issue 7 years ago.

PeterDSteinberg commented 7 years ago

This issue is a checklist tracking the degree of support in elm and related tools for different data structures, scikit-learn and custom estimators, and parallelism. It is an epic for the "Data Structure Flexibility" milestone of Phase II and is related to machine learning flexibility (at least in how we create PRs), but let's try to put most of the planning details in specific issues and keep this one as a long-term documentation reminder.

Data Structure Flexibility

Data structures to (ideally) support for most scikit-learn models and custom estimators (in approximate order of priority relative to most milestones' needs):

Caveats:
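
For concreteness, here is a minimal, hypothetical sketch (not elm's API) of what supporting an xarray data structure with a scikit-learn estimator involves: flattening gridded variables into the `(n_samples, n_features)` matrix scikit-learn expects and mapping predictions back onto the grid. The variable and dimension names below are made up for illustration.

```python
# Hypothetical sketch (not elm's API): flattening an xarray.Dataset of
# 2-D rasters into the (n_samples, n_features) matrix scikit-learn expects.
import numpy as np
import xarray as xr
from sklearn.cluster import KMeans

# Two synthetic "bands" on a 10x10 grid stand in for satellite layers.
ny, nx = 10, 10
ds = xr.Dataset(
    {
        "band_1": (("y", "x"), np.random.rand(ny, nx)),
        "band_2": (("y", "x"), np.random.rand(ny, nx)),
    },
    coords={"y": np.arange(ny), "x": np.arange(nx)},
)

# Each pixel becomes a sample; each data variable becomes a feature column.
X = np.column_stack([ds[name].values.ravel() for name in ds.data_vars])

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Reshape the flat labels back onto the original grid as a DataArray.
label_da = xr.DataArray(labels.reshape(ny, nx), dims=("y", "x"),
                        coords={"y": ds.y, "x": ds.x})
```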

Estimator Flexibility

Support estimators/transformers:

As we open issues/PRs in elm, xarray_filters, etc. regarding data structure flexibility, let's relate them back to this issue so we can better track exactly which estimators/transformers have compatibility problems with each data structure.
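
As a reference point for the kind of custom estimator elm needs to accept, here is a hedged sketch of a scikit-learn-compatible transformer built only on `BaseEstimator` and `TransformerMixin`. The class name and the feature-ratio logic are hypothetical, not anything in elm.

```python
# Hypothetical custom transformer following the scikit-learn fit/transform
# protocol -- the minimal interface elm would need to support.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class BandRatio(BaseEstimator, TransformerMixin):
    """Append the ratio of two feature columns as a new feature (illustrative only)."""

    def __init__(self, numerator=0, denominator=1):
        self.numerator = numerator
        self.denominator = denominator

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, self.numerator] / (X[:, self.denominator] + 1e-12)
        return np.column_stack([X, ratio])
```

Anything following this fit/transform protocol should, ideally, compose with the data structures listed above and with `sklearn.pipeline.Pipeline`.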

Parallelism

What are the capabilities and limitations of the parallelism approach for each estimator/transformer and data structure combination? This needs to be better explained in the documentation (now and ongoing). For example, most of elm's current parallelism favors breaking the sample data into separate, embarrassingly parallel fitting jobs rather than operating on a single large feature matrix, but we are also gradually building single-large-feature-matrix methods (e.g. the work in dask-glm for large dask data structures; see also dask-ml). A rough sketch of the two styles follows.
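
To make the distinction concrete, here is a hedged sketch of the two styles using plain dask and scikit-learn primitives rather than elm's own API; the chunk sizes and estimator choices are arbitrary.

```python
# Sketch contrasting the two parallelism styles discussed above.
import dask
import dask.array as da
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Style 1: embarrassingly parallel -- fit an independent model per sample chunk.
chunks = [np.random.rand(1000, 4) for _ in range(8)]

@dask.delayed
def fit_one(X):
    return MiniBatchKMeans(n_clusters=5, n_init=3).fit(X)

models = dask.compute(*[fit_one(X) for X in chunks])

# Style 2: one large distributed feature matrix (the dask-glm / dask-ml style),
# where a single model trains against a dask array that never has to fit in
# memory at once.
X_big = da.random.random((8 * 1000, 4), chunks=(1000, 4))
# e.g. dask_ml.cluster.KMeans().fit(X_big)  # requires dask-ml; shown only as a comment
```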

cc @gbrener @hsparra