This issue is a checklist tracking the degree of support in elm and related tools for different data structures, scikit-learn and custom estimators, and parallelism. This is an epic for the "Data Structure Flexibility" milestone of Phase II and is related to machine learning flexibility (at least in how we create PRs), but let's try to put most of the planning details in specific issues and keep this one as a long-term documentation reminder.
Data Structure Flexibility
Data structures to (ideally) support for most scikit-learn models and custom estimators (in approximate order of priority relative to most milestones' needs):
xarray_filters.MLDataset - From xarray_filters + Elm PR #192 refactor + ...
xarray.Dataset - Converted to an xarray_filters.MLDataset where needed
xarray.DataArray - When calling MLDataset.to_features()
dask.array - Elm PR #192 began bringing dask_searchcv base classes into elm, with support for dask data structures
dask.dataframe - My thought is that dask.array and dask.dataframe should be essentially interchangeable in elm (I'm not sure if that is the current status of dask_searchcv and related stacks)
numpy.array - This is the type supported by scikit-learn; included here as a reminder that elm's multi-model machine learning tools, e.g. EaSearchCV, need to support numpy except where specific methods in elm/xarray_filters/etc. require context metadata or spatial coordinates.
pandas.DataFrame - I'm not sure of the level of pandas support in scikit-learn; most people I have seen work with numpy and assemble inputs/outputs into pandas where needed for pre/postprocessing. (sklearn-pandas, for example, is a library I haven't tried personally.) Can we address pandas by just converting to dask.dataframe, so that we deal with dataframe support in one place?
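As a sketch of the single conversion layer the list above suggests, here is a minimal example assuming only numpy and pandas (the helper name `to_feature_matrix` is hypothetical, not part of elm's or xarray_filters' actual API):

```python
import numpy as np
import pandas as pd

def to_feature_matrix(X):
    """Hypothetical normalization helper: funnel supported inputs down to a
    2D numpy array that scikit-learn-style estimators accept.
    (Sketch only -- elm/xarray_filters would add MLDataset/dask branches here.)
    """
    if isinstance(X, pd.DataFrame):
        return X.to_numpy()  # one place to handle dataframe support
    if isinstance(X, np.ndarray):
        return X             # plain numpy passes through unchanged
    raise TypeError(f"unsupported input type: {type(X).__name__}")

df = pd.DataFrame({"band_1": [0.1, 0.2], "band_2": [0.3, 0.4]})
print(to_feature_matrix(df).shape)  # → (2, 2)
```

The design point is that converting pandas (or dask.dataframe) inputs in exactly one place keeps every estimator downstream working against a single array contract.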
Caveats:
Not all of the data structures above make sense for every transformer / estimator, e.g. the sklearn.cross_decomposition module has several estimators that take 2D X and 2D Y.
Estimator Flexibility
Support estimators/transformers:
As we start issues / PRs in elm / xarray_filters / etc. regarding data structure flexibility, let's relate them back to this issue so we can better track exactly which estimators/transformers have compatibility problems with each data structure.
Parallelism
What are the capabilities and limitations of the parallelism approach for each estimator/transformer and data structure combination? This needs to be better explained in documentation (now and ongoing). For example, most of elm's current parallelism favors the break-up-the-sample-data-into-separate-embarrassingly-parallel-fitting-jobs approach rather than the single-large-feature-matrix approach, but gradually we are also building single-large-feature-matrix methods (e.g. the work in dask-glm for large dask data structures; see also dask-ml).
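To make the distinction concrete, here is a toy sketch of the embarrassingly-parallel approach using only the standard library and numpy (`fit_one` and the least-squares "model" are stand-ins, not elm's API): fit one small model per sample chunk in parallel, rather than one model on a single large feature matrix.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fit_one(chunk):
    """Stand-in 'estimator': ordinary least squares fit on one sample chunk."""
    X, y = chunk
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0])

# Break the sample data into separate, independent fitting jobs...
chunks = []
for _ in range(4):
    X = rng.normal(size=(50, 2))
    y = X @ true_coef  # noiseless toy data so each chunk recovers true_coef
    chunks.append((X, y))

# ...and fit them in parallel (embarrassingly parallel: no communication
# between jobs, only an aggregation step at the end).
with ThreadPoolExecutor(max_workers=4) as pool:
    coefs = list(pool.map(fit_one, chunks))

print(np.allclose(np.mean(coefs, axis=0), true_coef))  # → True
```

The single-large-feature-matrix approach would instead hand all the samples to one (possibly distributed) solver, which is where dask-glm-style methods come in.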
cc @gbrener @hsparra