ContinuumIO / elm

Phase I & part of Phase II of NASA SBIR - Parallel Machine Learning on Satellite Data
http://ensemble-learning-models.readthedocs.io

Refactor and generalize synthetic data usage and random_elm_store func #150

Closed PeterDSteinberg closed 7 years ago

PeterDSteinberg commented 7 years ago

Currently there is a function called random_elm_store in the elm.sample_util.make_blobs module (previously it was in elm.pipeline.tests.util). This function needs a refactor for simplicity. More generally, we should aim to support all or most of the synthetic dataset creators of scikit-learn, perhaps even using the same function names and signatures (but returning an ElmStore rather than a numpy array). Doing so will help us in each task, as well as in the generalization towards more data structures and ML on CSV / tabular data. Ideally, our synthetic data functions that wrap sklearn.datasets would take one or more extra arguments to select the returned data type: numpy arrays, pandas dataframes, xarray Dataset, ElmStore, etc.
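A minimal sketch of what such a wrapper could look like, not an existing elm API. The names `make_synthetic` and `as_type` are hypothetical, and the sketch assumes the chosen generator returns an `(X, y)` pair (as `make_blobs`, `make_classification`, and `make_regression` do by default). A plain `xarray.Dataset` stands in for ElmStore here.

```python
import numpy as np
from sklearn import datasets as sk_datasets


def make_synthetic(generator="make_blobs", as_type="numpy", **kwargs):
    """Hypothetical wrapper: call a sklearn.datasets generator that returns
    an (X, y) pair and convert the result to the requested container."""
    func = getattr(sk_datasets, generator)
    X, y = func(**kwargs)  # X: (n_samples, n_features), y: (n_samples,)
    names = ["feat_{}".format(i) for i in range(X.shape[1])]

    if as_type == "numpy":
        return X, y
    if as_type == "pandas":
        import pandas as pd  # imported lazily so numpy-only use needs no pandas
        df = pd.DataFrame(X, columns=names)
        df["y"] = y
        return df
    if as_type == "xarray":
        import xarray as xr  # imported lazily for the same reason
        # One 1-D variable per feature, all sharing a "sample" dimension.
        data_vars = {name: ("sample", X[:, i]) for i, name in enumerate(names)}
        data_vars["y"] = ("sample", y)
        return xr.Dataset(data_vars, coords={"sample": np.arange(X.shape[0])})
    raise ValueError("unknown as_type: {!r}".format(as_type))
```

For example, `make_synthetic("make_blobs", as_type="pandas", n_samples=50, n_features=3)` would give a dataframe with columns `feat_0`, `feat_1`, `feat_2`, and `y`. Keeping the sklearn keyword arguments untouched and adding a single selector argument is one way to preserve the original signatures.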

PeterDSteinberg commented 7 years ago

A clarification related to the new xarray backend classes in earthio PR 12: ElmStore will be deprecated in favor of xarray backend classes. elm will be able to train/predict with those xarray backend classes and, eventually, numpy arrays, pandas dataframes, and dask arrays and dataframes. In summary, we need a general wrapper interface to the synthetic dataset creators of scikit-learn where we can control whether it returns:

IIRC, each of the sklearn.datasets synthetic data generators returns a 2-D numpy array X, and some also return the 1-D y corresponding to X. A few more specs for these synthetic data generators:

@gbrener This issue is not necessarily an immediate priority, but easier synthetic data generators can help our testing all around, and they may be useful for testing your earthio PR 12. If it looks like it would make the testing for the upcoming earthio work easier and/or more robust, then you, I, or a new team member can work on this as you handle earthio issue 12.

PeterDSteinberg commented 7 years ago

The synthetic data generator work has started here in xarray_filters

PeterDSteinberg commented 7 years ago

@gpfreitas made a lot of progress on xarray_filters.datasets, e.g. wrapping all sklearn.datasets.make_* functions for xarray_filters.MLDataset (a subclass of xarray.Dataset). The remaining work is described in xarray_filters issue 6.
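As a rough illustration of the wrapping idea (not the actual xarray_filters code), the sketch below discovers the `make_*` functions in `sklearn.datasets` and wraps one with a converter. The helpers `list_generators`, `wrap_generator`, and `to_dataset` are hypothetical names, and a plain `xarray.Dataset` stands in for MLDataset. The converter assumes an `(X, y)` return value, which does not hold for every generator (for example, `make_spd_matrix` returns a single array).

```python
import functools

from sklearn import datasets as sk_datasets


def list_generators():
    """Names of the synthetic data generators exposed by sklearn.datasets."""
    return sorted(
        name for name in dir(sk_datasets)
        if name.startswith("make_") and callable(getattr(sk_datasets, name))
    )


def wrap_generator(func, converter):
    """Return a function with func's name and docstring whose result
    is passed through converter before being returned."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return converter(func(*args, **kwargs))
    return wrapper


def to_dataset(result):
    """Convert an (X, y) pair to an xarray.Dataset (stand-in for MLDataset)."""
    import xarray as xr  # imported lazily; only needed when converting
    X, y = result
    return xr.Dataset({"X": (("sample", "feature"), X), "y": ("sample", y)})


# Wrapping is lazy: xarray is only imported when make_blobs_ds is called.
make_blobs_ds = wrap_generator(sk_datasets.make_blobs, to_dataset)
```

Calling `make_blobs_ds(n_samples=10, n_features=2)` would then accept the same arguments as `sklearn.datasets.make_blobs` and return a dataset with `X` and `y` variables.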

PeterDSteinberg commented 7 years ago

This work has been mostly completed, with remaining issues now being created in xarray_filters.