Refactor generalize synthetic data usage and random_elm_store func

PeterDSteinberg commented 7 years ago

Currently there is a function called random_elm_store in the elm.sample_util.make_blobs module (previously it was in elm.pipeline.tests.util). This function needs a refactor for simplicity. More generally, we should aim to support all or most of the synthetic dataset creators of scikit-learn, perhaps even using the same function names and signatures (but returning an ElmStore rather than a numpy array). Doing so will help us in each task as well as in the generalization towards more data structures and ML on CSV / tabular data. Ideally, our synthetic data functions that wrap sklearn.datasets would take one or more extra arguments to select the returned data type among numpy arrays, pandas dataframes, xarray Dataset, ElmStore, etc.

PeterDSteinberg commented 7 years ago

A clarification related to the new xarray backend classes in earthio PR 12: ElmStore will be deprecated in favor of xarray backend classes. elm will be able to train/predict with those xarray backend classes and eventually with numpy arrays, pandas dataframes, and dask arrays and dataframes. In summary, we need a general wrapper interface to the synthetic dataset creators of scikit-learn where we can control whether it returns:

[x] any one of the xarray backend classes @gbrener is working on
[ ] pandas dataframe
[ ] dask dataframe
[ ] dask array
[x] numpy array
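
A minimal sketch of what such a wrapper interface could look like, using make_blobs as the example. The function name make_blobs_as and the astype argument are hypothetical, chosen only for illustration; they are not existing elm or earthio APIs. The dask cases are left out, and a plain xarray.Dataset stands in for the xarray backend classes.

```python
import numpy as np
from sklearn.datasets import make_blobs


def make_blobs_as(astype="numpy", **kwargs):
    """Call sklearn.datasets.make_blobs and convert the result.

    `astype` is a hypothetical argument name used for illustration only.
    All other keyword arguments are forwarded unchanged to make_blobs.
    """
    X, y = make_blobs(**kwargs)
    if astype == "numpy":
        return X, y
    feature_names = ["feat_{}".format(i) for i in range(X.shape[1])]
    if astype == "dataframe":
        import pandas as pd  # imported lazily so numpy-only use needs no pandas
        df = pd.DataFrame(X, columns=feature_names)
        df["y"] = y
        return df
    if astype == "dataset":
        import xarray as xr  # imported lazily for the same reason
        return xr.Dataset(
            {"X": (("sample", "feature"), X), "y": (("sample",), y)},
            coords={"feature": feature_names},
        )
    raise ValueError("unknown astype: {!r}".format(astype))


X, y = make_blobs_as(n_samples=50, n_features=3, random_state=0)
df = make_blobs_as("dataframe", n_samples=50, n_features=3, random_state=0)
ds = make_blobs_as("dataset", n_samples=50, n_features=3, random_state=0)
```

With a fixed random_state, the three calls hold the same values in different containers, which makes it easy to run one test against several data structures.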

IIRC, each of the sklearn.datasets synthetic data generators returns a 2-D numpy array X, and some also return the 1-D y corresponding to X. A few more specs for these synthetic data generators:

[ ] Add an optional True/False argument to the dataset creation functions that controls whether some typical spatial metadata is added to the data structure, where feasible. For example, if creating a synthetic xarray.Dataset, we should be able to add the .canvas attribute with reasonable synthetic data for the geo_transform and/or other metadata, e.g. typical LANDSAT radiance multipliers and offsets. It's not so important that the data are realistic; what matters is that they have the right attributes/structures for testing.
[x] Make the synthetic data generators have same arg spec as sklearn.datasets equivalent methods
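
As a rough illustration of the spatial metadata flag, the sketch below reshapes make_blobs output into one 2-D band per feature and optionally attaches placeholder metadata. The names make_blobs_raster and add_spatial_meta, and the attribute keys, are assumptions made for this example; the metadata goes into the plain Dataset.attrs dict rather than the elm .canvas attribute, and all values are invented.

```python
import xarray as xr
from sklearn.datasets import make_blobs


def make_blobs_raster(height=10, width=12, add_spatial_meta=False, **kwargs):
    """Build an xarray.Dataset with one (y, x) band per make_blobs feature.

    `add_spatial_meta` is a hypothetical flag name. Remaining keyword
    arguments are forwarded to sklearn.datasets.make_blobs.
    """
    kwargs["n_samples"] = height * width  # one sample per pixel
    X, _ = make_blobs(**kwargs)
    bands = {
        "band_{}".format(i): (("y", "x"), X[:, i].reshape(height, width))
        for i in range(X.shape[1])
    }
    ds = xr.Dataset(bands)
    if add_spatial_meta:
        # GDAL-style six-term affine transform:
        # (x origin, pixel width, row rotation, y origin, column rotation,
        #  pixel height). Values are placeholders, not real imagery metadata.
        ds.attrs["geo_transform"] = (500000.0, 30.0, 0.0, 4000000.0, 0.0, -30.0)
        # Placeholder radiance scaling per band (hypothetical key names).
        ds.attrs["radiance_mult"] = {name: 0.01 for name in bands}
        ds.attrs["radiance_add"] = {name: -50.0 for name in bands}
    return ds


with_meta = make_blobs_raster(4, 5, add_spatial_meta=True, n_features=3, random_state=0)
without_meta = make_blobs_raster(4, 5, n_features=3, random_state=0)
```

The point is only that the returned structure carries the attributes a test would look for; the numbers are not meant to be realistic.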

@gbrener This issue is not necessarily an immediate priority, but having easier synthetic data generators can help our testing all around, and it may be useful for testing your earthio PR 12. If it looks like it would make the testing for the upcoming earthio work easier and/or more robust, then you or I or a new team member can work on this as you handle earthio issue 12.

PeterDSteinberg commented 7 years ago

The synthetic data generator work has started here in xarray_filters

PeterDSteinberg commented 7 years ago

@gpfreitas made a lot of progress on xarray_filters.datasets, e.g. wrapping all sklearn.datasets.make_* functions for xarray_filters.MLDataset (a subclass of xarray.Dataset). Work remaining is described in xarray_filters issue 6.
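
For readers unfamiliar with the pattern, here is one way such wrapping could be done while keeping the sklearn signature visible. This is a sketch under stated assumptions, not the xarray_filters implementation: the decorator name wrap_xy is hypothetical, a plain xarray.Dataset stands in for MLDataset, and only generators that return an (X, y) pair by default are handled.

```python
import functools
import inspect

import sklearn.datasets as skd
import xarray as xr


def wrap_xy(func):
    """Wrap a make_* function returning (X, y) so it returns an xarray.Dataset.

    functools.wraps sets __wrapped__, so inspect.signature reports the
    signature of the original sklearn function.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        X, y = func(*args, **kwargs)
        return xr.Dataset(
            {"X": (("sample", "feature"), X), "y": (("sample",), y)}
        )
    return wrapper


# Hypothetical module-level names mirroring sklearn.datasets.
make_classification = wrap_xy(skd.make_classification)
make_regression = wrap_xy(skd.make_regression)
make_moons = wrap_xy(skd.make_moons)

moons = make_moons(n_samples=20, random_state=0)
same_signature = inspect.signature(make_moons) == inspect.signature(skd.make_moons)
```

Generators with other return shapes (for example those returning a single matrix or three values) would need their own conversion logic.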

PeterDSteinberg commented 7 years ago

This work has been mostly completed, with remaining issues now being created in xarray_filters.

ContinuumIO / elm

Refactor generalize synthetic data usage and random_elm_store func #150