PeterDSteinberg closed 7 years ago
A clarification related to the new xarray backend classes in earthio PR 12: ElmStore will be deprecated in favor of xarray backend classes. elm will be able to train/predict with those xarray backend classes and, eventually, numpy arrays, pandas dataframes, and dask arrays and dataframes. In summary, we need a general wrapper interface to the synthetic dataset creators of scikit-learn where we can control which data structure it returns.
IIRC, each of the sklearn.datasets synthetic data generators returns a 2-D numpy matrix X, and some also return the 1-D y corresponding to X. A few more specs for these synthetic data generators:
When returning an xarray.Dataset, we should be able to add the .canvas attribute with reasonable synthetic data for the geo_transform and/or other metadata, e.g. typical LANDSAT radiance multipliers and offsets. It's not so important that the data are realistic, but more that they have the right attributes/structures for testing.
sklearn.datasets equivalent methods
@gbrener This issue is not necessarily an immediate priority, but easier synthetic data generators can help our testing all around, and they may be useful for testing your earthio PR 12. If it looks like it would make the testing for the upcoming earthio work easier and/or more robust, then you or I or a new team member can work on this as you handle earthio issue 12.
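To make the xarray.Dataset spec above a bit more concrete, here is a minimal sketch of a synthetic raster with the right kind of attributes for testing. It is only an illustration: the function name synthetic_raster, the band_N variable names, the attribute keys, and all numeric values are hypothetical placeholders, and the metadata is stored in plain Dataset.attrs rather than in a .canvas attribute.

```python
import numpy as np
import xarray as xr
from sklearn.datasets import make_blobs


def synthetic_raster(ny=4, nx=5, n_bands=3, random_state=0):
    """Hypothetical helper: turn make_blobs features into bands on a (y, x) grid."""
    # One sample per pixel, one feature per band.
    X, _ = make_blobs(n_samples=ny * nx, n_features=n_bands,
                      random_state=random_state)

    # GDAL-style affine transform:
    # (x_origin, x_res, row_rotation, y_origin, col_rotation, y_res).
    # Placeholder numbers, not taken from any real scene.
    geo_transform = (500000.0, 30.0, 0.0, 4000000.0, 0.0, -30.0)

    # Pixel-center coordinates derived from the transform.
    x_coords = geo_transform[0] + geo_transform[1] * (np.arange(nx) + 0.5)
    y_coords = geo_transform[3] + geo_transform[5] * (np.arange(ny) + 0.5)

    data_vars = {
        f"band_{i + 1}": (("y", "x"), X[:, i].reshape(ny, nx))
        for i in range(n_bands)
    }
    ds = xr.Dataset(data_vars, coords={"y": y_coords, "x": x_coords})

    # Synthetic metadata; the keys and values are assumptions for illustration.
    ds.attrs["geo_transform"] = geo_transform
    ds.attrs["radiance_mult"] = [0.01] * n_bands
    ds.attrs["radiance_add"] = [-0.1] * n_bands
    return ds
```

The values are not meant to be realistic; the point is that a test can rely on the coordinates and metadata attributes being present.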
The synthetic data generator work has started in xarray_filters.
@gpfreitas made a lot of progress on xarray_filters.datasets, e.g. wrapping all sklearn.datasets.make_* functions for xarray_filters.MLDataset (a subclass of xarray.Dataset). Work remaining is described in xarray_filters issue 6.
This work has been mostly completed, with remaining issues now being created in xarray_filters.
Currently there is a function called random_elm_store in the elm.sample_util.make_blobs module (previously it was in elm.pipeline.tests.util). This function needs a refactor for simplicity, and more generally we should aim to support all or most of the synthetic dataset creators of scikit-learn, perhaps even using the same function names and signatures (but returning an ElmStore rather than a numpy array). Doing so will help us in each task as well as in the generalization towards more data structures and ML on CSV / tabular data. Ideally, our synthetic data functions that are wrappers around sklearn.datasets could have extra argument(s) to determine the returned data type among numpy arrays, pandas dataframes, xarray Dataset, ElmStore, etc.
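As a rough sketch of the extra-argument idea, a wrapper around sklearn.datasets.make_blobs might look like the following. The wrapper name make_blobs_as, the astype argument, and the feat_N column names are assumptions made for illustration; they are not an existing elm or xarray_filters API, and ElmStore output is left out.

```python
import pandas as pd
from sklearn.datasets import make_blobs


def make_blobs_as(astype="numpy", **kwargs):
    """Hypothetical wrapper: call make_blobs and convert the result.

    kwargs are passed straight to sklearn.datasets.make_blobs
    (do not pass return_centers=True; this sketch expects (X, y)).
    """
    X, y = make_blobs(**kwargs)
    if astype == "numpy":
        return X, y

    names = [f"feat_{i}" for i in range(X.shape[1])]
    if astype == "pandas":
        df = pd.DataFrame(X, columns=names)
        df["y"] = y
        return df
    if astype == "xarray":
        # Imported lazily so the numpy/pandas paths work without xarray.
        import xarray as xr
        data_vars = {name: ("sample", X[:, i]) for i, name in enumerate(names)}
        data_vars["y"] = ("sample", y)
        return xr.Dataset(data_vars)
    raise ValueError(f"unsupported astype: {astype!r}")
```

For example, make_blobs_as("pandas", n_samples=10, n_features=3, random_state=0) returns a dataframe with columns feat_0, feat_1, feat_2, and y. The same pattern could be applied to the other sklearn.datasets.make_* generators by passing the generator function in as a parameter.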