ContinuumIO / xarray_filters

A Pipeline approach to chaining common xarray data structure conversions
3 stars 10 forks source link

[WIP] Major fixes to datasets.py #20

Closed gpfreitas closed 7 years ago

gpfreitas commented 7 years ago

Addressing issue #17.

gpfreitas commented 7 years ago

Right now (commit 08e99e3) the test suite passes if we use only functions from sklearn. You can check that yourself, by changing the line

https://github.com/ContinuumIO/xarray_filters/blob/08e99e3b2a01992155b8c1559af703acf309bc26/xarray_filters/datasets.py#L571

to _sampling_source_packages = [dask_ml.datasets, sklearn.datasets][1:] and rerunning the test suite (just pytest from the root of the repo).

Using the dask_ml backends (so using the code from that commit as written), we get some failures in the test suite (including unit and doctests):

============================= test session starts ==============================
platform darwin -- Python 3.5.4, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /Users/gpfreitas/gh/ContinuumIO/xarray_filters, inifile: pytest.ini
plugins: cov-2.3.1
collected 22 items

xarray_filters/datasets.py ..FF
xarray_filters/reshape.py ...
xarray_filters/tests/test_datasets.py .
xarray_filters/tests/test_pipeline.py ..
xarray_filters/tests/test_reshape.py ........
xarray_filters/tests/test_ts_grid_tools.py FFFF

The failures in datasets.py all have to do with the fact that you can't do

df['y'] = ...something

to add a column 'y' when df is a dask dataframe backed by a dask array. So we need a workaround here.

Many of the other failures (in test_ts_grid_tools.py) come from a missing chunks argument (as dask_ml is the default backend, and its sampling functions in dask_ml.datasets require a chunks argument).

gpfreitas commented 7 years ago

Also, the original code for datasets.py assumed only sklearn.datasets functions would be used. That assumption is reflected in docstrings and variable names (like skl_sampler_func). Now that we use multiple backends, that should be changed.

gpfreitas commented 7 years ago

The name NpXyTransformer is also not great: it's weird, and it doens't communicate the right assumptions anymore. It was originally initialized by a pair of X, y numpy arrays, but we were planning on having the class be initialized by any number of arrays with the same size in the first dimension, be them NumPy or dask arrays (it already works with dask arrays, except for the to_frame method, as mentioned above). Something like "DataConverter" would be nicer.

gpfreitas commented 7 years ago

@PeterDSteinberg

I'd suggest merging this because all the functionality related to MLDatasets seems to work.

The remaining problems listed above could be addressed in other issues.

If we want tests to pass before merging, we could do the little change that makes it support just the sklearn.datasets functions.

gbrener commented 7 years ago

Just fixed the outstanding merge conflicts after speaking to @gpfreitas .

PeterDSteinberg commented 7 years ago

I made a reminder issue for us to come back and fix any temporary dask-ml fixes we do here:

36