PeterDSteinberg opened 7 years ago
cc @gpfreitas
I introduced some TODOs into @gpfreitas's PR. In order to get the tests passing, we needed to remove the `chunk` keyword arguments from some of the functions. Here is the commit where the TODO locations were introduced: https://github.com/ContinuumIO/xarray_filters/pull/20/commits/5e3c7569e0865534f44f5269d57b20964ebafb76
Comments from @gpfreitas on the subject:
[Assuming that] at some point in time, you will switch wholesale from `sklearn.datasets.make_*` functions to `dask_ml.datasets.make_*` functions. The big pain in the code is trying to support both at the same time, depending on which one is implemented where. (I was using a simple priority rule: if `make_whatever` is implemented in `dask_ml`, use that; otherwise, look in `sklearn`. You could keep looking in more packages by adding them to that list, `sampling_source_packages`.) But that creates a pain with tests, because some keywords are required by the dask backend (like `chunks`) while not valid in sklearn. I wanted something that would be used like this:
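The priority rule described above can be sketched as a small lookup helper. The name `find_impl` is hypothetical, and stdlib modules stand in for `dask_ml.datasets` and `sklearn.datasets` so the sketch runs anywhere:

```python
import importlib

def find_impl(func_name, packages):
    """Return the first attribute `func_name` found among `packages`.

    `packages` is tried in priority order; packages that are not
    installed are skipped, mirroring the dask_ml-then-sklearn rule.
    """
    for pkg_name in packages:
        try:
            mod = importlib.import_module(pkg_name)
        except ImportError:
            continue  # package not installed; try the next one
        func = getattr(mod, func_name, None)
        if func is not None:
            return func
    raise ImportError('No package in %r implements %s' % (packages, func_name))

# In xarray_filters the list would be ['dask_ml.datasets', 'sklearn.datasets'];
# stdlib modules are used here so the example is self-contained.
sqrt = find_impl('sqrt', ['statistics', 'math'])
```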
```python
from sklearn.datasets import make_blobs
from xarray_filters.datasets import Convert  # currently NpXyTransformer

X, y = make_blobs(n_features=..., n_samples=...)
# or to_frame, to_array, etc.; type-specific args that you can see
# in the signature of the `to_*` method
ds = Convert(X, y).to_mldataset(arg1=..., arg2=...)
```
and if you want the data coming from dask everything would be the same, except the first line, which would be:
```python
from dask_ml.datasets import make_blobs
```
Peter wants it to be more like this:
```python
from xarray_filters.datasets import make_blobs

# use_dask=False for the sklearn backend; you have to figure out which
# kwargs are valid, and that depends on `astype`
ds = make_blobs(n_features=..., n_samples=..., use_dask=True,
                astype='mldataset', **kwargs)
```
In the end, it seemed to me the best way forward was to use the API I wanted as a "low-level" API that is more strict (no `*args`, `**kwargs`), and to use it to implement the higher-level API that Peter wanted (through a wrapping function called `_make_base` in the code that tries to generate functions with useful signatures and docstrings, at least in Python 3).
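A minimal sketch of that two-layer design, assuming illustrative converter names and a kwarg-splitting rule based on the strict low-level signatures (this is not the actual `_make_base` implementation, and `_fake_make_blobs` is a stand-in for an sklearn-style generator so the sketch runs anywhere):

```python
import inspect

def _to_frame(X, y, columns=None):
    """Strict low-level converter: only dataframe-specific keywords."""
    return {'astype': 'frame', 'columns': columns, 'X': X, 'y': y}

def _to_array(X, y):
    """Strict low-level converter: no extra keywords at all."""
    return {'astype': 'array', 'X': X, 'y': y}

_CONVERTERS = {'frame': _to_frame, 'array': _to_array}

def _make_base(maker):
    """Wrap an sklearn-style make_* function into the high-level API."""
    def wrapped(*args, astype='array', **kwargs):
        convert = _CONVERTERS[astype]
        # Split kwargs: converter-specific ones (read off the strict
        # low-level signature) vs. ones for the underlying make_* call.
        conv_names = set(inspect.signature(convert).parameters) - {'X', 'y'}
        conv_kwargs = {k: kwargs.pop(k) for k in list(kwargs) if k in conv_names}
        X, y = maker(*args, **kwargs)
        return convert(X, y, **conv_kwargs)
    return wrapped

# Hypothetical stand-in for sklearn.datasets.make_blobs
def _fake_make_blobs(n_samples=2):
    return list(range(n_samples)), [0] * n_samples

make_blobs = _make_base(_fake_make_blobs)
ds = make_blobs(n_samples=3, astype='frame', columns=['a'])
```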
The main issue at the API level here is that there are keyword arguments in the various `xarray_filters.datasets.make_*` functions that depend on two different things (even one would be confusing): the type of the underlying X, y arrays (dask vs. numpy), and the data structure you are converting to (mldataset, xarray.Dataset, pandas.DataFrame, or dask.dataframe, given by what you pass to the `astype` keyword when calling `xarray_filters.datasets.make_*`).
Example:

```python
make_blobs(..., chunks=2, astype='frame', columns=...)
```

- `chunks` can only be used because the underlying arrays are dask arrays. It would error with NumPy arrays (which is what sklearn returns).
- `columns` can only be used because we are saying `astype='frame'`; it's a dataframe-specific keyword argument. Pandas vs. dask makes no difference.

This is not exclusive to the dataframe conversion. Converting to MLDataset or xarray.Dataset objects has the same issue (naming dimensions, etc.).
This may be the best we can do to provide a simple one-liner for generating data, but I'm not sure. I don't have any ideas that are strictly better than what we have.
After speaking with @gpfreitas, it might make sense for us to refactor our approach so that Python 2 compatibility and integration with dask-ml become simpler. Here are some ideas:

- `make_blobs`, `make_classification`, `make_regression`, etc.
- `use_dask=True` (or whatever makes sense from an API standpoint)
See issue 67 in dask-ml. When that issue is addressed, fix the dask-ml imports in xarray_filters and their usage in this repo's datasets.py. We had intended to cover that in #20, but needed to address dask-ml issue 67 first.