PeterDSteinberg opened 7 years ago
cc @gpfreitas
I introduced some TODOs into @gpfreitas's PR. In order to get the tests passing, we needed to remove the `chunk` keyword arguments from some of the functions. Here is the commit where the TODO locations were introduced: https://github.com/ContinuumIO/xarray_filters/pull/20/commits/5e3c7569e0865534f44f5269d57b20964ebafb76
Comments from @gpfreitas on the subject:
[Assuming that] at some point in time, you will switch wholesale from `sklearn.datasets.make_*` functions to `dask_ml.datasets.make_*` functions. The big pain in the code is trying to support both at the same time, depending on which one is implemented where. (I was using a simple priority rule: if `make_whatever` is implemented in `dask_ml`, use that; otherwise, look in `sklearn`. You could keep looking in more packages by adding them to that list, `sampling_source_packages`.) But that creates a pain with tests, because some keywords are required by the dask backend (like `chunks`) while not valid in sklearn. I wanted something that would be used like this:
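The priority rule described above can be sketched as a small lookup helper. The name `find_impl` is hypothetical, and stdlib modules stand in for `dask_ml.datasets` and `sklearn.datasets` so the sketch runs anywhere:

```python
import importlib

def find_impl(func_name, packages):
    """Return the first attribute `func_name` found among `packages`.

    `packages` is tried in priority order; packages that are not
    installed are skipped, mirroring the dask_ml-then-sklearn rule.
    """
    for pkg_name in packages:
        try:
            mod = importlib.import_module(pkg_name)
        except ImportError:
            continue  # package not installed; try the next one
        func = getattr(mod, func_name, None)
        if func is not None:
            return func
    raise ImportError('No package in %r implements %s' % (packages, func_name))

# In xarray_filters the list would be ['dask_ml.datasets', 'sklearn.datasets'];
# stdlib modules are used here so the example is self-contained.
sqrt = find_impl('sqrt', ['statistics', 'math'])
```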
```python
from sklearn.datasets import make_blobs
from xarray_filters.datasets import Convert  # currently NpXyTransformer

X, y = make_blobs(n_features=..., n_samples=...)
# or to_frame, to_array, etc.; type-specific args that you can see
# in the signature of the `to_*` method
ds = Convert(X, y).to_mldataset(arg1=..., arg2=...)
```
and if you want the data coming from dask everything would be the same, except the first line, which would be:
```python
from dask_ml.datasets import make_blobs
```
Peter wants it to be more like this:
```python
from xarray_filters.datasets import make_blobs

# use_dask=False for the sklearn backend; you have to figure out which
# kwargs are valid, and that depends on `astype`
ds = make_blobs(n_features=..., n_samples=..., use_dask=True,
                astype='mldataset', **kwargs)
```
In the end, it seemed to me the best way forward was to use the API I wanted as a "low-level" API that is more strict (no `*args`, `**kwargs`), and to use it to implement the higher-level API that Peter wanted (through a wrapping function called `_make_base` in the code that tries to generate functions with useful signatures and docstrings, at least in Python 3).
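A minimal sketch of that two-layer design, assuming illustrative converter names and a kwarg-splitting rule based on the strict low-level signatures (this is not the actual `_make_base` implementation, and `_fake_make_blobs` is a stand-in for an sklearn-style generator so the sketch runs anywhere):

```python
import inspect

def _to_frame(X, y, columns=None):
    """Strict low-level converter: only dataframe-specific keywords."""
    return {'astype': 'frame', 'columns': columns, 'X': X, 'y': y}

def _to_array(X, y):
    """Strict low-level converter: no extra keywords at all."""
    return {'astype': 'array', 'X': X, 'y': y}

_CONVERTERS = {'frame': _to_frame, 'array': _to_array}

def _make_base(maker):
    """Wrap an sklearn-style make_* function into the high-level API."""
    def wrapped(*args, astype='array', **kwargs):
        convert = _CONVERTERS[astype]
        # Split kwargs: converter-specific ones (read off the strict
        # low-level signature) vs. ones for the underlying make_* call.
        conv_names = set(inspect.signature(convert).parameters) - {'X', 'y'}
        conv_kwargs = {k: kwargs.pop(k) for k in list(kwargs) if k in conv_names}
        X, y = maker(*args, **kwargs)
        return convert(X, y, **conv_kwargs)
    return wrapped

# Hypothetical stand-in for sklearn.datasets.make_blobs
def _fake_make_blobs(n_samples=2):
    return list(range(n_samples)), [0] * n_samples

make_blobs = _make_base(_fake_make_blobs)
ds = make_blobs(n_samples=3, astype='frame', columns=['a'])
```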
The main issue at the API level here is that there are keyword arguments in the various `xarray_filters.datasets.make_*` functions that depend on two different things (even one would be confusing): the type of the underlying X, y arrays (dask vs. numpy), and the data structure you are converting to (mldataset, xarray.Dataset, pandas.DataFrame, or dask.dataframe, given by what you pass to the `astype` keyword when calling `xarray_filters.datasets.make_*`).
Example:

```python
make_blobs(..., chunks=2, astype='frame', columns=...)
```

- `chunks` can only be used because the underlying arrays are dask arrays. It would error with NumPy arrays (which is what sklearn returns).
- `columns` can only be used because we are saying `astype='frame'`; it's a dataframe-specific keyword argument. Pandas vs. dask makes no difference.

This is not exclusive to the dataframe conversion. Converting to MLDataset or xarray.Dataset objects has the same issue (naming dimensions, etc.).
This may be the best we can do to provide a simple one-liner for generating data, but I'm not sure. I don't have any ideas that are strictly better than what we have.
After speaking with @gpfreitas, it might make sense for us to refactor our approach so that Python 2 compatibility and integration with dask-ml become simpler. Here are some ideas:

- `make_blobs`, `make_classification`, `make_regression`, etc.
- `use_dask=True` (or whatever makes sense from an API standpoint)
See issue 67 in dask-ml. When that issue is addressed, fix the dask-ml imports in xarray_filters and their usage in this repo's datasets.py. We had intended to cover that in #20, but needed to address dask-ml issue 67 first.