dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

dask-ml make_classification keywords do not work #323

Open js3711 opened 6 years ago

js3711 commented 6 years ago

Supplying the following keywords to dask_ml.datasets.make_classification does not appear to have the expected result.

n_samples, n_informative, n_redundant, n_features

Compare the feature correlation matrices (images below) produced by scikit-learn's make_classification and by dask_ml's make_classification with the same arguments.

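import dask.dataframe as dd
import dask_ml.datasets as dask_datasets
import sklearn.datasets as sk_datasets


def make_dataset_and_compute_correlation(func, **kwargs):
    # Generate the dataset; only the feature matrix X is used below.
    X, y = func(**kwargs)
    # Wrap the features in a dask DataFrame so that .corr() is available.
    ddf_features = dd.from_array(X)
    # Pairwise correlation between the feature columns, computed eagerly.
    corr = ddf_features.corr().compute()
    return corr


dask_corr = make_dataset_and_compute_correlation(dask_datasets.make_classification,
                                                 n_samples=10000, n_informative=12,
                                                 n_redundant=18, n_features=30,
                                                 chunks=100)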

[Image: feature correlation matrix from dask_ml.datasets.make_classification]

vs

sk_corr = make_dataset_and_compute_correlation(sk_datasets.make_classification, 
                                               n_samples=10000, n_informative=12, 
                                               n_redundant=18, n_features=30)

[Image: feature correlation matrix from sklearn.datasets.make_classification]

Dependencies

# Name Version Build Channel
appdirs 1.4.3 py36h28b3542_0
appnope 0.1.0 py36hf537a9a_0
asn1crypto 0.24.0 py36_0
attrs 18.1.0 py36_0
automat 0.7.0 py36_0
backcall 0.1.0 py36_0
blas 1.0 mkl
bleach 2.1.3 py36_0
bokeh 0.13.0 py36_0
ca-certificates 2018.03.07 0
cairo 1.14.12 hc4e6be7_4
certifi 2018.4.16 py36_0
cffi 1.11.5 py36h342bebf_0
chardet 3.0.4
click 6.7 py36hec950be_0
cloudpickle 0.5.3 py36_0
constantly 15.1.0 py36h28b3542_0
cryptography 2.2.2 py36h1de35cc_0
cycler 0.10.0 py36hfc81398_0
cytoolz 0.9.0.1 py36h1de35cc_1
dask 0.18.2 py36_0
dask-core 0.18.2 py36_0
dask-glm 0.1.0 py36_0
dask-ml 0.7.0 py36h1de35cc_0
dask-searchcv 0.2.0 py36_0
decorator 4.3.0 py36_0
distributed 1.22.0 py36_0
entrypoints 0.2.3 py36_2
expat 2.2.5 hb8e80ba_0
fontconfig 2.13.0 h5d5b041_1
freetype 2.9.1 hb4e5f40_0
fribidi 1.0.4 h1de35cc_0
gettext 0.19.8.1 h15daf44_3
glib 2.56.1 h35bc53a_0
graphite2 1.3.11 h2098e52_2
graphviz 2.40.1 hefbbd9a_2
harfbuzz 1.7.6 hb8d4a28_3
heapdict 1.0.0 py36_2
html5lib 1.0.1 py36_0
hyperlink 18.0.0 py36_0
icu 58.2 h4b95b61_1
idna 2.7 py36_0
incremental 17.5.0 py36_0
intel-openmp 2018.0.3 0
ipykernel 4.8.2 py36_0
ipython 6.4.0 py36_1
ipython_genutils 0.2.0 py36h241746c_0
ipywidgets 7.3.0 py36_0
jedi 0.12.1 py36_0
jinja2 2.10 py36_0
jpeg 9b he5867d9_2
jsonschema 2.6.0 py36hb385e00_0
jupyter_client 5.2.3 py36_0
jupyter_core 4.4.0 py36_0
jupyterlab 0.32.1 py36_0
jupyterlab_launcher 0.10.5 py36_0
kiwisolver 1.0.1 py36h0a44026_0
libcxx 4.0.1 h579ed51_0
libcxxabi 4.0.1 hebd6815_0
libedit 3.1.20170329 hb402a30_2
libffi 3.2.1 h475c297_4
libgfortran 3.0.1 h93005f0_2
libiconv 1.15 hdd342a3_7
libpng 1.6.34 he12f830_0
libsodium 1.0.16 h3efe00b_0
libtiff 4.0.9 hcb84e12_1
libxml2 2.9.8 hab757c2_1
locket 0.2.0 py36hca03003_1
markupsafe 1.0 py36h1de35cc_1
matplotlib 2.2.2 py36hbf02d85_2
mistune 0.8.3 py36h1de35cc_1
mkl 2018.0.3 1
mkl_fft 1.0.2 py36h6b9c3cc_0
mkl_random 1.0.1 py36h5d10147_1
msgpack-python 0.5.6 py36h04f5b5a_0
multipledispatch 0.5.0 py36_0
nbconvert 5.3.1 py36_0
nbformat 4.4.0 py36h827af21_0
ncurses 6.1 h0a44026_0
networkx 2.1 py36_0
notebook 5.6.0 py36_0
numpy 1.14.5 py36h648b28d_4
numpy-base 1.14.5 py36ha9ae307_4
openssl 1.0.2o h26aff7b_0
packaging 17.1 py36_0
pandas 0.23.3 py36h6440ff4_0
pandoc 2.2.1 h1a437c5_0
pandocfilters 1.4.2 py36_1
pango 1.42.1 he2d0c7e_2
parso 0.3.1 py36_0
partd 0.3.8 py36hf5c4cb8_0
pcre 8.42 h378b8a2_0
pexpect 4.6.0 py36_0
pickleshare 0.7.4 py36hf512f8e_0
pip 10.0.1 py36_0
pixman 0.34.0 hca0a616_3
plotly 3.0.0rc11
prometheus_client 0.2.0 py36_0
prompt_toolkit 1.0.15 py36haeda067_0
psutil 5.4.6 py36h1de35cc_0
ptyprocess 0.6.0 py36_0
pyasn1 0.4.3 py36_0
pyasn1-modules 0.2.2 py36_0
pycparser 2.18 py36_1
pygments 2.2.0 py36h240cd3f_0
pygraphviz 1.3 py36h1de35cc_1
pyopenssl 18.0.0 py36_0
pyparsing 2.2.0 py36_1
python 3.6.6 hc167b69_0
python-dateutil 2.7.3 py36_0
python.app 2 py36_8
pytz 2018.5 py36_0
pyyaml 3.13 py36h1de35cc_0
pyzmq 17.0.0 py36h1de35cc_3
readline 7.0 hc1231fa_4
requests 2.19.1
retrying 1.3.3
scikit-learn 0.19.1 py36hf9f1f73_0
scipy 1.1.0 py36hf1f7d93_0
send2trash 1.5.0 py36_0
service_identity 17.0.0 py36h28b3542_0
setuptools 39.2.0 py36_0
simplegeneric 0.8.1 py36_2
six 1.11.0 py36_1
sortedcontainers 2.0.4 py36_0
sqlite 3.24.0 ha441bb4_0
tblib 1.3.2 py36hda67792_0
terminado 0.8.1 py36_1
testpath 0.3.1 py36h625a49b_0
tk 8.6.7 h35a86e2_3
toolz 0.9.0 py36_0
tornado 5.0.2 py36h1de35cc_0
traitlets 4.3.2 py36h65bd3ce_0
twisted 17.5.0 py36_0
urllib3 1.23
wcwidth 0.1.7 py36h8c6ec74_0
webencodings 0.5.1 py36_1
wheel 0.31.1 py36_0
widgetsnbextension 3.3.0 py36_0
xz 5.2.4 h1de35cc_4
yaml 0.1.7 hc338f04_2
zeromq 4.2.5 h0a44026_0
zict 0.1.3 py36_0
zlib 1.2.11 hf3cbc9b_2
zope 1.0 py36_0
zope.interface 4.5.0 py36h1de35cc_0
js3711 commented 6 years ago

Binder

TomAugspurger commented 6 years ago

Currently n_redundant, n_repeated, n_clusters_per_class, weights, flip_y, class_sep, hypercube, shift, and shuffle have no effect in dask_ml.datasets.make_classification.

Would welcome any improvements here (docs or data generation).

TomAugspurger commented 6 years ago

I think our (unstated) policy is to match the signature, but raise if the keyword is specified but not implemented. So a PR checking for that would also be most welcome.
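
For illustration, the "raise if specified but not implemented" check could look roughly like the sketch below. It is not dask-ml code: `check_unsupported` and `_is_default` are hypothetical helpers, the keyword list is the one given earlier in this thread, and the defaults are assumed to match scikit-learn's documented `make_classification` defaults.

```python
# Keywords reported in this thread as having no effect, paired with the
# defaults assumed from scikit-learn's make_classification signature.
_UNSUPPORTED_DEFAULTS = {
    "n_redundant": 2,
    "n_repeated": 0,
    "n_clusters_per_class": 2,
    "weights": None,
    "flip_y": 0.01,
    "class_sep": 1.0,
    "hypercube": True,
    "shift": 0.0,
    "shuffle": True,
}


def _is_default(value, default):
    """Return True only when `value` is plainly the default."""
    if default is None:
        return value is None
    # Restrict equality checks to plain scalars so that lists or arrays
    # (e.g. for `weights` or `shift`) always count as "changed".
    return isinstance(value, (bool, int, float)) and value == default


def check_unsupported(**kwargs):
    """Hypothetical helper: raise if an unimplemented keyword was changed."""
    changed = sorted(
        name
        for name, value in kwargs.items()
        if name in _UNSUPPORTED_DEFAULTS
        and not _is_default(value, _UNSUPPORTED_DEFAULTS[name])
    )
    if changed:
        raise NotImplementedError(
            "These keywords are accepted but not implemented: "
            + ", ".join(changed)
        )
```

Called with the arguments from the original report, `check_unsupported(n_redundant=18)` would raise, while leaving `n_redundant` at its default would pass silently.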

js3711 commented 6 years ago

Thanks Tom, I can work on that later.

mrocklin commented 6 years ago

It seems reasonable to me to not match the signature if we genuinely don't support the keyword.

On Mon, Jul 30, 2018 at 7:27 AM, js3711 notifications@github.com wrote:

Thanks Tom, I can work on that later.


mrocklin commented 6 years ago

Scikit-Learn functions often have many keywords. Raising informative NotImplementedErrors for all of them sounds potentially painful.
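
One way the per-keyword bookkeeping might be reduced is a small decorator that inspects the call, sketched below. This is only an assumption about how it could be done, not something dask-ml provides: `reject_unsupported` and `make_data` are hypothetical names, and the check is deliberately strict in that it raises whenever a listed keyword is passed explicitly, even with its default value.

```python
import functools
import inspect


def reject_unsupported(*names):
    """Hypothetical decorator: raise if any of `names` is passed explicitly."""

    def decorator(func):
        sig = inspect.signature(func)
        unknown = [n for n in names if n not in sig.parameters]
        if unknown:
            raise TypeError("not parameters of %s: %s" % (func.__name__, unknown))

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # bind() records only the arguments the caller actually supplied,
            # whether positionally or by keyword.
            bound = sig.bind(*args, **kwargs)
            passed = [n for n in names if n in bound.arguments]
            if passed:
                raise NotImplementedError(
                    "%s does not implement: %s"
                    % (func.__name__, ", ".join(passed))
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Stand-in function with a scikit-learn-like signature (illustrative only).
@reject_unsupported("n_redundant", "flip_y")
def make_data(n_samples=100, n_features=20, n_redundant=2, flip_y=0.01):
    return (n_samples, n_features)
```

With this in place, the signature still matches and the list of unsupported names lives in one line per function, at the cost of also rejecting explicitly passed defaults.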


mmccarty commented 4 years ago

@TomAugspurger @mrocklin Just a heads up that I'm working on closing the gaps here.