dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Dask-xgboost example for dask-examples #232

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

It would be nice to see an example using the Dask/XGBoost handoff for parallel training and predicting. This is a common question and so would likely have high value.

It would also be useful for this to be smoothly runnable on dask-examples. Presumably we'll have to use a few processes within a LocalCluster and be careful not to blow out RAM on the small containers (XGBoost can be a bit greedy).
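
As a rough sketch of the "few processes, careful with RAM" idea, something like the following could work. The helper names (`per_worker_memory`, `start_small_cluster`), the 2 GiB container size, and the 25% headroom are all assumptions for illustration, not values taken from dask-examples. `memory_limit` is enforced by Dask's own worker memory monitoring, so treat it as a guard rail rather than a hard cap on whatever XGBoost allocates natively.

```python
def per_worker_memory(container_bytes, n_workers, headroom=0.25):
    """Split a container's RAM across workers, reserving a fraction as headroom."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = int(container_bytes * (1 - headroom))
    return usable // n_workers


def start_small_cluster(container_bytes=2 * 2**30, n_workers=2):
    """Start a process-based LocalCluster with a per-worker memory cap.

    The 2 GiB default is a hypothetical container size, not a measured one.
    """
    # Imported lazily so the helper above can be used without dask installed.
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=1,  # one thread per process keeps memory use predictable
        processes=True,
        memory_limit=per_worker_memory(container_bytes, n_workers),
    )
    return Client(cluster)
```

Calling `client = start_small_cluster()` at the top of a notebook would then give two single-threaded worker processes, each limited to roughly 768 MiB.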

mrocklin commented 6 years ago

It looks like there is an example in the documentation here: http://dask-ml.readthedocs.io/en/latest/examples/xgboost.html

It's nice in many respects (real data, easily interpretable problem, ...)

However, a couple of things are concerning about it:

  1. Hard to scale down for users to try things out easily
  2. The ROC curve at the end is not very exciting. I wonder if there is better pre-processing that could be done if we choose to continue with this dataset

Alternatively there might be some artificial dataset that we can create more easily instead.

stsievert commented 6 years ago

> It looks like there is an example in the documentation here: http://dask-ml.readthedocs.io/en/latest/examples/xgboost.html

I certainly think this is a good example to keep, and maybe we could implement a new example in dask-examples. It works well as a static example: it shows an interesting problem that's harder to scale.

I think if we implement a new example for dask-examples, we should use a synthetic dataset. For me the biggest annoyance is the time it takes to process the dataset (at least a minute, often two minutes).
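
For illustration, a synthetic dataset along these lines could be generated lazily with `dask.array`. This is only a sketch under assumed sizes: `rows_per_chunk`, `make_synthetic`, the 16 MiB chunk target, and the linear labelling rule are hypothetical choices, and the labels are noise-free, so the resulting problem is much easier than real data.

```python
def rows_per_chunk(n_features, target_chunk_bytes=16 * 2**20, itemsize=8):
    """Number of float64 rows that fit in roughly target_chunk_bytes."""
    return max(1, target_chunk_bytes // (n_features * itemsize))


def make_synthetic(n_samples=100_000, n_features=20, seed=0):
    """Build a lazily evaluated binary classification problem as dask arrays."""
    # Imported lazily so the chunk-size helper works without dask installed.
    import dask.array as da

    rows = min(n_samples, rows_per_chunk(n_features))
    rs = da.random.RandomState(seed)
    X = rs.normal(size=(n_samples, n_features), chunks=(rows, n_features))
    # Labels follow a simple linear rule on the first two features (an assumption).
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype("int64")
    return X, y
```

Nothing is computed until the arrays are used, so the cell itself returns almost immediately; the data is generated on the workers when training starts.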

stsievert commented 6 years ago

I've opened a PR at https://github.com/dask/dask-examples/pull/14 that mirrors the dask-ml documentation example but is quicker to run because it uses synthetic data.

stsievert commented 6 years ago

This is closed by https://github.com/dask/dask-examples/pull/14, correct?

yash-dewasthale commented 2 years ago

Hello everyone, I'm Yash. I have experience in machine learning and web development, and I'm new to open source. I have never contributed before. Could anyone give me advice on how to start my first contribution?