Open — mrocklin opened this issue 6 years ago
It looks like there is an example in the documentation here: http://dask-ml.readthedocs.io/en/latest/examples/xgboost.html
It's nice in many respects (real data, easily interpretable problem, ...)
However, a couple of things about it are concerning.
Alternatively, there might be some artificial dataset that we could create more easily instead.
I certainly think this is a good example to keep, and maybe we should implement a new example in dask-examples. This one works well as a static example – it shows an interesting problem that's harder to scale.
I think if we implement a new example for dask-examples, we should use a synthetic dataset. For me, the biggest annoyance is the time it takes to process the real dataset (at least one minute, often two).
I've opened a PR at https://github.com/dask/dask-examples/pull/14 that mirrors the dask-ml documentation example but is quicker to run because it uses synthetic data.
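For reference, the synthetic-data approach can be sketched with core `dask.array` alone; the sizes and chunking below are illustrative assumptions, not the values the PR uses (the actual example relies on dask-ml's dataset helpers):

```python
import dask.array as da

# Illustrative sizes: small enough to build in seconds, but still chunked
# the way a larger-than-memory dataset would be.
n_samples, n_features = 100_000, 20

# Random features, partitioned into blocks of 10k rows.
X = da.random.random((n_samples, n_features), chunks=(10_000, n_features))

# Random binary labels with matching chunking.
y = (da.random.random(n_samples, chunks=10_000) > 0.5).astype(int)

print(X.shape, len(X.chunks[0]))  # overall shape and number of row-blocks
```

Because the data is generated lazily on the workers, nothing has to be downloaded or parsed, which is what makes the notebook quick to run.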
This is closed by https://github.com/dask/dask-examples/pull/14, correct?
Hello everyone, I'm Yash. I have experience in machine learning and web development, but I'm new to open source and have never contributed before. Could anyone give me advice on how to start my first contribution?
It would be nice to see an example using the Dask/XGBoost handoff for parallel training and prediction. This is a common question, so such an example would likely have high value.
It would also be useful for this to be smoothly runnable on dask-examples. Presumably we'll have to use a few processes within a LocalCluster and be careful not to blow out RAM on the small containers (XGBoost can be a bit greedy).
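A minimal sketch of that cluster setup, assuming the constraints above; the worker count and memory limit are illustrative guesses for a small container, and the XGBoost handoff itself is left as a commented-out placeholder since the exact training call depends on the dask-ml/xgboost versions used:

```python
from dask.distributed import Client, LocalCluster

# A few worker processes (not threads), each with a hard memory cap, so a
# memory-hungry XGBoost worker can't take down the whole container.
cluster = LocalCluster(
    n_workers=2,            # "a few processes", as suggested above
    threads_per_worker=1,
    processes=True,         # separate processes isolate XGBoost's memory use
    memory_limit="1GB",     # illustrative cap per worker
)
client = Client(cluster)

# Record how many workers actually registered with the scheduler.
n_workers = len(client.scheduler_info()["workers"])
print(f"{n_workers} workers running")

# The handoff would happen here, e.g. via dask-ml's XGBoost integration:
# import dask_ml.xgboost
# bst = dask_ml.xgboost.train(client, params, X, y)

client.close()
cluster.close()
```

Keeping `threads_per_worker=1` also avoids oversubscribing the container's CPUs once XGBoost spins up its own threads.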