Avoid compute in datasets

dask / dask-ml

Scalable Machine Learning with Dask

http://ml.dask.org

BSD 3-Clause "New" or "Revised" License

Avoid compute in datasets #265

Open TomAugspurger opened 6 years ago

TomAugspurger commented 6 years ago

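The dask.compute call at https://github.com/dask/dask-ml/blob/d5801584d092d8f13f1b38aaf4da5dc3caa6a213/dask_ml/datasets.py#L332 isn't great, since it forces an eager computation while the dataset is being built. That hurts especially in settings like Hyperband (#221) that use the distributed scheduler.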

We could probably replace

    rng = dask_ml.utils.check_random_state(random_state)

with

    rng = sklearn.utils.check_random_state(random_state)

and use it to draw informative_idx and the random data that seeds the dask.array.RandomState that is eventually used to generate the large random data.
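A rough sketch of that idea, for illustration only. The function name make_sparse_regression, its parameters, and the coefficient scaling are hypothetical and are not dask-ml's actual implementation. It assumes sklearn.utils.check_random_state returns a NumPy RandomState, so the small draws are concrete immediately and no dask.compute is needed.

```python
import dask.array as da
import numpy as np
from sklearn.utils import check_random_state


def make_sparse_regression(n_samples=1000, n_features=20, n_informative=5,
                           chunks=250, random_state=None):
    """Hypothetical helper for illustration; not dask-ml's make_regression."""
    # NumPy RandomState: the draws below are eager but tiny,
    # so nothing has to go through dask.compute.
    rng = check_random_state(random_state)

    # Small, concrete draw: which columns carry signal.
    informative_idx = rng.choice(n_features, size=n_informative, replace=False)
    beta = np.zeros(n_features)
    beta[informative_idx] = 100 * rng.uniform(size=n_informative)

    # Draw an integer seed for the lazy dask generator
    # that produces the large random array.
    seed = rng.randint(0, 2**31 - 1)
    drng = da.random.RandomState(seed)

    # These stay lazy: only a task graph is built here.
    X = drng.normal(size=(n_samples, n_features), chunks=(chunks, n_features))
    y = X.dot(beta)
    return X, y, beta


X, y, beta = make_sparse_regression(random_state=0)
```

Here X and y remain lazy dask arrays until something downstream computes them, while beta is an ordinary NumPy array. Reusing the same random_state should reproduce the same data, because the seed passed to dask comes from the NumPy generator.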

dma092 commented 5 years ago

Is this still open? I used dask almost a year ago and I would like to contribute.

TomAugspurger commented 5 years ago

Yes, I think so. There may be a draw_seed in utils that may help.

On May 25, 2019, at 03:50, dma092 notifications@github.com wrote:

Is this still open?

dma092 commented 5 years ago

Can I work on it?

TomAugspurger commented 5 years ago

That would be great. The docs have contributing guidelines.

On May 26, 2019, at 14:26, dma092 notifications@github.com wrote:

Can I work on it?

stsievert commented 5 years ago

The docs have contributing guidelines.

https://ml.dask.org/contributing.html

dma092 commented 4 years ago

It seems to me that all the tests in test_datasets.py pass after just commenting out informative_idx, beta = dask.compute(informative_idx, beta). What do you think?

TomAugspurger commented 4 years ago

If all the tests pass then that should be fine.

dma092 commented 4 years ago

If all the tests pass then that should be fine.

Are you talking about only the tests in test_datasets.py?

TomAugspurger commented 4 years ago

I meant the entire test suite, since other tests use it.