dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

(H)DBSCAN #158

Open TomAugspurger opened 6 years ago

TomAugspurger commented 6 years ago

Some interest in a parallel / distributed version of this

Haven't looked to see if it's parallelizable.

lmcinnes commented 6 years ago

I believe an approximate version of this is parallelizable. Some of the relevant techniques are described in this paper: http://www.sysml.cc/doc/105.pdf

This is certainly of interest to me -- I had been planning something similar before the paper came out and had already coded up a Python version of NNDescent. That version is not parallel, but making it so would be interesting to try. One catch that comes to mind is that the implementation currently uses Numba instead of Cython for performance. Do the maintainers here have any strong feelings on Numba dependencies?
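
For illustration only (this is not pynndescent code): the kind of kernel under discussion is a tight numeric loop that Numba can JIT-compile straight from Python, which is what would make it a lighter dependency than a Cython build step.

import numpy as np
from numba import njit

@njit
def euclidean(a, b):
    # A tight distance loop of the sort NN-Descent runs in its inner loop;
    # @njit compiles it to machine code on first call, with no C compiler
    # or build step required.
    total = 0.0
    for i in range(a.shape[0]):
        diff = a[i] - b[i]
        total += diff * diff
    return np.sqrt(total)

d = euclidean(np.random.random(100), np.random.random(100))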

TomAugspurger commented 6 years ago

Great!

I’m happy to accept numba as a dependency (I have a branch moving our Cython k-means to numba; using just numba would simplify our build and release process).

lmcinnes commented 6 years ago

I'll see what I can put together then. I can't make any promises on getting it done soon, but I will definitely put it in the queue of things to work on.

TomAugspurger commented 6 years ago

Skimming the paper at http://www.sysml.cc/doc/105.pdf

It would be great to make https://github.com/lmcinnes/pynndescent dask-friendly and include it in dask-ml. If possible, I'd like to re-use it for a top-level KNeighborsClassifier/Regressor API, and inside SpectralClustering (as an alternative to K-Means).
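
As a rough, purely illustrative sketch of what "dask-friendly" could mean here (the wrapper below is hypothetical, not an existing dask-ml API, and it assumes pynndescent's NNDescent index with a query method): fit an index on a sample that fits in memory, then query each chunk of a dask array independently with map_blocks.

import numpy as np
import dask.array as da
from pynndescent import NNDescent  # assumed interface

x = da.random.random((1000000, 100), chunks=(10000, 100))

# Hypothetical strategy: build the index on an in-memory sample ...
index = NNDescent(x[:50000].compute())

def query_block(block, k=15):
    # ... then query the shared index chunk by chunk; returns neighbor indices.
    neighbors, _ = index.query(block, k=k)
    return neighbors

# Each 10000x100 block maps to a 10000x15 block of neighbor indices.
neighbor_idx = x.map_blocks(query_block, k=15, dtype=np.int64,
                            chunks=(10000, 15))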

mrocklin commented 5 years ago

I chatted with @lmcinnes about this. It seems like this is slightly blocked on improved slicing of dask arrays with out-of-order indices. For example, the following was inefficient:

In [1]: import dask.array as da

In [2]: x = da.random.random((1000000, 100), chunks=(10000, 100))

In [3]: import numpy as np

In [4]: index = np.arange(len(x))

In [5]: np.random.shuffle(index)

In [6]: %time y = x[index]
/home/mrocklin/Software/anaconda/bin/ipython:1: PerformanceWarning: Slicing with an out-of-order index is generating 9900 times more chunks
  #!/home/mrocklin/Software/anaconda/bin/python
CPU times: user 4.48 s, sys: 188 ms, total: 4.67 s
Wall time: 4.68 s

In [7]: y.numblocks
Out[7]: (989999, 1)

jakirkham commented 5 years ago

So would solving issue (https://github.com/dask/dask/issues/3409) address this?

mrocklin commented 5 years ago

Possibly. It might also be that @lmcinnes would want to do more targeted indexing than an entirely random shuffle.

mrocklin commented 5 years ago

Although it may also be that we can improve things today by indexing directly many times, as long as we make sure that the index is sorted.

In [1]: import dask.array as da

In [2]: x = da.random.random((1000000, 100), chunks=(10000, 100))

In [3]: import numpy as np

In [4]: index = np.random.randint(0, len(x), size=10000)

In [5]: %time x[index].numblocks
/home/mrocklin/Software/anaconda/bin/ipython:1: PerformanceWarning: Slicing with an out-of-order index is generating 100 times more chunks
  #!/home/mrocklin/Software/anaconda/bin/python
CPU times: user 73.6 ms, sys: 3.8 ms, total: 77.4 ms
Wall time: 76.8 ms
Out[5]: (9907, 1)

In [6]: index.sort()

In [7]: %time x[index].numblocks
CPU times: user 4.74 ms, sys: 47 µs, total: 4.78 ms
Wall time: 4.15 ms
Out[7]: (100, 1)
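
One way a caller could exploit that today (a sketch, assuming the selected rows are small enough to hold in memory once computed): slice with a sorted copy of the index, then undo the permutation on the result.

import numpy as np
import dask.array as da

x = da.random.random((1000000, 100), chunks=(10000, 100))
index = np.random.randint(0, len(x), size=10000)

# Slice with the index in sorted order to keep the chunk count low ...
order = np.argsort(index)
y_sorted = x[index[order]]

# ... then invert the permutation in memory to recover the requested order.
inverse = np.empty_like(order)
inverse[order] = np.arange(len(order))
y = y_sorted.compute()[inverse]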

mrocklin commented 5 years ago

I've laid out a plan in https://github.com/dask/dask/issues/3409#issuecomment-405254656 to improve things a bit. That should be accessible for others to contribute if they're interested.

TomAugspurger commented 5 years ago

We've started to make progress on this on the Dask side (https://github.com/dask/dask/pull/3808, for slicing performance with a shuffled array).

@lmcinnes do you have a rough sketch of the next steps for a parallel / distributed pynndescent? From a brief examination, it seems like the random projections would be the first thing? I'm interested in working on this a bit.
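
For context, here is a minimal sketch of the random-projection split that an RP tree/forest is built from (a generic illustration, not pynndescent's actual code); building many such trees independently is the part that looks naturally parallel or distributable.

import numpy as np

def random_projection_split(data, indices, rng):
    # Pick two random anchor points and split the remaining points by which
    # anchor they are closer to, i.e. by the sign of a projection onto the
    # hyperplane bisecting the pair.
    left, right = rng.choice(indices, size=2, replace=False)
    hyperplane = data[left] - data[right]
    offset = -hyperplane.dot((data[left] + data[right]) / 2.0)
    side = data[indices].dot(hyperplane) + offset > 0
    return indices[side], indices[~side]

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 100))
left_idx, right_idx = random_projection_split(data, np.arange(len(data)), rng)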