christopherjenness / DBCV

Python implementation of Density-Based Clustering Validation
MIT License
154 stars 41 forks source link

Parallelization for speed improvement #6

Closed alexeyegorov closed 6 years ago

alexeyegorov commented 6 years ago

I just wanted to try to calculate DBCV for my HDBSCAN result (312 points) and this takes me now forever. As I look into the code, it seems that it may be rather simple to parallelize e.g. the computation of mutual reachability graph as it takes so far the most time... I might fork and make a pull-request then.

christopherjenness commented 6 years ago

Yes please.

On Wed, Jun 13, 2018, 12:21 PM Alexey notifications@github.com wrote:

I just wanted to try to calculate DBCV for my HDBSCAN result (312 points) and this takes me now forever. As I look into the code, it seems that it may be rather simple to parallelize e.g. the computation of mutual reachability graph as it takes so far the most time... I might fork and make a pull-request then.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/christopherjenness/DBCV/issues/6, or mute the thread https://github.com/notifications/unsubscribe-auth/AKfcvhtgA3KSYz8TIZqTYEsrbAWv-OPnks5t8TwZgaJpZM4UmhVY .

christopherjenness commented 6 years ago

@alexeyegorov have you looked at this at all? If not, I'll make a branch this weekend.

JoaoSimoes94 commented 6 years ago

no, ... go ahead

2018-08-07 12:20 GMT+01:00 Christopher Jenness notifications@github.com:

@alexeyegorov https://github.com/alexeyegorov have you looked at this at all? If not, I'll make a branch this weekend.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/christopherjenness/DBCV/issues/6#issuecomment-411022787, or mute the thread https://github.com/notifications/unsubscribe-auth/AQbJeDjD4qvXUN-P2on5WUM2vwXJOBx7ks5uOXfigaJpZM4UmhVY .

alexeyegorov commented 6 years ago

@christopherjenness I did start it, but had some other issues and did not follow up on it. Feel free to work on it! 👍

christopherjenness commented 6 years ago

Profiling branch: https://github.com/christopherjenness/DBCV/tree/profiling

christopherjenness commented 6 years ago

Looking at where we're getting jammed up: Because of the way I looped through data, the current program makes 2*n^2 calls to the distance function. We could easily cut the runtime in half by only making n^2 calls. But do we even need to calculate all pairwise distances? If not, we could save a lot of time. If so, I bet something like https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html would really speed up the program...

christopherjenness commented 6 years ago

dec6be2c9802d305625c21f75086c2d0ccef8d6b anecdotally speeds it up dramatically, but I will profile it and make sure it's still calculating the metric as expected when I get a chance.

alexeyegorov commented 6 years ago

This sounds really good and your thoughts do make sense. Probably a kind of test of same result would be the way. Thanks that you are taking care of it. :)

Alexey Egorov Wissenschaftlicher Mitarbeiter

Technische Universität Dortmund Fakultät für Informatik, LS 8 Otto-Hahn-Straße 12 44227 Dortmund

alexey.egorov@tu-dortmund.de www-ai.cs.uni-dortmund.de

Am So., 12. Aug. 2018 um 15:40 Uhr schrieb Christopher Jenness < notifications@github.com>:

dec6be2 https://github.com/christopherjenness/DBCV/commit/dec6be2c9802d305625c21f75086c2d0ccef8d6b anecdotally speeds it up dramatically, but I will profile it and make sure it's still calculating the metric as expected when I get a chance.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/christopherjenness/DBCV/issues/6#issuecomment-412343613, or mute the thread https://github.com/notifications/unsubscribe-auth/AE3c-8bGj5RmRWxgdrnAMH4lW89V4V51ks5uQDAvgaJpZM4UmhVY .

christopherjenness commented 6 years ago

This update gave a 53x speed improvement on a sample of 300 points (692 seconds down to 13 seconds)

New profiling results: 91307a3d142349f2fa871f574dfe7b83485d64ca

christopherjenness commented 6 years ago

And, scores look good.