Closed alexeyegorov closed 6 years ago
Yes please.
@alexeyegorov have you looked at this at all? If not, I'll make a branch this weekend.
No, go ahead.
@christopherjenness I did start it, but had some other issues and did not follow up on it. Feel free to work on it! 👍
Profiling branch: https://github.com/christopherjenness/DBCV/tree/profiling
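For reference, a minimal sketch of how a profiling run like this could be set up with cProfile; the actual profiling code lives in the branch above, and `run_dbcv` is only a hypothetical stand-in for the metric computation:

```python
# Hypothetical profiling harness: run_dbcv stands in for whatever function
# computes the DBCV metric on a sample dataset in the profiling branch.
import cProfile
import pstats

def run_dbcv():
    pass  # call the DBCV computation on a sample of ~300 points here

cProfile.run('run_dbcv()', 'dbcv.prof')
pstats.Stats('dbcv.prof').sort_stats('cumulative').print_stats(20)
```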
Looking at where we're getting jammed up: Because of the way I looped through data, the current program makes 2*n^2 calls to the distance function. We could easily cut the runtime in half by only making n^2 calls. But do we even need to calculate all pairwise distances? If not, we could save a lot of time. If so, I bet something like https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html would really speed up the program...
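A rough sketch of what the pdist route could look like, assuming the full pairwise distance matrix can simply be precomputed up front; `X` and the euclidean metric here are placeholders rather than the repo's actual code:

```python
# Sketch: compute every pairwise distance exactly once with pdist, then expand
# the condensed result into the full symmetric n x n matrix with squareform.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(312, 2)  # placeholder data, e.g. 312 points in 2D

condensed = pdist(X, metric='euclidean')   # n*(n-1)/2 distances, each pair once
dist_matrix = squareform(condensed)        # full n x n matrix, zeros on the diagonal
```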
dec6be2c9802d305625c21f75086c2d0ccef8d6b anecdotally speeds it up dramatically, but I will profile it and make sure it's still calculating the metric as expected when I get a chance.
This update gave a 53x speed improvement on a sample of 300 points (692 seconds down to 13 seconds).
New profiling results: 91307a3d142349f2fa871f574dfe7b83485d64ca
And the scores look good.
I just wanted to calculate DBCV for my HDBSCAN result (312 points), and it is now taking forever. Looking into the code, it seems that it may be rather simple to parallelize e.g. the computation of the mutual reachability graph, since so far that takes the most time... I might fork and make a pull request then.
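A very rough sketch of the kind of parallelization suggested here, splitting per-point work across processes with joblib; `mutual_reachability_row` is a hypothetical placeholder, not the actual implementation:

```python
# Hypothetical sketch: parallelize row-by-row work across processes with joblib.
# mutual_reachability_row is a stand-in for the real per-point computation.
import numpy as np
from joblib import Parallel, delayed

def mutual_reachability_row(i, X):
    # Placeholder: here, just the plain distances from point i to all points.
    return np.linalg.norm(X - X[i], axis=1)

X = np.random.rand(312, 2)  # e.g. a 312-point dataset
rows = Parallel(n_jobs=-1)(
    delayed(mutual_reachability_row)(i, X) for i in range(len(X))
)
graph = np.vstack(rows)  # (n, n) matrix assembled from the per-row results
```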