Is it possible to run DBSCAN in distributed memory?

arborx / ArborX

Performance-portable geometric search library

BSD 3-Clause "New" or "Revised" License

181 stars 38 forks source link

Is it possible to run DBSCAN in distributed memory? #1149

Closed lykos98 closed 3 weeks ago

lykos98 commented 1 month ago

Hi! I found this repository by reading the newly published paper on the advancements of the library. The paper mentions that you were able to run dbscan on 2*10^12 points, but in the documentation it is not clear how and if the actual implementation of dbscan works in distributed memory.

Can you provide an example of the algorithm applied also using mpi when the data is scattered between different processes? Thank you! F

aprokop commented 1 month ago

Hi @lykos98. In the paper, the application (HACC) was responsible for the distributed memory. The way it is organized is that it construct local domains with some halos that guarantees that the clusters will be fully contained in the local data. That is what ArborX' algorithms were run on. We do not have a distributed DBSCAN implementation at the moment.

lykos98 commented 3 weeks ago

Hi thank you for the quick answer! I understand implementing it in a true distributed fashion is quite difficult, nevertheless the performance reported in the paper is impressive. Congratulations!

aprokop commented 3 weeks ago

@lykos98 I'm not sure it's that difficult. In the back of my mind, I always thought that if we needed to implement it, we could combine our local algorithm with a distributed part from Patwary et al paper.

lykos98 commented 2 weeks ago

I'm sorry for not getting back to you sooner. I missed the notification. Thank you for the reference! I had already encountered that paper a while ago, I did not mean actually in the previous comment "difficult" in the algorithmic sense but in the implementation side which needs a non-negligible amount of effort to make it work. I'll keep an eye on this repository, thank you again!