Open UnixJunkie opened 4 years ago
Thanks for the comments. I'm well aware of the Butina algorithm; I use it all the time. Unfortunately, that algorithm scales as n-squared with the number of molecules, so it's not great for large datasets. I haven't tried the other algorithm, but I'll check it out.
Pat
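For readers following the thread, here is a minimal sketch of where the n-squared cost comes from in a Butina-style clustering. It is not the RDKit implementation: the function names are hypothetical, and fingerprints are assumed to be plain Python sets of on-bit indices.

```python
from itertools import combinations


def tanimoto_distance(a, b):
    """Tanimoto (Jaccard) distance between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def butina_sketch(fps, cutoff):
    """Butina-style clustering (hypothetical helper, simplified for illustration)."""
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    # All-pairs pass: n * (n - 1) / 2 distance evaluations.
    # This loop is the quadratic step.
    for i, j in combinations(range(n), 2):
        if tanimoto_distance(fps[i], fps[j]) <= cutoff:
            neighbors[i].add(j)
            neighbors[j].add(i)
    # Visit molecules from most to fewest neighbors; each unassigned one
    # becomes a centroid and claims its still-unassigned neighbors.
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(neighbors[i] - assigned)
        assigned.update(members)
        clusters.append(members)  # first member is the centroid
    return clusters
```

The first element of each returned cluster is its centroid, which is what makes this kind of procedure convenient for picking representatives.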
The authors say that DBSCAN has O(n * log(n)) complexity, so it does not seem to require the full (all-to-all) distance matrix. However, you will need a data structure to run region queries (like a mu-tree, a bisector tree, or a vantage-point tree).
Running DBSCAN for the overall clustering, then Butina to enumerate representatives of each cluster of interest, could be quite interesting.
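To make the region-query idea concrete, here is a rough sketch of DBSCAN driven by a vantage-point tree instead of a full distance matrix. Everything in it is an assumption for illustration: the `VPTree` class and `dbscan_sketch` function are hypothetical names, fingerprints are plain Python sets of on-bit indices, and the distance is Tanimoto (Jaccard), which is a metric, so triangle-inequality pruning is valid. How much pruning this achieves on real fingerprints is not something the sketch establishes.

```python
import random


def tanimoto_distance(a, b):
    """Tanimoto (Jaccard) distance between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


class VPTree:
    """Vantage-point tree over item indices (hypothetical helper, recursive build)."""

    def __init__(self, items, dist, rng=None):
        self.items = items
        self.dist = dist
        self.rng = rng or random.Random(0)
        self.root = self._build(list(range(len(items))))

    def _build(self, idx):
        if not idx:
            return None
        vp = idx.pop(self.rng.randrange(len(idx)))  # random vantage point
        if not idx:
            return (vp, 0.0, None, None)
        dists = [self.dist(self.items[vp], self.items[i]) for i in idx]
        mu = sorted(dists)[len(dists) // 2]  # median split radius
        inner = [i for i, d in zip(idx, dists) if d < mu]
        outer = [i for i, d in zip(idx, dists) if d >= mu]
        return (vp, mu, self._build(inner), self._build(outer))

    def within(self, query, radius):
        """Return indices of all items at distance <= radius from query."""
        out, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            vp, mu, inner, outer = node
            d = self.dist(query, self.items[vp])
            if d <= radius:
                out.append(vp)
            # Triangle inequality: skip a subtree that cannot hold a hit.
            if d - radius < mu:
                stack.append(inner)
            if d + radius >= mu:
                stack.append(outer)
        return out


def dbscan_sketch(items, eps, min_pts, dist=tanimoto_distance):
    """DBSCAN labels per item: 0, 1, ... for clusters, -1 for noise."""
    tree = VPTree(items, dist)
    labels = [None] * len(items)  # None means not visited yet
    cluster = -1
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        neigh = tree.within(items[i], eps)  # includes i itself
        if len(neigh) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neigh)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neigh = tree.within(items[j], eps)
            if len(j_neigh) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neigh)
    return labels
```

Each point triggers one region query, so the overall cost depends on how cheaply the tree answers those queries. The members of any cluster of interest could then be handed to a Butina-style step to pick representatives, as suggested above.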
Dear Patrick,
Maybe some other algorithms are better suited to molecular datasets.
I think the following algorithm is not well known in chemoinformatics, but it is interesting:
Unfortunately, I don't have a Google account, so I cannot comment on your blog (https://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html).
Regards, F.