more algorithms: Butina and/or DBSCAN

UnixJunkie commented 4 years ago

Dear Patrick,

Maybe some other algorithms are more adapted to molecular datasets.

Butina, Darko. "Unsupervised data base clustering based on daylight's fingerprint and Tanimoto similarity: A fast and automated way to cluster small and large data sets." Journal of Chemical Information and Computer Sciences 39.4 (1999): 747-750.
https://pubs.acs.org/doi/10.1021/ci9803381

I think the following algorithm is not so known in chemoinformatics, but interesting:

Ester, Martin, et al. "A density-based algorithm for discovering clusters in large spatial databases with noise." Kdd. Vol. 96. No. 34. 1996.
https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf

Unfortunately, I don't have a google account, so I cannot comment in your blog (https://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html)

Regards, F.

PatWalters commented 4 years ago

Thanks for the comments. I'm very aware of the Butina algorithm, I use it all the time. Unfortunately, that algorithm scales as n-squared with the number of molecules. As such, it's not great for large datasets. I haven't tried the other algorithm, but I'll check it out.

Pat

UnixJunkie commented 4 years ago

The authors say that DBSCAN has O(n * log(n)) complexity. So, it seems to not require the full (all to all) distance matrix. But, you will need a data structure to run region queries (like a mu-tree or a bisector tree, or a vantage-point tree).

UnixJunkie commented 4 years ago

Maybe DBSCAN to do the overall clustering, then Butina to enumerate representatives of each cluster of interest would be quite interesting.

PatWalters / faiss_kmeans

more algorithms: Butina and/or DBSCAN #2