IBM / Hestia-OOD

Independent evaluation set construction for trustworthy ML models in biochemistry
https://ibm.github.io/Hestia-OOD/
MIT License

Optimising connected components clustering #50

Open RaulFD-creator opened 1 month ago

RaulFD-creator commented 1 month ago

Motivation: The current implementation requires building an n x n similarity matrix, which has a quadratic memory footprint. Using sparse matrices from scipy alleviates this somewhat. However, the pass that constructs the matrix may already provide enough information to compute the connected components directly, similar to how CD-HIT clustering works.

Possible implementation: A CD-HIT-style approach that iterates through the sim_df directly.
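A minimal sketch of what iterating through the pairs directly could look like, not Hestia's actual code. It uses a union-find (disjoint-set) structure, so it computes true connected components rather than reproducing CD-HIT's greedy representative-based procedure. The function name `connected_components_from_pairs`, the `(query, target, similarity)` triple layout, integer item indices in `range(n_items)`, and a "similarity >= threshold means linked" rule are all assumptions made for illustration.

```python
def connected_components_from_pairs(pairs, n_items, threshold):
    """Label connected components from an iterable of (query, target, similarity).

    Hypothetical helper: only O(n_items) extra memory is used, and no
    n x n matrix (dense or sparse) is ever built.
    """
    parent = list(range(n_items))  # parent[i] == i means i is a root
    size = [1] * n_items           # component size, valid at roots only

    def find(i):
        # Locate the root of i.
        root = i
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node straight at the root.
        while parent[i] != root:
            nxt = parent[i]
            parent[i] = root
            i = nxt
        return root

    for query, target, similarity in pairs:
        if similarity < threshold:
            continue  # assumed rule: pairs below the threshold are not linked
        root_q, root_t = find(query), find(target)
        if root_q == root_t:
            continue  # already in the same component
        # Union by size: attach the smaller tree under the larger one.
        if size[root_q] < size[root_t]:
            root_q, root_t = root_t, root_q
        parent[root_t] = root_q
        size[root_q] += size[root_t]

    # Relabel roots as consecutive cluster ids in order of first appearance.
    cluster_of_root = {}
    labels = []
    for i in range(n_items):
        root = find(i)
        labels.append(cluster_of_root.setdefault(root, len(cluster_of_root)))
    return labels


# Toy input: five items, three scored pairs.
demo_pairs = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.2)]
labels = connected_components_from_pairs(demo_pairs, n_items=5, threshold=0.5)
```

If sim_df is a pandas DataFrame holding the query index, target index, and similarity in three columns (an assumption about its layout), its rows could be streamed with something like `sim_df.itertuples(index=False, name=None)`, after selecting those three columns in that order.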