Motivation
The current implementation requires preparing an n x n similarity matrix, which has a quadratic memory footprint. Using sparse matrices from scipy alleviates this somewhat. However, the process of constructing the matrix might already be enough to compute the connected components, similar to how CD-HIT clustering works.
Possible implementation
CD-HIT style iteration through the sim_df directly.
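
A minimal sketch of what this could look like, assuming the rows of sim_df can be streamed as (query, target, similarity) tuples. The function name, the tuple layout, and the threshold parameter are hypothetical and not taken from the existing code. Note that this uses a union-find (disjoint-set) structure rather than CD-HIT's greedy representative-based procedure; it only borrows the idea of assigning clusters in a single pass over the pairs, so memory stays linear in the number of entities instead of quadratic.

```python
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable, float]


def connected_components(edges: Iterable[Edge], threshold: float) -> Dict[Hashable, Hashable]:
    """Map every entity to a component representative without building an n x n matrix.

    `edges` is assumed to yield (query, target, similarity) tuples, e.g. the rows
    of sim_df; the column order is an assumption made for this sketch.
    """
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        # Register unseen entities as their own singleton component.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the walked path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for query, target, similarity in edges:
        root_q, root_t = find(query), find(target)
        # Merge the two components only when the pair is similar enough.
        if similarity >= threshold and root_q != root_t:
            parent[root_t] = root_q

    return {node: find(node) for node in parent}
```

With pandas, the pairs could be streamed with something like `sim_df.itertuples(index=False)`, provided the columns come in the order assumed above. Entities that never appear in any row of sim_df are not seen by this function and would need to be added as singletons separately.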