Optimising connected components clustering

Motivation The current implementation requires preparing an n x n similarity matrix, which has a quadratic memory footprint. This is somewhat alleviated by using sparse matrices from scipy. However, the pass that constructs the matrix might already be enough to compute the connected components as well, similar to how CD-HIT clustering works.

Possible implementation Iterate through sim_df directly, CD-HIT style, assigning entities to clusters as each pair is read instead of building the matrix first.
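
A minimal sketch of what this could look like, not the project's actual code. It swaps the matrix for a union-find (disjoint-set) structure, which is one common way to get connected components in a single pass over pairwise similarities with memory proportional to the number of entities. The function name `connected_components`, the `(query, target, similarity)` row layout, and the `threshold` parameter are all assumptions made for illustration; it also assumes that higher values mean more similar.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def connected_components(
    edges: Iterable[Tuple[Hashable, Hashable, float]],
    threshold: float,
) -> List[List[Hashable]]:
    """Group entities into connected components without an n x n matrix.

    `edges` yields hypothetical (query, target, similarity) rows; two
    entities are linked when similarity >= threshold (an assumption).
    """
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        # Register unseen entities as their own singleton cluster.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for query, target, similarity in edges:
        root_q, root_t = find(query), find(target)
        # Merge the two clusters only when the pair is similar enough.
        if similarity >= threshold and root_q != root_t:
            parent[root_t] = root_q

    # Collect members by their final root.
    clusters: Dict[Hashable, List[Hashable]] = {}
    for node in parent:
        clusters.setdefault(find(node), []).append(node)
    return list(clusters.values())
```

With a pandas DataFrame, the rows could be streamed with something like `connected_components(sim_df[["query", "target", "metric"]].itertuples(index=False, name=None), threshold=0.3)`, where the column names are placeholders for whatever sim_df actually contains. Entities that never appear in sim_df would not show up in the result and would need to be added as singletons separately.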