blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Run HDBSCAN directly on genetic distances and compare clusters to those from embeddings #99

Open huddlej opened 1 month ago

huddlej commented 1 month ago

To test this idea, we can add rules to the early flu workflow that create clusters directly from the genetic distances and include these clusters in the grid search across distance thresholds that we currently run for each embedding method. We can treat the genetic distance clusters as their own "method", so that the "full_HDBSCAN_metadata.csv" output contains rows for this alternate approach with variation of information (VI) values per distance threshold. Then we can compare the VI values directly across all methods.
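As a rough illustration of the comparison described above, here is a minimal sketch of computing VI between known clade labels and cluster labels for each (method, threshold) pair. The function name, the row keys, and all of the labels are hypothetical toy values for illustration. They are not results and not the workflow's actual code or the real column names of "full_HDBSCAN_metadata.csv".

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI (in nats) between two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both labelings must cover the same samples.")

    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # VI = H(A|B) + H(B|A), summed over the observed label pairs.
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


# Toy inputs (made up): clade labels and cluster labels per method and threshold.
clades = ["A", "A", "A", "B", "B", "B"]
clusters = {
    ("genetic", 0.0): [0, 0, 1, 2, 2, 3],
    ("genetic", 2.0): [0, 0, 0, 1, 1, 1],
    ("mds", 0.0): [0, 0, 0, 0, 1, 1],
}

# One row per (method, threshold), so all methods can be ranked by VI together.
rows = [
    {"method": method, "distance_threshold": threshold,
     "vi": variation_of_information(clades, labels)}
    for (method, threshold), labels in clusters.items()
]
best = min(rows, key=lambda row: row["vi"])
print(best)
```

Lower VI means the clusters agree more closely with the clades, so treating genetic distances as one more "method" lets the same ranking apply to every row.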

One cool feature of the way we've implemented pathogen-cluster is that it already accepts a distance matrix through the --embedding input argument, so users can skip the embeddings and get HDBSCAN clusters directly from distances.

I originally suspected that HDBSCAN clusters from genetic distances would be most similar to clusters from MDS embeddings, since MDS maintains a nearly one-to-one mapping between genetic and Euclidean distances. However, an initial test did not support this hypothesis: clustering on genetic distances produced far more clusters than clustering on MDS with 3 components. Maybe the more appropriate comparison would be with MDS using ~N components.
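One possible reason for the mismatch is that a 3-component embedding can only approximate the full set of pairwise distances. The sketch below illustrates that effect with classical (eigendecomposition-based) MDS in NumPy on random points. This is a swapped-in technique chosen for brevity, not necessarily the MDS variant used in the workflow, and the random data stand in for real genetic distances.

```python
import numpy as np


def classical_mds(distances, n_components):
    """Embed a distance matrix with classical MDS (double centering + eigh)."""
    n = distances.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (distances ** 2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    # Keep the largest eigenvalues and clip tiny negatives from rounding.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    top_values = np.clip(eigenvalues[order], 0.0, None)
    return eigenvectors[:, order] * np.sqrt(top_values)


def pairwise_euclidean(points):
    """Return the matrix of Euclidean distances between rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Stand-in data: 30 random points whose true distances live in 10 dimensions.
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 10))
distances = pairwise_euclidean(points)

# Largest absolute gap between original and embedded distances per setting.
max_error = {}
for n_components in (3, 10):
    embedding = classical_mds(distances, n_components)
    max_error[n_components] = np.abs(pairwise_euclidean(embedding) - distances).max()

print(max_error)
```

In this toy setting the 3-component embedding distorts the distances while the 10-component embedding recovers them almost exactly, which is consistent with comparing genetic-distance clusters against an MDS embedding with many more components.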

Even if HDBSCAN clusters from genetic distances are closer to known clades, we will want to continue using the embedding-based clusters for most of the paper, since the main purpose of these clusters is to augment and evaluate the visual interpretability of the embeddings.