blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Use new pathogen-embed interface to embed alignments and tune HDBSCAN hyperparameters #56

Closed huddlej closed 1 year ago

huddlej commented 1 year ago

Uses the new commands in the pathogen-embed module (pathogen-distance, pathogen-embed, and pathogen-cluster) to embed alignments separately from clustering, identify optimal HDBSCAN distance thresholds per method and clade definition for H3N2 and SARS-CoV-2 training data, and apply these optimal values to H3N2 and SARS-CoV-2 test data.

This PR introduces new early and late SARS-CoV-2 data for the training and test split, respectively, and identifies the optimal cluster thresholds for both Nextstrain clades and collapsed Nextclade Pango lineages. These two types of clade definition reflect different operational needs for "clades" and allow us to test the genetic resolution of clusters produced by different embeddings after we've already optimized method parameters to match Euclidean/genetic distance.
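
As a rough illustration of the threshold search described above, the sketch below scores each candidate distance threshold by how closely the resulting clusters agree with known clade labels and keeps the best one. It is not the code in this PR. The agreement metric (variation of information), the function names `variation_of_information` and `pick_threshold`, and the toy `cluster_at` function are all assumptions made for illustration. In the real workflow, `cluster_at` would presumably wrap a call to `pathogen-cluster` (HDBSCAN) at the given threshold, and the search would be repeated per embedding method and per clade definition.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI (in nats) between two labelings of the same samples.

    VI is 0 when both labelings define the same partition and grows as
    they disagree. Using VI here is an assumption, not the PR's stated metric.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # VI = H(A|B) + H(B|A), accumulated one joint cell at a time.
        vi -= p_ab * (log(p_ab / (count_a[a] / n)) + log(p_ab / (count_b[b] / n)))
    return vi


def pick_threshold(thresholds, cluster_at, clade_labels):
    """Return (threshold, VI) for the threshold whose clusters best match the clades.

    `cluster_at` is a hypothetical callable mapping a threshold to one
    cluster label per sample.
    """
    scored = [
        (variation_of_information(cluster_at(t), clade_labels), t)
        for t in thresholds
    ]
    best_vi, best_t = min(scored)
    return best_t, best_vi


# Toy stand-in for HDBSCAN: samples on a line, split wherever the gap
# between neighbors exceeds the threshold. Positions and clades are made up.
positions = [0.0, 0.5, 1.0, 4.0, 4.5, 9.0, 9.4]
clades = ["A", "A", "A", "B", "B", "C", "C"]


def cluster_at(threshold):
    labels, current = [0], 0
    for left, right in zip(positions, positions[1:]):
        if right - left > threshold:
            current += 1
        labels.append(current)
    return labels


best_t, best_vi = pick_threshold([0.25, 1.0, 3.5, 5.0], cluster_at, clades)
print(best_t, best_vi)
```

In this toy example, the threshold chosen on the "training" labels would then be reused unchanged on held-out data, mirroring how the optimal values from the training data are applied to the H3N2 and SARS-CoV-2 test data.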