Development checklist
[x] Add workflow logic to find optimal cluster threshold for clusters based on genetic distances
[x] Rerun workflow to produce cluster accuracies for early flu and SC2 datasets
[x] Update accuracy by threshold figure to not refer to "Euclidean" distance threshold
[x] Rerun all workflow analyses related to clusters to get tables with genetic cluster mutations and monophyly and figures for within-between cluster distances, etc.
[x] Expand accuracy table (Supplementary Table S1) to include VI values for late flu, HA/NA, and late SC2 data by adding columns for late datasets to existing table.
[x] Add number of cluster labels per dataset as a column, too.
[x] Update manuscript text to reflect clustering by genetic distances in methods and results
Description
Adds rules to all natural flu and SARS-CoV-2 workflows that apply HDBSCAN clustering to the genetic distance matrix we use to produce the embeddings. We label this clustering method "genetic" and include it in the grid search that finds the optimal distance threshold per method for early H3N2 HA data. This PR also updates tables, figures, and manuscript text so that these genetic distance clusters serve as a point of comparison to the embedding clusters.
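For reviewers who want the shape of the threshold search in one place, here is a minimal sketch, not the workflow's actual code. It assumes cluster accuracy is scored by variation of information (VI) against known clade labels, as the accuracy table suggests, and that lower VI is better. All function names are hypothetical, and `threshold_clusters` is a simple connected-components stand-in for HDBSCAN so that the example runs with the standard library only.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """VI between two labelings of the same samples (natural log).

    VI = H(A|B) + H(B|A); it is 0 when the two partitions are identical.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        vi -= p_ab * (math.log(p_ab / p_a) + math.log(p_ab / p_b))
    return vi


def threshold_clusters(distances, threshold):
    """Stand-in clusterer (NOT HDBSCAN): connected components of the graph
    that links any two samples whose pairwise distance is <= threshold."""
    n = len(distances)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and distances[i][j] <= threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


def best_threshold(distances, true_labels, thresholds, cluster=threshold_clusters):
    """Grid search: cluster at each threshold and keep the one with lowest VI."""
    scores = {
        t: variation_of_information(true_labels, cluster(distances, t))
        for t in thresholds
    }
    return min(scores, key=scores.get), scores
```

Passing a different `cluster` callable to `best_threshold` would let the same search run over the embedding methods and the "genetic" method alike, which is the comparison this PR sets up.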
Related issues
Depends on https://github.com/blab/pathogen-embed/pull/33
Closes #99