Uses the new commands in the pathogen-embed module (pathogen-distance, pathogen-embed, and pathogen-cluster) to embed alignments separately from clustering, identify optimal HDBSCAN distance thresholds per method and clade definition for H3N2 and SARS-CoV-2 training data, and apply these optimal values to H3N2 and SARS-CoV-2 test data.
This PR introduces new early and late SARS-CoV-2 datasets as the training and test sets, respectively, and identifies the optimal cluster thresholds for both Nextstrain clades and collapsed Nextclade Pango lineages. These two types of clade definition reflect different operational needs for "clades" and allow us to test the genetic resolution of clusters produced by different embeddings after we have already optimized method parameters to match Euclidean/genetic distance.
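The train-then-apply threshold selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it substitutes a simple single-linkage clustering at a fixed Euclidean distance for HDBSCAN, scores clusters against clade labels with a pairwise agreement measure chosen only for the example (the metric actually used is not stated here), and runs on made-up 2D coordinates. All function names and data are hypothetical and none of them belong to the pathogen-embed API.

```python
from itertools import combinations


def cluster_by_threshold(points, threshold):
    """Stand-in for HDBSCAN: link points whose Euclidean distance is
    at most `threshold` and return one cluster label per point."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
        if dist <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


def pair_agreement(cluster_labels, clade_labels):
    """Fraction of sample pairs on which clusters and clades agree
    about whether the two samples belong together (assumed metric)."""
    pairs = list(combinations(range(len(clade_labels)), 2))
    agree = sum(
        (cluster_labels[i] == cluster_labels[j]) == (clade_labels[i] == clade_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def best_threshold(points, clade_labels, candidates):
    """Pick the candidate threshold with the highest training agreement."""
    return max(
        candidates,
        key=lambda t: pair_agreement(cluster_by_threshold(points, t), clade_labels),
    )


# Hypothetical training embedding (e.g. early data) with known clade labels.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_clades = ["A", "A", "A", "B", "B", "B"]

# Hypothetical test embedding (e.g. late data).
test_points = [(0, 0), (1, 1), (7, 7), (8, 7)]
test_clades = ["A", "A", "B", "B"]

# Choose the threshold on training data only, then apply it unchanged to test data.
threshold = best_threshold(train_points, train_clades, [0.5, 1.5, 10.0])
test_score = pair_agreement(cluster_by_threshold(test_points, threshold), test_clades)
print(threshold, test_score)
```

Repeating the search once per clade definition (for example, Nextstrain clades versus collapsed Pango lineages) would give one threshold per method and definition, which is the structure described in this PR.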