Closed by nandsra21 10 months ago
Once we have these calculations, we can drop the "nucleotide diversity" calculations across the entire alignments. The within/between group information is more informative for this project than the overall diversity in a given alignment.
The calculation should take a distance matrix, a metadata file (a TSV file with cluster annotations in it), and the name of the column to group by (e.g., mds_label, nextstrain_clade, or any other column that carries the clade information). Records assigned to the -1 cluster should be ignored.
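A minimal sketch of that calculation follows. It assumes the distance matrix is already loaded as a square pandas DataFrame whose rows and columns are in the same strain order, and that the metadata is indexed by strain name. The function name `within_between_distances`, the `ignored_label` parameter, and the toy data are all hypothetical, not part of any existing script.

```python
import numpy as np
import pandas as pd


def within_between_distances(distances, metadata, group_column, ignored_label="-1"):
    """Split pairwise distances into within-group and between-group values.

    Assumes `distances` is a square DataFrame with identical row and column
    order, and `metadata` is indexed by the same strain names.
    """
    # Align labels to the matrix order and compare them as strings so that
    # integer and string -1 labels are treated the same way.
    labels = metadata.loc[distances.index, group_column].astype(str)
    keep = (labels != ignored_label).to_numpy()

    values = distances.to_numpy()[np.ix_(keep, keep)]
    groups = labels.to_numpy()[keep]

    # Use each unordered pair once: upper triangle, excluding the diagonal.
    i, j = np.triu_indices(len(groups), k=1)
    same_group = groups[i] == groups[j]
    pair_values = values[i, j]
    return pair_values[same_group], pair_values[~same_group]


# Toy data for illustration only (not real results).
strains = ["A", "B", "C", "D"]
distances = pd.DataFrame(
    [[0, 1, 4, 9], [1, 0, 5, 9], [4, 5, 0, 9], [9, 9, 9, 0]],
    index=strains,
    columns=strains,
)
metadata = pd.DataFrame({"mds_label": [0, 0, 1, -1]}, index=strains)

within, between = within_between_distances(distances, metadata, "mds_label")
```

With the toy data, strain D carries the -1 label and is dropped, so only the A-B pair counts as within-group and the A-C and B-C pairs count as between-group.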
We want to understand the genetic resolution of clusters found in our embeddings compared to expert- and model-defined cluster annotations (e.g., Nextstrain clade, Nextclade pango lineage, etc.). Specifically, we want to know the average genetic distance within and between clades and clusters for each pathogen, embedding method, and clade type.
We might want to display these distances in a single plot that shows the within/between relationship for all pathogens, or in separate plots per pathogen, depending on how complex the figure becomes. This could be a supplemental figure. Alternatively, we could summarize the same results in a table with columns for the average +/- std dev of the within and between distances, and one row per pathogen and per method cluster/clade definition.
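The table option could look something like the sketch below. It assumes the within and between distances have already been collected per pathogen and per cluster/clade definition. The `summarize_distances` function, the column names, and the placeholder values are hypothetical, and the standard deviation is NumPy's default population version (`ddof=0`), which may not be the one we end up wanting.

```python
import numpy as np
import pandas as pd


def summarize_distances(records):
    """Build one summary row per (pathogen, method) pair.

    `records` is an iterable of (pathogen, method, within, between) tuples,
    where `within` and `between` are sequences of pairwise distances.
    """
    rows = []
    for pathogen, method, within, between in records:
        rows.append(
            {
                "pathogen": pathogen,
                "method": method,
                "within_mean": np.mean(within),
                "within_std": np.std(within),  # population std dev (ddof=0)
                "between_mean": np.mean(between),
                "between_std": np.std(between),
            }
        )
    return pd.DataFrame(rows)


# Placeholder values for illustration only (not real results).
summary = summarize_distances(
    [("pathogen_a", "mds_label", [1.0, 3.0], [4.0, 6.0])]
)
```

Each call to the within/between calculation contributes one row, so the same table can hold both embedding clusters and clade annotations for every pathogen.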