Calculate within and between clade/cluster genetic distances per embedding and clade definition

nandsra21 commented 1 year ago

We want to understand the genetic resolution of clusters found in our embeddings compared to expert- and model-defined cluster annotations (e.g., Nextstrain clade, Nextclade pango lineage, etc.). Specifically, we want to know the average genetic distance within and between clades and clusters for each pathogen, embedding method, and clade type.

We might want to display these distances in a single plot that shows the within/between relationship per pathogen or in separate plots per pathogen, depending on how complex the figure becomes. This could be a supplemental figure. Alternatively, we could summarize the same results in a table with columns for the average +/- std dev of within and between distances and one row per pathogen and cluster/clade definition.

[x] Create a script to calculate the average and std dev genetic distance within and between groups given a distance matrix, a per-strain "group" definition file (e.g., a TSV with Nextstrain clade or a CSV with HDBSCAN cluster labels from a t-SNE embedding), and the column of the given file to use for groups
[x] Calculate within/between distance for early flu (Nextstrain clade and clusters per embedding method)
[x] Calculate within/between distance for late flu (Nextstrain clade and clusters per embedding method)
[x] Calculate within/between distance for HA/NA flu (TreeKnit MCCs and clusters per embedding method)
[x] Calculate within/between distance for early SARS-CoV-2 (Nextstrain clade, Nextclade pango lineage, and clusters per embedding method)
[x] Calculate within/between distance for late SARS-CoV-2 (Nextstrain clade, Nextclade pango lineage, and clusters per embedding method)
[x] Aggregate distances per pathogen into a table or figure
[x] Remove old "nucleotide diversity" calculations with pixy or dendropy and remove associated dependencies (pixy, dendropy, samtools, snp-sites, tabix)
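
As a rough illustration of the calculation described in the first task, here is a minimal sketch that uses only the Python standard library. The function name `within_between_distances`, the list-of-lists matrix format, and the toy distances are assumptions made for this example and are not taken from the actual script. It also assumes the sample standard deviation, which may differ from what the project reports.

```python
from itertools import combinations
from statistics import mean, stdev


def within_between_distances(distances, groups):
    """Summarize pairwise distances within and between groups.

    distances: square, symmetric matrix (list of lists) of pairwise distances.
    groups: one group label per strain, in the same order as the matrix rows.
    """
    within, between = [], []
    # Visit each unordered pair of strains once (upper triangle, no diagonal).
    for i, j in combinations(range(len(groups)), 2):
        if groups[i] == groups[j]:
            within.append(distances[i][j])
        else:
            between.append(distances[i][j])

    def summarize(values):
        # The sample std dev needs at least two values; report None otherwise.
        return {
            "mean": mean(values) if values else None,
            "std": stdev(values) if len(values) > 1 else None,
            "n_pairs": len(values),
        }

    return {"within": summarize(within), "between": summarize(between)}


# Toy example: four strains in two groups (made-up distances).
toy_distances = [
    [0, 1, 5, 6],
    [1, 0, 5, 7],
    [5, 5, 0, 2],
    [6, 7, 2, 0],
]
toy_groups = ["A", "A", "B", "B"]
result = within_between_distances(toy_distances, toy_groups)
print(result["within"]["mean"], result["between"]["mean"])
```

Each pair of strains is counted once, so the averages are over pairs rather than over groups. A per-group average would weight small clades more heavily and would be a different design choice.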

huddlej commented 1 year ago

Once we have these calculations, we can drop the "nucleotide diversity" calculations across the entire alignments. The within/between group information is more informative for this project than the overall diversity in a given alignment.

nandsra21 commented 1 year ago

The script should take a distance matrix, a metadata file (a TSV file with cluster annotations in it), and the name of the column to use for grouping (e.g., mds_label or nextstrain_clade). It should ignore the -1 cluster, which is the label HDBSCAN gives to strains that were not assigned to any cluster.
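
One possible way to handle the inputs and the -1 label is sketched below, again with only the standard library. The helper names `load_groups` and `subset_distances`, the `strain` column name, and the toy metadata are hypothetical and only for illustration. The actual metadata layout and distance matrix format may differ.

```python
import csv
import io


def load_groups(handle, group_column, strain_column="strain", noise_label="-1"):
    """Map each strain name to its group label, skipping unclustered strains."""
    reader = csv.DictReader(handle, delimiter="\t")
    groups = {}
    for row in reader:
        label = row[group_column]
        # HDBSCAN labels unclustered points as -1; they do not form a group.
        if label == noise_label:
            continue
        groups[row[strain_column]] = label
    return groups


def subset_distances(strains, distances, keep):
    """Keep only the matrix rows and columns for strains present in `keep`."""
    idx = [i for i, strain in enumerate(strains) if strain in keep]
    kept_strains = [strains[i] for i in idx]
    kept_distances = [[distances[i][j] for j in idx] for i in idx]
    return kept_strains, kept_distances


# Toy metadata (made up): strain s3 is unclustered.
toy_metadata = io.StringIO(
    "strain\tmds_label\n"
    "s1\t0\n"
    "s2\t0\n"
    "s3\t-1\n"
    "s4\t1\n"
)
toy_strains = ["s1", "s2", "s3", "s4"]
toy_distances = [
    [0, 1, 4, 6],
    [1, 0, 3, 7],
    [4, 3, 0, 2],
    [6, 7, 2, 0],
]

groups = load_groups(toy_metadata, "mds_label")
kept_strains, kept_distances = subset_distances(toy_strains, toy_distances, groups)
print(kept_strains)
```

Dropping the -1 strains before computing distances keeps them out of both the within and the between summaries. Passing a column such as nextstrain_clade works the same way, since no clade would normally carry the -1 label.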

blab / cartography

Calculate within and between clade/cluster genetic distances per embedding and clade definition #53