blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Calculate within and between clade/cluster genetic distances per embedding and clade definition #53

Closed nandsra21 closed 10 months ago

nandsra21 commented 1 year ago

We want to understand the genetic resolution of clusters found in our embeddings compared to expert- and model-defined cluster annotations (e.g., Nextstrain clade, Nextclade pango lineage, etc.). Specifically, we want to know the average genetic distance within and between clades and clusters for each pathogen, embedding method, and clade type.

We might want to display these distances in a single plot that shows the within/between relationship per pathogen or in separate plots per pathogen, depending on how complex the figure becomes. This could be a supplemental figure. Alternatively, we could summarize the same results in a table with columns for the mean +/- std dev of the within and between distances and one row per pathogen, embedding method, and cluster/clade definition.
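For the table option, a minimal sketch of the layout might look like the following. It assumes the within/between means and standard deviations have already been calculated per pathogen and grouping column. The function name `format_summary`, the record keys, and the values are all hypothetical placeholders, not results from this project.

```python
import pandas as pd

# Placeholder records (NOT real results): one per pathogen and grouping column.
records = [
    {"pathogen": "example_pathogen", "group_column": "mds_label",
     "mean_within": 2.0, "std_within": 1.0,
     "mean_between": 5.0, "std_between": 1.0},
    {"pathogen": "example_pathogen", "group_column": "nextstrain_clade",
     "mean_within": 3.0, "std_within": 0.5,
     "mean_between": 6.0, "std_between": 1.5},
]


def format_summary(records):
    """Build a table with one row per pathogen and grouping column and
    one 'mean +/- std dev' column each for within and between distances."""
    table = pd.DataFrame(records)
    for kind in ("within", "between"):
        mean_text = table[f"mean_{kind}"].map("{:.2f}".format)
        std_text = table[f"std_{kind}"].map("{:.2f}".format)
        table[kind] = mean_text + " +/- " + std_text
    return table.set_index(["pathogen", "group_column"])[["within", "between"]]


print(format_summary(records))
```

Keeping the numeric columns in the records and formatting only at the end means the same records could also feed a plot if the figure turns out to be the better choice.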

huddlej commented 1 year ago

Once we have these calculations, we can drop the "nucleotide diversity" calculations across the entire alignments. The within/between group information is more informative for this project than the overall diversity in a given alignment.

nandsra21 commented 1 year ago

The calculation should ignore the -1 cluster and take three inputs: a distance matrix, a metadata file (a TSV file with the cluster annotations in it), and the name of the column to group by (mds_label, nextstrain_clade, etc. could supply the clade information).
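A rough sketch of that calculation, under some assumptions: the distance matrix is loaded as a square pandas DataFrame with strain names as both index and columns, and the metadata is a DataFrame indexed by the same strain names. The function name `within_between_distances` and the `ignore_label` parameter are hypothetical, and this is not necessarily how the final script works.

```python
import numpy as np
import pandas as pd


def within_between_distances(distances, metadata, group_column, ignore_label=-1):
    """Summarize pairwise distances within and between groups.

    distances: square DataFrame, strain names as index and columns (assumed layout).
    metadata: DataFrame indexed by strain name with a `group_column` column.
    """
    # Keep strains present in both inputs, then drop missing and ignored labels.
    shared = distances.index.intersection(metadata.index)
    labels = metadata.loc[shared, group_column].dropna()
    labels = labels[labels != ignore_label]
    strains = labels.index
    matrix = distances.loc[strains, strains].to_numpy(dtype=float)

    # Integer codes make the group comparison independent of the label type.
    codes = pd.factorize(labels)[0]

    # Upper triangle only: each pair counts once and self-distances are skipped.
    rows, cols = np.triu_indices(len(strains), k=1)
    pair_distances = matrix[rows, cols]
    same_group = codes[rows] == codes[cols]

    within = pair_distances[same_group]
    between = pair_distances[~same_group]
    return {
        "mean_within": within.mean(),
        "std_within": within.std(),
        "mean_between": between.mean(),
        "std_between": between.std(),
    }
```

The metadata TSV could be loaded with `pd.read_csv(path, sep="\t", index_col=...)`, using whichever column holds the strain names. Running the function once per pathogen, embedding method, and grouping column (e.g., `mds_label` or `nextstrain_clade`) would give one record per combination. Note that `std()` here is NumPy's population standard deviation (`ddof=0`).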