blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Assess cluster accuracy compared to Pango lineages for embeddings within a specific SC2 clade #110

Closed huddlej closed 1 month ago

huddlej commented 4 months ago

Select a Nextstrain clade from the early SC2 dataset, such as Delta (21J), whose Pango lineages are poorly resolved in all embeddings. Run all four embedding methods on just the Delta/21J samples, then calculate the distance between the resulting clusters and the Pango lineages for the same samples. This tests whether focusing on a specific clade or time period improves cluster resolution. Add this result to the supplement, along with an embedding/tree figure for late SC2 colored by Pango lineage.
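To make the "distance between clusters and Pango lineages" concrete, here is a minimal sketch of one way to compare two labelings of the same samples. The issue does not name the metric. This sketch assumes variation of information (VI) normalized by log(n), a standard distance between partitions. The function names and the normalization are illustrative assumptions, not the workflow's actual code.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI(A, B) = H(A) + H(B) - 2 * I(A; B) in nats.

    labels_a and labels_b are equal-length sequences giving, for each
    sample, its cluster label and its lineage label respectively.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both labelings must cover the same samples.")

    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Equivalent to summing H(A|B) + H(B|A) over the joint distribution.
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))

    return vi


def normalized_vi(labels_a, labels_b):
    """Scale VI to [0, 1] by dividing by log(n) (assumed normalization)."""
    n = len(labels_a)
    if n < 2:
        return 0.0
    return variation_of_information(labels_a, labels_b) / log(n)
```

Under this definition, a value of 0 means the clusters reproduce the Pango lineages exactly, and larger values mean the two groupings disagree more, so a lower value for the 21J-only embeddings than for the full early SC2 embeddings would indicate improved resolution.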

I would add this analysis as a sub-analysis of the early SC2 workflow with new rules for the following steps:

  1. Filter metadata to only samples from Nextclade clade 21J
  2. Filter the aligned FASTA file to only the samples from 21J
  3. Create a distance matrix for the 21J alignment
  4. Create a "clean" alignment for PCA from the 21J alignment
  5. Run all four embedding methods on the corresponding alignment and/or distance matrix
  6. Find clusters in each embedding with the optimal cluster parameters
  7. Calculate cluster accuracy per method compared to collapsed Pango lineages from the 21J metadata file
  8. Report this cluster accuracy in the results for early SC2
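
Steps 1 through 3 above could be prototyped outside the workflow roughly as follows. This is a standalone sketch using only the Python standard library, not the rules that would be added to the workflow. The column names `strain` and `clade`, the helper names, and the choice to ignore non-ACGT characters when counting differences are all assumptions made for illustration.

```python
import csv
import io
from itertools import combinations


def read_fasta(handle):
    """Parse an aligned FASTA file into a dict of {sample name: sequence}."""
    parts = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {name: "".join(chunks) for name, chunks in parts.items()}


def filter_to_clade(metadata_handle, sequences, clade,
                    id_column="strain", clade_column="clade"):
    """Keep metadata rows and sequences for one clade (steps 1 and 2).

    The column names are hypothetical defaults; adjust to the real metadata.
    """
    reader = csv.DictReader(metadata_handle, delimiter="\t")
    rows = [
        row for row in reader
        if row[clade_column] == clade and row[id_column] in sequences
    ]
    subset = {row[id_column]: sequences[row[id_column]] for row in rows}
    return rows, subset


def hamming_distance(seq_a, seq_b, valid="ACGT"):
    """Count mismatches, skipping sites where either base is not A/C/G/T."""
    return sum(
        1 for x, y in zip(seq_a, seq_b)
        if x != y and x in valid and y in valid
    )


def distance_matrix(sequences):
    """Build a symmetric pairwise distance matrix (step 3)."""
    names = list(sequences)
    matrix = [[0] * len(names) for _ in names]
    for i, j in combinations(range(len(names)), 2):
        d = hamming_distance(sequences[names[i]], sequences[names[j]])
        matrix[i][j] = matrix[j][i] = d
    return names, matrix


# Toy in-memory inputs standing in for the real metadata and alignment files.
toy_metadata = io.StringIO(
    "strain\tclade\nA\t21J\nB\t21J\nC\t21A\nD\t21J\n"
)
toy_alignment = io.StringIO(">A\nACGT\n>B\nACGA\n>C\nTTTT\n>D\nNCGA\n")

all_sequences = read_fasta(toy_alignment)
rows_21j, sequences_21j = filter_to_clade(toy_metadata, all_sequences, "21J")
names_21j, distances_21j = distance_matrix(sequences_21j)
```

The filtered metadata rows would then supply the Pango lineage labels for step 7, and the distance matrix and filtered alignment would feed the embedding methods in step 5.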