Outline paper #29 (blab/cartography)

Below is a proposed outline for the paper that we can edit collaboratively here and use as a reference while we write the text.

[ ] Introduction
- [ ] Phylogenetics is a key component of genomic epidemiology
- [ ] Trees are not necessary or appropriate for all analyses
  - [ ] Pairwise genetic distances combined with epidemiological information can identify clusters
  - [ ] Genome alignments can reveal QC issues and novel mutations
  - [ ] Phylogenetic placement algorithms and alignments can place new genomes on existing trees without full inference
  - [ ] Reassortment and recombination violate phylogenetic assumptions and require alternate methods
  - [ ] Most phylogenetic methods ignore insertions and deletions (indels)
- [ ] Apply model-free dimensionality reduction methods to genome sequences and indel-aware distances to produce embeddings, then evaluate how well these embeddings capture known genetic relationships
[ ] Methods
- [ ] Simulate H3N2-like and coronavirus-like populations to tune embedding parameters per pathogen
- [ ] Build phylogenetic trees
- [ ] Calculate pairwise distances including insertions and deletions
- [ ] Build embeddings with optimal parameters
- [ ] Quantify the relationship between Euclidean distances in each embedding and pairwise genetic distances
- [ ] Assign clusters to embeddings with HDBSCAN
- [ ] Calculate cluster accuracy relative to known genetic group labels with variation of information (VI)
- [ ] Identify optimal cluster parameters by minimizing VI with training data from natural populations
- [ ] Identify cluster-specific mutations
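
As a starting point for the indel-aware pairwise distance step, here is a minimal sketch in plain Python. It assumes the sequences are already aligned, that `-` marks a gap, and that any character outside `ACGT` (for example `N`) is missing data and is skipped. Counting each contiguous gap run as a single indel event is an assumption made for illustration, not a decision recorded in this outline, and the function names `indel_aware_distance` and `pairwise_distances` are hypothetical.

```python
from itertools import combinations

NUCLEOTIDES = set("ACGT")


def indel_aware_distance(seq_a, seq_b):
    """Count substitutions plus indel events between two aligned sequences.

    Assumption: a contiguous run of columns where the same sequence has a gap
    and the other does not counts as ONE indel event.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    distance = 0
    open_gap = None  # which sequence ("a" or "b") owns the current gap run

    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            # Shared gap column: carries no information about this pair.
            continue
        if (a == "-") != (b == "-"):
            owner = "a" if a == "-" else "b"
            if owner != open_gap:
                # A new gap run starts here, so count one indel event.
                distance += 1
                open_gap = owner
            continue
        # Neither sequence has a gap: any open gap run has ended.
        open_gap = None
        if a in NUCLEOTIDES and b in NUCLEOTIDES and a != b:
            distance += 1  # substitution between two unambiguous bases

    return distance


def pairwise_distances(sequences):
    """Return a symmetric nested dict of distances keyed by sequence name."""
    names = sorted(sequences)
    matrix = {name: {name: 0} for name in names}
    for x, y in combinations(names, 2):
        d = indel_aware_distance(sequences[x], sequences[y])
        matrix[x][y] = d
        matrix[y][x] = d
    return matrix


if __name__ == "__main__":
    toy_alignment = {
        "s1": "ACGTACGT",
        "s2": "ACGTTCGT",
        "s3": "AC---CGA",
    }
    for name, row in pairwise_distances(toy_alignment).items():
        print(name, row)
```

A distance matrix of this shape could then be passed to any embedding method that accepts precomputed distances. Whether indels should be weighted equally to substitutions is an open choice for the paper.
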
[ ] Results
- [ ] Simulations reveal optimal embedding parameters
  - [ ] Figure. Representative embeddings per simulated population type and embedding method
    - Influenza-like HA sequences
    - SARS-CoV-2-like sequences with moderate recombination
  - [ ] Table. Optimal embedding parameters per method?
- [ ] Embeddings of influenza HA sequences recapitulate known genetic relationships
  - [ ] Figure. Phylogenetic tree and embeddings for H3N2 HA colored by known clade.
  - [ ] Figure. Euclidean distance correlates with genetic distance in H3N2 HA embeddings
  - [ ] Figure. Clusters of embeddings for H3N2 HA recapitulate expert clade designations
- [ ] Embeddings of influenza HA and NA sequences recapitulate reassortment clusters detected by model-informed methods
  - [ ] Figure. Clusters of embeddings for H3N2 HA and NA identify known reassortment events more accurately than clusters based on HA embeddings alone
- [ ] Embeddings of SARS-CoV-2 sequences recapitulate known genetic relationships including recombination events
  - [ ] Figure. Phylogenetic tree and embeddings of SARS-CoV-2 colored by known Nextstrain clade or Pango lineage
  - [ ] Figure. Euclidean distance correlates with genetic distance in SARS-CoV-2 embeddings
  - [ ] Figure. Clusters of embeddings for SARS-CoV-2 recapitulate Pango lineage designations including recombination events
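
The cluster figures above depend on comparing cluster assignments with known clade or lineage labels using variation of information (VI). The following is a minimal sketch of that calculation using only the standard library. It uses natural logarithms and leaves VI unnormalized, and it treats every label, including an HDBSCAN noise label such as `-1`, as an ordinary group. Those are assumptions for illustration, and the function name `variation_of_information` is hypothetical.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Variation of information between two labelings of the same samples.

    VI = H(A|B) + H(B|A), in nats. It is 0 when the two partitions are
    identical up to relabeling, and lower values mean closer agreement.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_a)
    if n == 0:
        raise ValueError("at least one sample is required")

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))  # joint counts per label pair

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # log(p_ab / p_a) + log(p_ab / p_b), written with raw counts
        vi -= p_ab * (log(n_ab / count_a[a]) + log(n_ab / count_b[b]))
    return vi


if __name__ == "__main__":
    # Toy labels only; these are not results from any real dataset.
    known_clades = ["3c2.A", "3c2.A", "3c3.A", "3c3.A"]
    cluster_ids = [0, 0, 1, 1]
    print(variation_of_information(known_clades, cluster_ids))
```

Dividing the result by `log(n)` would bound it between 0 and 1, which may make values easier to compare across datasets of different sizes.
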
[ ] Discussion
- [ ] Summary of main conclusions
- [ ] Advantages of embedding methods
- [ ] Limitations of embedding methods
- [ ] Next steps
