Below is a proposed outline for the paper that we can edit collaboratively here and use as a reference while we write the text.
[ ] Introduction
[ ] Phylogenetics is a key component of genomic epidemiology
[ ] Trees are not necessary or appropriate for all analyses
[ ] Pairwise distances from genomes plus epi information identify clusters
[ ] Genome alignments can reveal QC issues and novel mutations
[ ] Phylogenetic placement algorithms and alignments can place new genomes on existing trees without full inference
[ ] Reassortment and recombination violate phylogenetic assumptions and require alternative methods
[ ] Most phylogenetic methods ignore insertions and deletions (indels)
[ ] Apply model-free dimensionality reduction methods to genome sequences and indel-aware distances to produce embeddings and understand how these embeddings capture known genetic relationships
[ ] Methods
[ ] Simulate H3N2-like and coronavirus-like populations to tune embedding parameters per pathogen
[ ] Build phylogenetic trees
[ ] Calculate pairwise distances including insertions and deletions
[ ] Build embeddings with optimal parameters
[ ] Calculate relationships between Euclidean and genetic distances
[ ] Assign clusters to embeddings with HDBSCAN
[ ] Calculate cluster accuracy relative to known genetic group labels with variation of information (VI)
[ ] Identify optimal cluster parameters by optimizing VI with training data from natural populations
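
As a minimal sketch of two of the Methods steps above (pairwise distances that include indels, and cluster accuracy measured with VI), the following Python shows one possible formulation. The function names, the choice to count each contiguous gap run as a single indel event, the `-` gap character, and the use of natural logarithms are all assumptions for illustration, not decisions made in this outline. Ambiguous bases such as `N` are not handled.

```python
import math
from collections import Counter


def indel_aware_distance(seq_a, seq_b, gap="-"):
    """Count substitutions plus indel events between two aligned sequences.

    Assumption: a contiguous run of gaps in one sequence counts as one event.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    distance = 0
    open_gap_side = None  # which sequence has an open gap run, if any
    for a, b in zip(seq_a, seq_b):
        a_gap, b_gap = a == gap, b == gap
        if a_gap and b_gap:
            # Both sequences are gapped here, so the column adds no difference.
            continue
        if a_gap or b_gap:
            side = "a" if a_gap else "b"
            if open_gap_side != side:
                # A new gap run starts: count it once as a single indel event.
                distance += 1
                open_gap_side = side
            continue
        open_gap_side = None
        if a != b:
            distance += 1  # substitution
    return distance


def variation_of_information(labels_x, labels_y):
    """Variation of information (in nats) between two clusterings.

    VI = -sum_ij r_ij * [log(r_ij / p_i) + log(r_ij / q_j)], where p and q are
    the cluster frequencies of each clustering and r is their joint frequency.
    """
    n = len(labels_x)
    if n == 0 or n != len(labels_y):
        raise ValueError("label lists must be non-empty and of equal length")

    count_x = Counter(labels_x)
    count_y = Counter(labels_y)
    count_xy = Counter(zip(labels_x, labels_y))

    vi = 0.0
    for (x, y), n_xy in count_xy.items():
        r = n_xy / n
        p = count_x[x] / n
        q = count_y[y] / n
        vi -= r * (math.log(r / p) + math.log(r / q))
    return vi


if __name__ == "__main__":
    # Hypothetical toy inputs, not data from the study.
    print(indel_aware_distance("ACGT--A", "ACCTGGA"))
    print(variation_of_information(["a", "a", "b", "b"], ["a", "a", "b", "b"]))
```

A VI of 0 means the cluster assignments reproduce the known genetic groups exactly; larger values mean the two partitions share less information, so lower is better when tuning cluster parameters.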