This issue documents the tables and figures we want to include in the paper as main and supplemental items.
Figures
Main
Representative embeddings from simulated populations by pathogen (rows) and method (columns) with genomes colored by generation using embeddings with optimal parameter values (figures/simulated-populations-representative-embeddings.png built from notebooks/2023-03-24-plot-best-simulation-embeddings.ipynb).
Seasonal influenza HA embeddings with 2016-2018 data colored by clade (figures/flu-2016-2018-ha-embeddings-by-clade.png)
Seasonal influenza HA (2016-2018) Euclidean embedding distance per method by genetic distance (figures/flu-2016-2018-ha-euclidean-distance-by-genetic-distance.png)
Seasonal influenza HA (2016-2018) embeddings colored by embedding cluster (figures/flu-2016-2018-ha-embeddings-by-cluster.png) and annotated by normalized VI to indicate accuracy of clusters for training data compared to expert clade assignment.
Seasonal influenza HA (2018-2020) embeddings colored by embedding cluster (figures/flu-2018-2020-ha-embeddings-by-cluster.png) and annotated by normalized VI to indicate accuracy of clusters for out-of-sample data compared to expert clade assignment.
Seasonal influenza HA and NA embeddings (2016-2018) colored by TreeKnit MCCs except for the root MCC (figures/flu-2016-2018-ha-na-embeddings-by-mcc.png) and annotated by normalized VI between TreeKnit and embedding clusters for HA-only and HA/NA to indicate accuracy of clusters for out-of-sample data compared to expert clade assignment (TreeKnit Maximally Compatible Clades).
SARS-CoV-2 embeddings (2020-2021) colored by Nextstrain clade (figures/sarscov2-embeddings-by-Nextstrain_clade-clade.png)
SARS-CoV-2 (2020-2021) Euclidean embedding distance per method by genetic distance (figures/sarscov2-euclidean-distance-by-genetic-distance.png)
SARS-CoV-2 (2020-2021) embeddings colored by embedding cluster (figures/sarscov2-embeddings-by-cluster-vs-Nextstrain_clade.png) and annotated by normalized VI to indicate accuracy of clusters for training data compared to expert clade assignment (Nextstrain clade).
SARS-CoV-2 (2022-2023) embeddings colored by embedding cluster (figures/sarscov2-test-embeddings-by-cluster-vs-Nextstrain_clade.png) and annotated by normalized VI to indicate accuracy of clusters for out-of-sample data compared to expert clade assignment (Nextstrain clade).
Within and between group genetic distances by pathogen dataset and group type.
Supplemental
MAEs from grid search of simulated influenza-like populations by method (figures/simulated-influenza-like-with-no-reassortment-scores-by-parameters.png)
MAEs from grid search of simulated coronavirus-like populations by method (figures/simulated-coronavirus-like-with-moderate-recombination-rate-scores-by-parameters.png)
Representative MDS embeddings with all components from simulated populations by pathogen (rows) with genomes colored by generation using embeddings with optimal parameter values (simulated-populations-representative-mds-embeddings.png).
Late H3N2 HA embeddings (2018-2020) colored by Nextstrain clade (flu-2018-2020-ha-embeddings-by-clade.png)
Late H3N2 HA MDS embeddings (2018-2020) with all components colored by Nextstrain clade (flu-2018-2020-mds-by-clade.png)
Paired seasonal influenza HA-only and HA/NA embeddings (2016-2018) colored by TreeKnit MCCs except for the root MCC (figures/flu-2016-2018-ha-only-vs-ha-na-embeddings-by-mcc.png) and annotated by normalized VI between TreeKnit and embedding clusters for HA-only and HA/NA to indicate accuracy of clusters for out-of-sample data compared to expert clade assignment (TreeKnit Maximally Compatible Clades).
SARS-CoV-2 embeddings (2020-2021) colored by Nextclade pango lineage (figures/sarscov2-embeddings-by-Nextclade_pango-clade.png)
SARS-CoV-2 (2020-2021) embeddings colored by embedding cluster (figures/sarscov2-embeddings-by-cluster-vs-Nextclade_pango.png) and annotated by normalized VI to indicate accuracy of clusters for training data compared to expert clade assignment (collapsed Nextclade pango lineage).
SARS-CoV-2 (2022-2023) embeddings colored by embedding cluster (figures/sarscov2-test-embeddings-by-cluster-vs-Nextclade_pango.png) and annotated by normalized VI to indicate accuracy of clusters for out-of-sample data compared to expert clade assignment (collapsed Nextclade pango lineage).
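Several of the figures above are annotated with a normalized VI (variation of information) between embedding clusters and expert assignments. As a reference for reviewers, here is a minimal sketch of one way to compute that value. The function name `normalized_vi` is hypothetical, and dividing by log(n) (the upper bound of VI for n samples) is an assumption; the analysis code for the paper may normalize differently.

```python
from collections import Counter
from math import log


def normalized_vi(labels_a, labels_b):
    """Variation of information between two labelings, scaled to [0, 1].

    Assumption: the distance is normalized by log(n), the maximum possible
    VI for n samples. Lower values mean the two labelings agree more closely.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both labelings must cover the same samples.")

    n = len(labels_a)
    if n < 2:
        # With fewer than two samples there is nothing to disagree about.
        return 0.0

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    # VI = H(A|B) + H(B|A)
    #    = -sum over (a, b) of p(a, b) * [log(p(a, b) / p(a)) + log(p(a, b) / p(b))]
    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        vi -= p_ab * (log(n_ab / count_a[a]) + log(n_ab / count_b[b]))

    return vi / log(n)
```

For example, `normalized_vi(clade_labels, cluster_labels)` returns 0 when the embedding clusters reproduce the expert groups exactly (regardless of how the clusters are named) and moves toward 1 as the two groupings become unrelated.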
Tables
Main
Optimal embedding parameters by pathogen and method based on simulated populations with median MAE (simulations/summary_scores_by_virus_reassortment_rate_and_method.csv)
Cluster accuracy by pathogen, method, and expert group type (e.g., Nextstrain clade, Nextclade pango)
Supplemental
Concatenated full table of grid search results for both population types (simulations/influenza-like/no-reassortment/gridsearch.csv and simulations/coronavirus-like/moderate-recombination-rate/gridsearch.csv)
Cluster-specific mutations per pathogen dataset and embedding method
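The concatenated grid search table listed under the supplemental tables could be assembled along these lines. This is only a sketch using the standard library: the helper `concatenate_tables`, the added `population_type` column, and the column names in the usage example are hypothetical and are not taken from the actual gridsearch.csv files.

```python
import csv
import io


def concatenate_tables(sources):
    """Stack CSV tables and label each row with the table it came from.

    `sources` is a list of (label, file-like object) pairs. Returns the
    combined column names and a list of row dictionaries. The
    `population_type` column is a hypothetical addition for illustration.
    """
    fieldnames = ["population_type"]
    rows = []
    for label, handle in sources:
        reader = csv.DictReader(handle)
        # Keep the union of columns in first-seen order.
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        for row in reader:
            rows.append({"population_type": label, **row})
    return fieldnames, rows


def write_table(fieldnames, rows, handle):
    """Write the combined rows as CSV, leaving missing columns empty."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
```

A usage example with in-memory stand-ins for the two grid search files (the `method` and `mae` columns are made up for illustration):

```python
flu = io.StringIO("method,mae\npca,1.5\n")
cov = io.StringIO("method,mae\nmds,2.0\n")
fieldnames, rows = concatenate_tables(
    [("influenza-like", flu), ("coronavirus-like", cov)]
)
output = io.StringIO()
write_table(fieldnames, rows, output)
```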