blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Outline tables and figures #60

Closed huddlej closed 8 months ago

huddlej commented 1 year ago

This issue serves as a working outline of the tables and figures we want to include in the paper as main and supplemental items.

Figures

Main

  1. Representative embeddings from simulated populations by pathogen (rows) and method (columns) with genomes colored by generation using embeddings with optimal parameter values (figures/simulated-populations-representative-embeddings.png built from notebooks/2023-03-24-plot-best-simulation-embeddings.ipynb).
  2. Seasonal influenza HA embeddings with 2016-2018 data colored by clade (figures/flu-2016-2018-ha-embeddings-by-clade.png)
  3. Seasonal influenza HA (2016-2018) Euclidean embedding distance per method by genetic distance (figures/flu-2016-2018-ha-euclidean-distance-by-genetic-distance.png)
  4. Seasonal influenza HA (2016-2018) embeddings colored by embedding cluster (figures/flu-2016-2018-ha-embeddings-by-cluster.png) and annotated by normalized VI to indicate accuracy of clusters for training data compared to expert clade assignment.
  5. Seasonal influenza HA (2018-2020) embeddings colored by embedding cluster (figures/flu-2018-2020-ha-embeddings-by-cluster.png) and annotated by normalized VI to indicate accuracy of clusters for out-of-sample data compared to expert clade assignment.
  6. Seasonal influenza HA and NA embeddings (2016-2018) colored by TreeKnit MCCs except for the root MCC (figures/flu-2016-2018-ha-na-embeddings-by-mcc.png) and annotated by normalized VI between TreeKnit and embedding clusters for HA-only and HA/NA to indicate accuracy of clusters for out-of-sample data compared to expert clade assignment (TreeKnit Maximally Compatible Clades).
  7. SARS-CoV-2 embeddings (2020-2021) colored by Nextstrain clade (figures/sarscov2-embeddings-by-Nextstrain_clade-clade.png)
  8. SARS-CoV-2 (2020-2021) Euclidean embedding distance per method by genetic distance (figures/sarscov2-euclidean-distance-by-genetic-distance.png)
  9. SARS-CoV-2 (2020-2021) embeddings colored by embedding cluster (figures/sarscov2-embeddings-by-cluster-vs-Nextstrain_clade.png) and annotated by normalized VI to indicate accuracy of clusters for training data compared to expert clade assignment (Nextstrain clade).
  10. SARS-CoV-2 (2022-2023) embeddings colored by embedding cluster (figures/sarscov2-test-embeddings-by-cluster-vs-Nextstrain_clade.png) and annotated by normalized VI to indicate accuracy of clusters for out-of-sample data compared to expert clade assignment (Nextstrain clade).
  11. Within and between group genetic distances by pathogen dataset and group type.
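
Figures 3 and 8 compare Euclidean distance in each embedding against genetic distance for the same pairs of genomes. A minimal sketch of how those paired values could be assembled is below. It assumes aligned, equal-length sequences and uses a plain Hamming distance; the function names, the input dictionaries, and the decision to count every mismatch (including gaps and ambiguous bases) are illustrative assumptions, not the project's actual implementation.

```python
from itertools import combinations
from math import dist


def hamming_distance(seq_a, seq_b):
    """Count mismatched positions between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must be aligned to the same length.")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def paired_distances(sequences, embedding):
    """Return (genetic distance, Euclidean embedding distance) for every pair of strains.

    Both arguments are hypothetical inputs: dictionaries keyed by strain name,
    holding an aligned sequence and an embedding coordinate tuple respectively.
    """
    pairs = []
    for strain_a, strain_b in combinations(sorted(sequences), 2):
        genetic = hamming_distance(sequences[strain_a], sequences[strain_b])
        euclidean = dist(embedding[strain_a], embedding[strain_b])
        pairs.append((genetic, euclidean))
    return pairs
```

Each returned tuple is one point in a distance-by-distance scatterplot, so the same list could feed a per-method panel.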

Supplemental

  1. MAEs from grid search of simulated influenza-like populations by method (figures/simulated-influenza-like-with-no-reassortment-scores-by-parameters.png)
  2. MAEs from grid search of simulated coronavirus-like populations by method (figures/simulated-coronavirus-like-with-moderate-recombination-rate-scores-by-parameters.png)
  3. Representative MDS embeddings with all components from simulated populations by pathogen (rows) with genomes colored by generation using embeddings with optimal parameter values (simulated-populations-representative-mds-embeddings.png).
  4. Late H3N2 HA embeddings (2018-2020) colored by Nextstrain clade (flu-2018-2020-ha-embeddings-by-clade.png)
  5. Late H3N2 HA MDS embeddings (2018-2020) with all components colored by Nextstrain clade (flu-2018-2020-mds-by-clade.png)
  6. Paired seasonal influenza HA-only and HA/NA embeddings (2016-2018) colored by TreeKnit MCCs except for the root MCC (figures/flu-2016-2018-ha-only-vs-ha-na-embeddings-by-mcc.png) and annotated by normalized VI between TreeKnit and embedding clusters for HA-only and HA/NA to indicate accuracy of clusters for out-of-sample data compared to expert clade assignment (TreeKnit Maximally Compatible Clades).
  7. SARS-CoV-2 embeddings (2020-2021) colored by Nextclade pango lineage (figures/sarscov2-embeddings-by-Nextclade_pango-clade.png)
  8. SARS-CoV-2 (2020-2021) embeddings colored by embedding cluster (figures/sarscov2-embeddings-by-cluster-vs-Nextclade_pango.png) and annotated by normalized VI to indicate accuracy of clusters for training data compared to expert clade assignment (collapsed Nextclade pango lineage).
  9. SARS-CoV-2 (2022-2023) embeddings colored by embedding cluster (figures/sarscov2-test-embeddings-by-cluster-vs-Nextclade_pango.png) and annotated by normalized VI to indicate accuracy of clusters for out-of-sample data compared to expert clade assignment (collapsed Nextclade pango lineage).
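
Several of the figures above are annotated with a normalized VI (variation of information) between embedding clusters and expert assignments. As a rough sketch of what that annotation measures, the function below computes VI as the sum of the two conditional entropies and divides by log(n), the upper bound for n samples, so that 0 means identical partitions and 1 means maximally different ones. The choice of log(n) as the normalizer and the function name are assumptions made for illustration; the project may normalize differently.

```python
from collections import Counter
from math import log


def normalized_vi(labels_a, labels_b):
    """Variation of information between two labelings, scaled to [0, 1] by log(n)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both labelings must cover the same samples.")

    n = len(labels_a)
    if n <= 1:
        return 0.0

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    # VI = H(A|B) + H(B|A), summed over label pairs that actually occur.
    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))

    return vi / log(n)
```

For example, `normalized_vi(cluster_labels, clade_labels)` would take one cluster label and one clade label per genome, in the same order; label names do not need to match between the two lists.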

Tables

Main

  1. Optimal embedding parameters by pathogen and method, selected by median MAE across simulated populations (simulations/summary_scores_by_virus_reassortment_rate_and_method.csv)
  2. Cluster accuracy by pathogen, method, and expert group type (e.g., Nextstrain clade, Nextclade pango)
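
The first table summarizes grid search results by median MAE. One way that selection could work is sketched below: group the per-replicate MAE values by method and parameter setting, take the median of each group, and keep the setting with the lowest median for each method. The record keys (`method`, `parameters`, `mae`) are hypothetical and are not taken from the actual gridsearch.csv columns.

```python
from collections import defaultdict
from statistics import median


def best_parameters_by_method(records):
    """Pick the parameter setting with the lowest median MAE for each method.

    `records` is an iterable of dicts with hypothetical keys "method",
    "parameters" (any hashable description of a setting), and "mae".
    """
    maes = defaultdict(list)
    for record in records:
        maes[(record["method"], record["parameters"])].append(record["mae"])

    best = {}
    for (method, parameters), values in maes.items():
        median_mae = median(values)
        if method not in best or median_mae < best[method][1]:
            best[method] = (parameters, median_mae)

    return best
```

Running this once per pathogen would give one row per pathogen and method, which matches the layout described for the table.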

Supplemental

  1. Concatenated full table of grid search results for both population types (simulations/influenza-like/no-reassortment/gridsearch.csv and simulations/coronavirus-like/moderate-recombination-rate/gridsearch.csv)
  2. Cluster-specific mutations per pathogen dataset and embedding method