blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License
3 stars 1 forks source link

Report how methods scale with data size #94

Closed huddlej closed 2 months ago

huddlej commented 2 months ago

Description

We want to report how well each method scales with the input data size, for example, how well t-SNE scales to 10,000s or millions of SARS-CoV-2 genomes.

Computational complexity

Regarding our understanding of how these methods scale from their algorithmic design we know that:

Regardless of the scaling of the different embedding methods, the pairwise genetic distance calculations will always be O(n^2) which imposes an upper-bound on t-SNE and UMAP.

We can update the discussion to reflect these known complexities of the different algorithms.

Demonstration with real data

Ideally, we want to measure the run time and size of data files.

The full H3N2 HA dataset from NCBI only has N=8934 HA sequences after deduplication which is not enough to test the scales of 10000s. We could also try using GISAID data to scale up to more sequences, but I'd prefer to stick with completely open data for this project.

Another option would be to download Nextstrain's 100,000-samples dataset for SARS-CoV-2 data from GenBank and downsample these data to 5000, 10000, and 20000 samples. These data are fully open like the rest of the data in this project and have enough samples to be useful for this kind of test. The simplest approach I could imagine to implement this analysis would be to create a parallel Snakemake workflow directory like sars-cov-2-nextstrain-scaling-by-sample with a workflow that does the following:

  1. Download the 100k metadata and sequences linked above
  2. For each sample size N in the list of [2500, 5000, 10000, 20000]
    1. Randomly sample N records from the metadata and sequences
    2. Align sequences to the SARS-CoV-2 reference with Nextclade
    3. Run pathogen-distance on alignment
    4. Run pathogen-embed for all four methods
    5. Run pathogen-cluster for all four method embeddings

For the align, distance, embed, and cluster rules in this workflow, define a benchmarks directive with wildcards based on the number of samples. The penultimate rule should depend on the pathogen-cluster output and parse the benchmarks files to get runtime and memory use per step and number of samples. The final rule should plot the runtime and memory use in separate adjacent panels of the same figure with N samples on the x-axis, the runtime/memory use on the y-axis, and the step of the workflow shown by color.

References to add for complexity description

@inproceedings{Yang2006,
  title={A fast approximation to multidimensional scaling},
  author={Yang, Tynia and Liu, Jinze and McMillan, Leonard and Wang, Wei},
  booktitle={IEEE workshop on computation intensive methods for computer vision},
  year={2006}
}

@InProceedings{Yang2013,
  title =    {Scalable Optimization of Neighbor Embedding for Visualization},
  author =   {Yang, Zhirong and Peltonen, Jaakko and Kaski, Samuel},
  booktitle =    {Proceedings of the 30th International Conference on Machine Learning},
  pages =    {127--135},
  year =     {2013},
  editor =   {Dasgupta, Sanjoy and McAllester, David},
  volume =   {28},
  number =       {2},
  series =   {Proceedings of Machine Learning Research},
  address =      {Atlanta, Georgia, USA},
  month =    {17--19 Jun},
  publisher =    {PMLR},
  pdf =      {http://proceedings.mlr.press/v28/yang13b.pdf},
  url =      {https://proceedings.mlr.press/v28/yang13b.html}
}

@article{vandermaaten2014,
  author  = {Laurens van der Maaten},
  title   = {Accelerating t-SNE using Tree-Based Algorithms},
  journal = {Journal of Machine Learning Research},
  year    = {2014},
  volume  = {15},
  number  = {93},
  pages   = {3221--3245},
  url     = {http://jmlr.org/papers/v15/vandermaaten14a.html}
}

@article {Policar2019,
    author = {Poli{\v c}ar, Pavlin G. and Stra{\v z}ar, Martin and Zupan, Bla{\v z}},
    title = {openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding},
    year = {2019},
    doi = {10.1101/731877},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2019/08/13/731877},
    eprint = {https://www.biorxiv.org/content/early/2019/08/13/731877.full.pdf},
    journal = {bioRxiv}
}

@Article{Policar2023,
author={Poli{\v{c}}ar, Pavlin G.
and Stra{\v{z}}ar, Martin
and Zupan, Bla{\v{z}}},
title={Embedding to reference t-SNE space addresses batch effects in single-cell classification},
journal={Machine Learning},
year={2023},
month={Feb},
day={01},
volume={112},
number={2},
pages={721-740},
issn={1573-0565},
doi={10.1007/s10994-021-06043-1},
url={https://doi.org/10.1007/s10994-021-06043-1}
}