We want to report how well each method scales with the input data size, for example, how well t-SNE scales to 10,000s or millions of SARS-CoV-2 genomes.
Computational complexity
Based on their algorithmic design, we know the following about how these methods scale:
t-SNE has O(n^2) computational complexity (van der Maaten and Hinton 2008), but tree-based heuristics (Yang et al. 2013; van der Maaten 2014) reduce that complexity to O(n log n). These algorithms are implemented in openTSNE (Poličar et al. 2019) and FIt-SNE (described in Linderman et al. 2019, Nature Methods). The scikit-learn Barnes-Hut implementation should also be O(n log n). The openTSNE module additionally supports mapping new data into an existing reference embedding (Poličar et al. 2023), which could improve the scalability of this method.
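As a concrete sketch of the Barnes-Hut implementation mentioned above, here is a minimal scikit-learn example (assuming scikit-learn and NumPy are available; the data are random stand-ins, not our genetic distances):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))  # 100 hypothetical samples, 20 features

# method="barnes_hut" selects the O(n log n) tree-based approximation;
# method="exact" would use the original O(n^2) algorithm instead.
tsne = TSNE(n_components=2, method="barnes_hut", perplexity=10, random_state=42)
embedding = tsne.fit_transform(X)
print(embedding.shape)  # (100, 2)
```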
MDS has O(n^3) complexity (Yang et al. 2006), but heuristics similar to those described above for t-SNE can reduce this to O(n log n).
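The O(n^3) term comes from the eigendecomposition at the heart of classical MDS. A minimal NumPy sketch of that step, on a toy Euclidean distance matrix rather than our actual pathogen distances:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))  # toy data with known coordinates
D = np.linalg.norm(points[:, None] - points[None, :], axis=-1)  # pairwise distances

# Classical MDS: double-center the squared distances, then eigendecompose.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
B = -0.5 * J @ (D ** 2) @ J               # Gram matrix; eigh here is the O(n^3) step
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1][:3]     # keep the top 3 components
coords = eigvecs[:, order] * np.sqrt(eigvals[order])

# For Euclidean input, the recovered coordinates reproduce the distances
# exactly (up to rotation and reflection).
D_hat = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(np.allclose(D, D_hat))  # True
```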
PCA also requires an eigendecomposition (or SVD) of an N x P matrix, so its complexity depends on those dimensions: O(n^3) or O(p^3) (Jolliffe and Cadima 2016). The same reference notes the known "brittleness" of PCA to outliers, such as gaps in our alignment inputs, and describes "robust PCA" as an alternative.
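A sketch of why both dimensions matter: PCA can be computed from the SVD of the centered N x P matrix, whose cost is governed by the smaller dimension (again with random stand-in data, not alignment inputs):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))  # N=50 hypothetical samples, P=8 features

# SVD of the centered data matrix costs roughly O(n * p * min(n, p)),
# which reduces to the O(n^3) or O(p^3) behavior noted above when one
# dimension dominates the other.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S  # principal component scores, shape (N, min(N, P))

# Explained variance per component, as PCA would report it.
explained = S ** 2 / (X.shape[0] - 1)
print(scores.shape)  # (50, 8)
```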
Regardless of how the embedding methods themselves scale, the pairwise genetic distance calculation is always O(n^2), which places an O(n^2) lower bound on the total runtime for t-SNE and UMAP.
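For intuition, a minimal pure-Python sketch of that pairwise distance step (Hamming distance over aligned sequences; the sequences here are made up, not real genomes):

```python
from itertools import combinations

# Toy aligned sequences; real inputs would be full SARS-CoV-2 alignments.
sequences = ["ACGT", "ACGA", "TCGT", "ACTT"]

def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

# All unordered pairs: n * (n - 1) / 2 comparisons, hence O(n^2) overall.
distances = {(i, j): hamming(sequences[i], sequences[j])
             for i, j in combinations(range(len(sequences)), 2)}
print(len(distances))  # 6 pairs for n=4
```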
We can update the discussion to reflect these known complexities of the different algorithms.
Demonstration with real data
Ideally, we want to measure the run time and size of data files.
The full H3N2 HA dataset from NCBI contains only N=8,934 HA sequences after deduplication, which is not enough to test scaling into the tens of thousands. We could use GISAID data to scale up to more sequences, but I'd prefer to stick with completely open data for this project.
Another option would be to download Nextstrain's 100,000-sample dataset of SARS-CoV-2 data from GenBank and downsample it to 2,500, 5,000, 10,000, and 20,000 samples. These data are fully open, like the rest of the data in this project, and include enough samples to be useful for this kind of test. The simplest approach I can imagine for implementing this analysis would be to create a parallel Snakemake workflow directory like sars-cov-2-nextstrain-scaling-by-sample with a workflow that does the following:
Download the 100k metadata and sequences linked above
For each sample size N in the list of [2500, 5000, 10000, 20000]
Randomly sample N records from the metadata and sequences
Align sequences to the SARS-CoV-2 reference with Nextclade
Run pathogen-distance on alignment
Run pathogen-embed for all four methods
Run pathogen-cluster for all four method embeddings
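The random sampling step above could be sketched with the standard library (the record structure here is hypothetical; the real workflow would read the 100k metadata TSV and sequences):

```python
import random

# Stand-in metadata records; the real workflow would parse the 100k TSV.
records = [{"strain": f"sample_{i}"} for i in range(100000)]

def downsample(records, n, seed=0):
    """Randomly sample n records without replacement, seeded for reproducibility."""
    rng = random.Random(seed)
    return rng.sample(records, n)

for n in [2500, 5000, 10000, 20000]:
    subset = downsample(records, n)
    print(n, len(subset))
```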
For the align, distance, embed, and cluster rules in this workflow, define a benchmark directive with wildcards based on the number of samples. The penultimate rule should depend on the pathogen-cluster output and parse the benchmark files to get runtime and memory use per step and number of samples. The final rule should plot runtime and memory use in separate adjacent panels of the same figure, with N samples on the x-axis, runtime/memory use on the y-axis, and the workflow step shown by color.
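Parsing the benchmark files could look like the sketch below. The column names "s" (runtime in seconds) and "max_rss" (peak memory in MB) are standard Snakemake benchmark output; the file path naming is hypothetical:

```python
import csv
import io

# A synthetic benchmark file in Snakemake's TSV format; real files would be
# read from paths like benchmarks/embed_n10000.txt.
benchmark_text = "s\th:m:s\tmax_rss\n123.4\t0:02:03\t512.0\n"

def parse_benchmark(text):
    """Return (runtime_seconds, max_rss_mb) from one benchmark file's contents."""
    row = next(csv.DictReader(io.StringIO(text), delimiter="\t"))
    return float(row["s"]), float(row["max_rss"])

runtime, memory = parse_benchmark(benchmark_text)
print(runtime, memory)  # 123.4 512.0
```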
References to add for complexity description
@inproceedings{Yang2006,
title={A fast approximation to multidimensional scaling},
author={Yang, Tynia and Liu, Jinze and McMillan, Leonard and Wang, Wei},
booktitle={IEEE workshop on computation intensive methods for computer vision},
year={2006}
}
@InProceedings{Yang2013,
title = {Scalable Optimization of Neighbor Embedding for Visualization},
author = {Yang, Zhirong and Peltonen, Jaakko and Kaski, Samuel},
booktitle = {Proceedings of the 30th International Conference on Machine Learning},
pages = {127--135},
year = {2013},
editor = {Dasgupta, Sanjoy and McAllester, David},
volume = {28},
number = {2},
series = {Proceedings of Machine Learning Research},
address = {Atlanta, Georgia, USA},
month = {17--19 Jun},
publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v28/yang13b.pdf},
url = {https://proceedings.mlr.press/v28/yang13b.html}
}
@article{vandermaaten2014,
author = {Laurens van der Maaten},
title = {Accelerating t-SNE using Tree-Based Algorithms},
journal = {Journal of Machine Learning Research},
year = {2014},
volume = {15},
number = {93},
pages = {3221--3245},
url = {http://jmlr.org/papers/v15/vandermaaten14a.html}
}
@article{Policar2019,
author = {Poli{\v c}ar, Pavlin G. and Stra{\v z}ar, Martin and Zupan, Bla{\v z}},
title = {openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding},
year = {2019},
doi = {10.1101/731877},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2019/08/13/731877},
eprint = {https://www.biorxiv.org/content/early/2019/08/13/731877.full.pdf},
journal = {bioRxiv}
}
@Article{Policar2023,
author={Poli{\v{c}}ar, Pavlin G.
and Stra{\v{z}}ar, Martin
and Zupan, Bla{\v{z}}},
title={Embedding to reference t-SNE space addresses batch effects in single-cell classification},
journal={Machine Learning},
year={2023},
month={Feb},
day={01},
volume={112},
number={2},
pages={721-740},
issn={1573-0565},
doi={10.1007/s10994-021-06043-1},
url={https://doi.org/10.1007/s10994-021-06043-1}
}