dib-lab / 2020-paper-sourmash-gather

Here we describe an extension of MinHash that permits accurate compositional analysis of metagenomes with low memory and disk requirements.
https://dib-lab.github.io/2020-paper-sourmash-gather

leftover text from discussion #9

Open ctb opened 3 years ago

ctb commented 3 years ago

leftover text from discussion:

Other containment estimation methods, such as CMash [@koslicki_improving_2019] and mash screen [@ondov_mash_2019], can also implement gather.

Running a search requires access to the original query dataset (_mash screen_) or to a Bloom Filter derived from it (_CMash_). When the collection of reference sketches is updated, the Bloom Filter from _CMash_ can be reused, but _mash screen_ needs access to the original dataset again.

---

(Maybe already covered above, or maybe should be moved to "gather can be applied to all the data"?)

Since _Scaled MinHash_ sketches can be used directly for `gather`, are a fraction of the original data in size, and also allow enumerating all of their elements (an operation not possible with Bloom Filters), they can be stored and reused for large collections of sequencing datasets, including public databases like the Sequence Read Archive [@leinonen_sequence_2011]. A service that calculates these _Scaled MinHash_ sketches and makes them available could improve the discoverability of these large collections, as well as support future use cases derived from other _Scaled MinHash_ features.

---

Compared to previous taxonomic profiling methods, _Scaled MinHash_ can also be seen as a mix of two other approaches: it uses exact $k$-mer matching and assignment, and the $k$-mers selected by the MinHashing process are equivalent to implicitly-defined markers. It differs from previous approaches because only a subset of the $k$-mer composition is used for matching, and because traditional gene markers are explicitly chosen for their sequence conservation and low mutation rates, while MinHashing generates a set of marker $k$-mers that is randomized but consistent across datasets.

---

Despite improvements to the standardization and reproducibility of previous analyses, benchmarking taxonomic profiling tools is still challenging, since tools can generate their reference databases from multiple sources, and choosing only one source can bias the evaluation or make it impossible to evaluate the tools properly. This is especially true for real metagenomic datasets derived from samples collected from soil and marine environments, where the number of unknown organisms is frequently larger than the number contained in reference databases. With the advent of metagenome-assembled genomes (MAGs) there are more resources available for use as reference datasets, even if they are usually incomplete or of draft quality. `sourmash` is well positioned to include these new references for taxonomic profiling, given its minimal requirements (a _Scaled MinHash_ sketch of the original dataset) and its support for indexing hundreds of thousands of datasets.

---

In this chapter `gather` is described in terms of taxonomic profiling of metagenomes. That is one application of the algorithm, but it can be applied to other biological problems too. If the query is a genome instead of a metagenome, `gather` can be used to detect possible contamination in the assembled genome, by searching a collection of genomes with the query genome removed from it (if it is present). This finds matches that contain parts of the query genome and allows evaluating whether they agree at a specific taxonomic rank; a large divergence (two different phyla among the matches, for example) likely indicates that the query genome contains sequences from different organisms. This is especially useful for quality control and validation of metagenome-assembled genomes (MAGs), genomes assembled from reads binned and clustered from metagenomes, as well as for verification during submission of newly assembled genomes to public genomic databases like GenBank. A minimal sketch of this check follows below.
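Here is a minimal sketch of the contamination check just described, using plain Python sets of $k$-mer hashes in place of real _Scaled MinHash_ sketches. The inputs `reference_sketches` and `taxonomy` are hypothetical stand-ins, not the sourmash API:

```python
def check_contamination(query_name, query_hashes, reference_sketches, taxonomy,
                        min_containment=0.1):
    """Report the phyla of references that contain part of the query.

    `reference_sketches` maps genome name -> set of k-mer hashes;
    `taxonomy` maps genome name -> phylum. More than one phylum among
    the matches suggests the query may mix sequences from different
    organisms.
    """
    phyla = set()
    for name, ref_hashes in reference_sketches.items():
        if name == query_name:  # remove the query genome from the collection
            continue
        containment = len(query_hashes & ref_hashes) / len(query_hashes)
        if containment >= min_containment:
            phyla.add(taxonomy[name])
    return phyla
```

If the returned set contains more than one phylum, the assembled genome is a candidate for manual review before submission.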
Improvements to the calculation of _Scaled MinHash_ sketches can also improve the taxonomic profiling use case. Exact $k$-mer matching is limited across phylogenetically distant organisms, since small nucleotide differences lead to distinct $k$-mers, breaking homology assumptions. Approaches for converting the datasets into a set to be hashed (_shingling_) other than computing the nucleotide $k$-mer composition, such as spaced $k$-mers [@leimeister_fast_2014] and minimizers [@roberts_reducing_2004], as well as alternative encodings of the nucleotides using 6-frame translation to amino acids [@gish_identification_1993] or other reduced alphabets [@peterson_reduced_2009], can allow comparisons over longer evolutionary distances and so improve taxonomic profiling by increasing the sensitivity of the containment estimation. These improvements don't fundamentally change the `gather` method, since it would still be based on the same *containment* and *remove element* operations (sketched below), but they show how `gather` works as a more general method that can leverage characteristics of different building blocks and explore new or improved use cases.

---

`gather` is a new method for decomposing datasets into their components, with applications in biological sequencing data analysis (taxonomic profiling), that can scale to hundreds of thousands of reference datasets with computational resource requirements accessible to a large number of users when used in conjunction with _Scaled MinHash_ sketches and efficient indices such as the _LCA index_ and the _MHBT_. It outperforms current methods in community-developed benchmarks, and it opens the way for new methods that explore a top-down approach for profiling microbial communities, including further refinements that can resolve larger evolutionary distances and also speed up the method computationally.

---

_Scaled MinHash_ sketches are fundamentally a subset of the $k$-mer composition of a dataset, and so any of the techniques described in [@marchet_data_2019] are potential candidates for improving current indices or implementing new ones. The MHBT index can be improved by using more efficient representations for the internal nodes [@solomon_improved_2017] and by constructing the MHBT by clustering [@harris_improved_2018]; the LCA index can store its lists of signature IDs more efficiently by representing them as colors [@pandey_mantis:_2018]. The memory consumption of the LCA index can also be tackled by implementing it in external memory using memory-mapped files, letting the operating system cache and unload pages as needed. Current indices are also single-threaded and don't benefit from multicore systems. Both indices can be used in parallel by loading them as read-only and sharing them across multiple searches, but it is also possible to explore parallelization for single queries: partitioning the LCA index and assigning each partition to a thread, or using a work-stealing thread pool to expand the search frontier in the MHBT in parallel. In any case, the current implementations serve as a baseline for future scalability and can be used to guide optimization while avoiding extraneous overhead and common failings of such projects [@mcsherry_scalability_2015].

---

"online" approaches: `sourmash gather`, the command-line interface that adds further user experience improvements on top of the API, also allows passing multiple indices to be searched, providing incremental support for rerunning with additional data without having to recompute, merge, or update the original databases.
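To make the *containment* and *remove element* loop concrete, here is a minimal sketch of the greedy `gather` decomposition, modeling sketches as plain Python sets of $k$-mer hashes. `references` is a hypothetical mapping from dataset name to hash set, not the sourmash API:

```python
def gather(query_hashes, references, threshold=1):
    """Greedily decompose the query into reference components."""
    remaining = set(query_hashes)
    results = []
    while remaining:
        # containment search: find the reference with the largest overlap
        best_name, best_overlap = None, set()
        for name, ref_hashes in references.items():
            overlap = remaining & ref_hashes
            if len(overlap) > len(best_overlap):
                best_name, best_overlap = name, overlap
        if len(best_overlap) < threshold:
            break  # nothing left above the reporting threshold
        results.append((best_name, len(best_overlap)))
        remaining -= best_overlap  # remove elements already explained
    return results
```

Any shingling or encoding scheme that produces a set of hashes, whether spaced $k$-mers, minimizers, or translated $k$-mers, can feed this same loop, which is why the improvements above don't change the method itself.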
Some limitations of `gather` and database types: equal results can be hard to detect efficiently with the current SBT implementation.

The Linear index is appropriate for operations that must check every signature, since it doesn't have any indexing overhead; an example is building a distance matrix for comparing signatures all-against-all. Search operations greatly benefit from extra indexing structure. The MHBT index, and $k$-mer aggregative methods in general, are appropriate for searches with query thresholds, like searching for similarity or containment of a query in a collection of datasets. The LCA index, and color aggregative methods, are appropriate for querying which datasets contain a specific query $k$-mer. As implemented in sourmash, the MHBT index is more memory efficient because the data can stay in external memory and only the tree structure for the index needs to be loaded into main memory, with data for the datasets and internal nodes loaded and unloaded on demand. The LCA index must be loaded into main memory before it can be used, but once loaded it is faster, especially for operations that need to summarize $k$-mer assignments or require repeated searches.

---

Due to these characteristics, if memory usage is not a concern then the LCA index is the most appropriate choice, since it is faster. The MHBT index is currently recommended for situations where memory is limited, such as with smaller scaled values ($s \le 2000$) that increase the size of signatures, or when there is a large number (hundreds of thousands or more) of datasets to index.

### Converting between indices

Both the MHBT and the LCA index can recover the original sketch collection: the MHBT outputs all of its leaf nodes, while the LCA index reconstructs each sketch from the hash-to-dataset-ID mapping. This allows trade-offs between storage efficiency, distribution, updating, and query performance. Because both are able to return the original sketch collection, it is also possible to convert one index into the other (see the sketch at the end of this comment).

### gather Conclusion

_Scaled MinHash_ sketches allow scaling analyses to thousands of datasets, but efficiently searching and sharing them benefits from data structures that index and optimize these use cases. This chapter introduces an index abstraction that can be trivially implemented using a list of sketches (_Linear index_), with more advanced implementations based on inverted indices (_LCA index_) and hierarchical indices (_MHBT_) providing options for fast and memory-efficient operations, as well as making it easier to share and analyze collections of sketches. All these functionalities are implemented in sourmash.

---

These indices also serve as another set of building blocks for constructing more advanced methods that solve other relevant biological problems, like taxonomic profiling, described in Chapter [3](#chp-gather), and approaches for increasing the resilience and shareability of biological sequencing data, described in Chapter [5](#chp-decentralizing).
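Here is a minimal sketch of the index abstraction and of the conversion property described under "Converting between indices", again using plain Python sets as sketches; the class names are hypothetical stand-ins, not the sourmash API:

```python
from abc import ABC, abstractmethod

class Index(ABC):
    @abstractmethod
    def insert(self, name, hashes):
        """Add a named sketch to the index."""

    @abstractmethod
    def signatures(self):
        """Yield every (name, hash_set) stored in the index."""

    @classmethod
    def from_signatures(cls, sigs):
        # any index that can enumerate its sketches can be rebuilt
        # as any other index type
        idx = cls()
        for name, hashes in sigs:
            idx.insert(name, hashes)
        return idx

class LinearIndex(Index):
    """A plain list of sketches: no indexing overhead, checks every signature."""
    def __init__(self):
        self._sigs = []

    def insert(self, name, hashes):
        self._sigs.append((name, set(hashes)))

    def signatures(self):
        yield from self._sigs

    def search(self, query, threshold):
        # appropriate when every signature must be checked anyway
        for name, hashes in self._sigs:
            containment = len(query & hashes) / len(query)
            if containment >= threshold:
                yield name, containment

# Converting between (hypothetical) index types:
#   other = SomeOtherIndex.from_signatures(linear.signatures())
```

The same `from_signatures` path is how an MHBT could be rebuilt as an LCA index, or vice versa, at the cost of a full rebuild.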