dib-lab / 2020-paper-sourmash-gather

Here we describe an extension of MinHash that permits accurate compositional analysis of metagenomes with low memory and disk requirements.
https://dib-lab.github.io/2020-paper-sourmash-gather

leftover text from introduction #7

Open ctb opened 3 years ago

ctb commented 3 years ago

Leftover text:

Compositional data analysis is the study of the parts of a whole using relative abundances [@doi:10.1111/j.2517-6161.1982.tb01195.x]. This is a general problem with applications across many scientific fields [@aitchison_compositional_2005]; examples in biology include RNA-Seq [@quinn_field_2019-1], metatranscriptomics [@macklaim_rna-seq_2018], and microbiome and metagenomics studies [@li_microbiome_2015]. Taxonomic profiling is a particular instance of this general problem: the goal is to find the identity and relative abundance of microbial community elements at a specific taxonomic rank (species, genus, family), especially in metagenomic samples [@sczyrba_critical_2017].

Existing taxonomic profilers solve this problem in different ways: aligning sequences to a reference database [@huson_megan_2016]; using marker genes derived from known organisms in reference databases [@segata_metagenomic_2012], optionally coupled with unknown organisms clustered from metagenomes [@milanese_microbial_2019]; and exact $k$-mer matching, either with fixed $k$ and the lowest common ancestor (LCA) to resolve $k$-mers that match multiple taxa in a reference database [@wood_kraken:_2014], or with variable $k$ and multiple taxa assigned per sequence, with an option to reduce them further to the LCA [@kim_centrifuge_2016].
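
For reference, the LCA step mentioned above can be sketched roughly as follows. This is only a toy illustration, not how any of the cited tools implement it: the function name `lowest_common_ancestor`, the tuple-based lineage representation, and the four simplified ranks are all assumptions made for the example.

```python
# Toy sketch: a lineage is a tuple of names ordered from root to leaf.
# The ranks here (superkingdom, family, genus, species) are a simplified,
# hypothetical subset of a real taxonomy.

def lowest_common_ancestor(lineages):
    """Return the longest prefix shared by all lineages (empty tuple if none)."""
    if not lineages:
        return ()
    shared = []
    # zip(*lineages) walks the lineages rank by rank and stops at the shortest one.
    for names in zip(*lineages):
        if all(name == names[0] for name in names):
            shared.append(names[0])
        else:
            break
    return tuple(shared)


ecoli = ("Bacteria", "Enterobacteriaceae", "Escherichia", "Escherichia coli")
salmonella = ("Bacteria", "Enterobacteriaceae", "Salmonella", "Salmonella enterica")

# A k-mer found in both genomes can only be resolved to their shared family.
print(lowest_common_ancestor([ecoli, salmonella]))
```

A $k$-mer that matches only one lineage keeps its full resolution, while one shared across distant taxa collapses toward the root.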

Once each sequence (from raw reads or assembled contigs) has a taxonomic assignment, these methods resolve the final identity and abundance for each member of the community by summarizing the assignments to a specific taxonomic rank. Taxonomic profiling is fundamentally limited by the availability of reference datasets to be used for assignments, and reporting what percentage of the sample is unassigned is important to assess results, especially in undercharacterized environments such as oceans and soil.
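
A minimal sketch of that summarization step, including the unassigned fraction, might look like this. It assumes per-sequence assignments are already available as root-to-leaf tuples (or `None` when nothing matched); `RANKS`, `profile_at_rank`, and the toy lineages are hypothetical names for illustration, and real profilers typically also correct for genome or marker length, which is skipped here.

```python
from collections import Counter

# Simplified, hypothetical rank order used only for this example.
RANKS = ("superkingdom", "family", "genus", "species")


def profile_at_rank(assignments, rank):
    """Summarize per-sequence lineages into relative abundances at one rank.

    Sequences with no assignment, or with an assignment that is less specific
    than the requested rank, are counted under "unassigned".
    """
    idx = RANKS.index(rank)
    counts = Counter()
    for lineage in assignments:
        if lineage is None or len(lineage) <= idx:
            counts["unassigned"] += 1
        else:
            counts[lineage[idx]] += 1
    total = sum(counts.values())
    # Relative abundances sum to 1, as expected for compositional data.
    return {name: n / total for name, n in counts.items()}


ecoli = ("Bacteria", "Enterobacteriaceae", "Escherichia", "Escherichia coli")
salmonella = ("Bacteria", "Enterobacteriaceae", "Salmonella", "Salmonella enterica")
assignments = [ecoli, ecoli, salmonella, None]

print(profile_at_rank(assignments, "genus"))
```

Keeping "unassigned" as an explicit category means the reported abundances describe the whole sample rather than only the portion covered by the reference database.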