ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
254 stars 33 forks source link

Chao-1 analysis #146

Closed rcedgar closed 4 years ago

rcedgar commented 4 years ago

How many novel Cornoavirus species in nature remain to be discovered? This is a fundamental biological question that we are uniquely positioned to answer. Estimating the number of things that nobody has seen seems silly at first, but there are well-established methods for doing this.

Consider two similar problems: counting parrot species in a rain forest and counting dinosaur species from fossils. In traditional biodiversity studies, it is common that only a limited fraction of species are identified in a given ecosystem, and that the counting has strong observational biases. E.g., if an ornithologist counts parrot species in a jungle, s/he will count more brightly colored species who forage low in the ground and will miss well-camouflaged species who live in the canopy. Dinosaur fossils are biased towards species that live in or near water because fossils form more easily in silt than on dry ground (that's why there are so few human fossils). In situations like this, Chao-1 is used to estimate the total number of species.

Chao-1 is a "non-parametric alpha diversity estimator", meaning that it extrapolates (estimator) from the number of observed species (alpha diversity) to the total number of species with essentially no assumptions (non-parametric) about the sampling. The formula is simple: N_total = N_obs + N_1^2/(2 N_2), where N_1 is the number of singleton species (species observed exactly once) and N_2 is the number of doublet species (observed exactly twice).

Non-parametric estimators

When I saw this formula, I was very skeptical because it seemed obvious that you cannot get a good extrapolation without strong assumptions about the shape of the distribution. I worked on this problem for a while with the help of my Danish mathematician friend Henrik Flyvbjerg, and in the end convinced myself that it works well in practice, though it still seems too good to be true. It is well established in the biodiversity and microbial metagenomics literature, so with a bit of luck it won't be controversial with referees.

To a Chao-1 analysis, the starting point is the abundance distribution of Coronavirus species found by Serratus. Abundance is calculated by counting datasets, not reads. For each species, I need to know how many different datasets that species was found in. As a sanity check, this should include species that were found in zero datasets. If Chao-1 predicts fewer unseen species than this, the method has a problem.

This is not as simple as it first appears, because many known Cov sequences belong to species which have not been named by taxonomists. For example, KY770860 is a Bat Cov complete genome which has no species name -- it's "unclassified Coronavirinae" where Coronavirinae is a subfamily, so it doesn't have a genus, sub-genus or species name. Strictly speaking, a sequence like this does not belong to a species at all because viral species are arbitrary categories assigned by humans, there is no objective way to tell if two viruses are in the same species or not.

The problem of missing species names also occurs in bacterial (16S) and fungal (ITS) amplicon metagenomics, where most sequences belong to unnamed species. It is addressed by clustering sequences into "Operational Taxonomic Units" which are designed to be roughly equivalent to species. With 16S, the conventional clustering threhsold is 97% which is (wrongly) believed to be a good approximation to species.

We can do something similar. With full-length genomes, 92% clustering corresponds roughly to species. I used this method to assign all full-length genomes to 190 OTUs in this tsv file: s3://serratus-public/rce/otus/otus92.tsv. There are two fields: 1. OTU 2. accession, e.g.:

Otu1       MG772808
Otu1       MK334045
Otu1       NC_005831
Otu2       AF220295
Otu2       FJ415324

Each SRA should be assigned to an OTU as follows. If the top Cov hit is a full-length genome with an accession in otus92.tsv, assign it to that OTU. Otherwise, it does not contain a Cov species for the purposes of this analysis. Species that are known only from amplicon fragments are considered not yet identified.

rcedgar commented 4 years ago

Updated otus92.tsv has been posted. Update to method: we should set thresholds for a reliable species detection, suggest summarizer score >= 80 and pctid >= 95 to start.

rcedgar commented 4 years ago

First round analysis done. Thanks to @gherman for making the OTU abundance table. Preston plot shows singleton bin is much bigger than the others. This is good news / bad news. If the singleton bin was smaller than the doublet bin or comparable, we could use Chao1 to estimate unseen species, but this plot shows that Chao1 is unreliable. Goodish news is that the best inference from this plot is that there are many more species to find, though it's not something you'd want to bet heavily on. So unfortunately this is a bit inconclusive, maybe it will look better when we have more data. Closing this issue.

ababaian commented 4 years ago

How do you go from OTU abundance to Chao-1?

rcedgar commented 4 years ago

The Chao-1 formula is trivial see this web page. Using Preston plots to figure out if Chao-1 is meaningful is not trivial, see this paper by Edgar and Flyvbjerg.