ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Where are the novel Covs? #140

Closed rcedgar closed 4 years ago

rcedgar commented 4 years ago

Why are almost all the Covs we're finding >97% identical to a known reference sequence? The species threshold is around 90%-92%, so everything above 97% certainly belongs to a known species and is very likely to be a known strain. @ababaian made a scatterplot from our early runs that showed a flat distribution of hits vs. identity, which we interpreted as a prediction that there would be many Franks and Gingers. What happened? Even if we assume a mistake in making the scatterplot prediction, it's still puzzling.

The most obvious explanations are (1) sampling bias, (2) a sensitivity problem, or (3) real biology. There is some sampling bias due to datasets where the host was known to be sick (e.g. deliberately infected), but there are plenty of non-human datasets where the virus is captured incidentally, so I don't believe (1). We know we have good sensitivity down to at least 85% identity, and probably to <~80% for conserved core regions of a virus with average identity <<85% overall, so (2) cannot explain an uninhabited desert from 85% to 97%. This leaves (3). If most Coronavirus strains in nature are already in Genbank, this would explain it, but I find it hard to believe that all Cov strains are known. For one thing, there are just too many different host species out there, and plenty of different Cov species per host, assuming human is nothing special. It's plausible that all human Covs are known, but surely not all chicken, bat and frog Covs. Possibly I'm making a dumb biology-for-physicists mistake here, in which case please educate me. Otherwise, I think this is an important question which we should address for the paper.

There are a couple of things I would like to see as a starting point. One is to redo the scatterplot analysis with more recent results. Instead of a scatterplot, I would suggest a histogram with the number of Cov detections (say, summarizer score > 50) per %id as reported by the summarizer, for each integer-rounded %id 80, 81, ... 99, 100.
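As a rough illustration of the histogram described above, here is a minimal Python sketch. It assumes the detections have already been parsed into `(score, pct_id)` pairs; that tuple layout and the function name `identity_histogram` are hypothetical and not part of the summarizer or Tantalus.

```python
from collections import Counter

def identity_histogram(hits, min_score=50, lo=80, hi=100):
    """Count detections per integer-rounded %id.

    hits: iterable of (score, pct_id) pairs (hypothetical layout,
    not the actual summarizer report format).
    Only detections with score > min_score are counted.
    """
    counts = Counter()
    for score, pct_id in hits:
        if score > min_score:
            bin_id = int(pct_id + 0.5)  # round half up for non-negative values
            if lo <= bin_id <= hi:
                counts[bin_id] += 1
    # Return every bin from lo to hi so empty bins ("deserts") are visible.
    return {b: counts[b] for b in range(lo, hi + 1)}

# Made-up example values, for illustration only.
example_hits = [(95, 99.2), (60, 97.6), (40, 88.0), (75, 85.4), (51, 79.9)]
hist = identity_histogram(example_hits)
for pct, n in hist.items():
    print(pct, n)
```

Returning the empty bins explicitly is deliberate: the question here is whether the 85-97% range is really empty, so zero counts need to show up in the output.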
Another is to do a [Chao-1 diversity analysis](https://palaeo-electronica.org/2011_1/238/estimate.htm) to predict the number of Cov species which are detectable by Serratus but not found yet. To do this, take each hit with >90% identity and score >90 and assign it to the species of the top reference sequence. Count the number of hits for each species. If someone makes a tsv file with (species,count) for each detected species, I can do the Chao-1. Are these analyses possible with Tantalus?

Edit to clarify murky explanation: here a "hit" is a dataset which contains virus species X, not an alignment of one read to a reference sequence of X. The tsv file could be one line for each SRA dataset with (SRA accession, Cov species name) which might be easier to generate; it's trivial to count how many times each species appears.
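For concreteness, the counting step and the estimate could look something like the Python sketch below. It assumes a two-column tsv of (SRA accession, Cov species name) as described above, and it uses the bias-corrected form of Chao-1, S_obs + F1(F1-1)/(2(F2+1)), where F1 and F2 are the numbers of species seen in exactly one and exactly two datasets. The function names and the example rows are hypothetical.

```python
import csv
import io
from collections import Counter

def species_counts(tsv_text):
    """Count how many distinct datasets contain each species.

    tsv_text: lines of "<SRA accession>\t<species name>" (assumed layout).
    A repeated (accession, species) pair is counted once.
    """
    seen = set()
    counts = Counter()
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        key = (row[0].strip(), row[1].strip())
        if key not in seen:
            seen.add(key)
            counts[key[1]] += 1
    return counts

def chao1(counts):
    """Bias-corrected Chao-1 richness estimate from per-species counts."""
    s_obs = sum(1 for c in counts.values() if c > 0)
    f1 = sum(1 for c in counts.values() if c == 1)  # singletons
    f2 = sum(1 for c in counts.values() if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Made-up accessions and species labels, for illustration only.
example = (
    "SRR1\tA\nSRR2\tA\nSRR3\tA\n"
    "SRR4\tB\nSRR5\tC\n"
    "SRR6\tD\nSRR7\tD\n"
    "SRR1\tA\n"  # duplicate row, ignored
)
counts = species_counts(example)
print(dict(counts), chao1(counts))
```

The bias-corrected form is used here because it stays defined when no species is seen in exactly two datasets; the classic form F1^2/(2 F2) would divide by zero in that case.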

ababaian commented 4 years ago

That's exactly the type of thing we can work towards yes! @fransilvion

taltman commented 4 years ago

What is Tantalus?

I happen to be in the midst of benchmarking several methods for accurate reference sequence count assignment from a set of reads. I can create a Snakemake workflow to do this for Serratus samples.

rcedgar commented 4 years ago

Tantalus = R packages for analyzing Serratus data. For the Chao-1 analysis, the abundance of species X is the number of SRA datasets which contain X, not the number of reads in a dataset. Edit -- re-reading my text above, I can see this wasn't clear, have edited in attempt to clarify. My bad.

taltman commented 4 years ago
ababaian commented 4 years ago

This is the output of Serratus ( https://github.com/ababaian/serratus/wiki/Summarizer-reports ), and Tantalus is the parsing and exploration of this data in R.

rcedgar commented 4 years ago

For the summarizer score, see this page for an explanation of how the summarizer predicts viruses: https://github.com/ababaian/serratus/wiki/Summarizer-reports.

This wiki page discusses multi-mapped reads and other issues with virus species detection. The Chao-1 analysis would start from summary reports, not BAMs.

taltman commented 4 years ago

So your question is predicated on an assumption that we have processed enough of the space of samples to be confident that we're not seeing the predicted numbers of divergent CoV genomes. I don't have any sense of how far we are at this point. Half way?

rcedgar commented 4 years ago

We don't have a prediction of how many divergent Cov genomes there are. But we have enough of a sampling to make a prediction of how many more divergent Cov genomes we would find by running more datasets and using the same detection methods, and I'd like to do that if it's easy to generate the abundance data. What bothers me more is the absence of Covs in the 85-97% range, which should be easily detected and I'm not seeing them. This could be my mistake because I'm doing very, very quick and very, very dirty checks, and as a first step I'd like to see what comes out of a more careful check by Tantalus. If that confirms the desert, then I think we have an important puzzle to solve.

rcedgar commented 4 years ago

Issue resolved. The answer is option (2) sensitivity problem. The "problem" is that we "only" found 371 Covs out of 300k datasets or whatever. That's not nearly enough sensitivity to get into a lot of novelty. Let's say only 1/1000 pangolins have Cov, then how many pangolin Covs are we going to find? Duh :unamused: So the real problem is the SRA is too small :grin:
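The arithmetic behind this can be sketched in a few lines of Python. It assumes each dataset independently contains the virus with the same probability (the 1/1000 prevalence above); the dataset counts in the loop are made-up illustrative numbers, not actual SRA figures.

```python
def expected_hits(n_datasets, prevalence):
    """Expected number of datasets containing the virus."""
    return n_datasets * prevalence

def prob_at_least_one(n_datasets, prevalence):
    """Probability of at least one positive dataset, assuming independence."""
    return 1.0 - (1.0 - prevalence) ** n_datasets

# Hypothetical dataset counts for a single host species.
for n in (10, 100, 1000):
    print(n, expected_hits(n, 0.001), round(prob_at_least_one(n, 0.001), 3))
```

Under these assumptions, a host species needs on the order of a thousand datasets before even one detection is more likely than not.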

ababaian commented 4 years ago

Possible solution. We should complete our first pass of vertebrate SRA data this weekend. We invite anyone with metagenome / metatranscriptome or RNA-seq data from any vertebrate that is not in SRA to share their data with us, and we will process it for free in under 24 hours. There is probably another few hundred petabytes of 'dark' RNA-seq data sitting on hard drives all over the world, but encouraging people to share it is a social problem.