ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Test bvfilter as alternative to checkv #169

Closed by rcedgar 4 years ago

rcedgar commented 4 years ago

RCE test on 1k assemblies

I tested bvfilter with a 16-mer index (0.5 GB RAM) on the contigs from the 1k test. The filter thinks only 5,549 of the 10,021 total contigs are good Cov (55%; the remaining 45% are predicted non-Cov). Edit: it turns out most of these are coronaspades gene cluster contigs, which are expected to contain some non-Cov.

Downloads

Files posted to S3:

  1. Binary file bvfilter
  2. Predictions for all contigs in 1k assembler test: all_contigs.tsv
  3. 16-mer Cov index file cov.k16.bv
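The issue does not describe how bvfilter works internally. The option name -search_bitvec and the .bv extension suggest a bit vector of k-mer presence, so the following is only a minimal sketch of that general technique, not bvfilter's actual code. All function names here are hypothetical, and the 2-bits-per-base encoding is an assumption.

```python
# Hypothetical sketch of a k-mer bit-vector filter; NOT bvfilter's actual code.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit base encoding


def index_bytes(k):
    """Bytes needed for one bit per possible k-mer (4**k bits)."""
    return 4 ** k // 8


def kmer_codes(seq, k):
    """Yield an integer code for every ACGT-only k-mer in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        code = 0
        for base in seq[i:i + k]:
            if base not in CODE:
                break  # skip k-mers containing N or other ambiguity codes
            code = (code << 2) | CODE[base]
        else:
            yield code


def build_index(ref_seqs, k):
    """Set one bit for each k-mer seen in the reference sequences."""
    bits = bytearray(max(1, index_bytes(k)))
    for seq in ref_seqs:
        for code in kmer_codes(seq, k):
            bits[code >> 3] |= 1 << (code & 7)
    return bits


def fraction_found(seq, bits, k):
    """Return (k-mers found, fraction of query k-mers found in the index)."""
    total = found = 0
    for code in kmer_codes(seq, k):
        total += 1
        if bits[code >> 3] & (1 << (code & 7)):
            found += 1
    return found, (found / total if total else 0.0)
```

Under this assumption, index_bytes(16) is 2**29 bytes (512 MiB), which would be consistent with the 0.5 GB of RAM quoted for the 16-mer index. For experimenting, a small k such as 4 keeps the bit vector tiny.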

Usage

bvfilter -search_bitvec contigs.fasta -ref cov.k16.bv -tabbedout results.tsv

Output

TSV file with these fields:

  1. FASTA defline. To truncate at the first whitespace, use the -trunclabels option.
  2. Number of k-mers found.
  3. Fraction of query k-mers found in the index.
  4. T (Cov) or F (non-Cov) prediction.

A contig is predicted Cov (T) if at least 10% of its k-mers are found in the Cov index. The maximum fraction is somewhere around 30%, not 100%, for various technical reasons too tedious to explain here.
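For anyone who wants to tally the predictions, here is a minimal sketch of reading the tabbed output. It assumes the four fields appear in the order listed above, that there is no header line, and that the fraction is written as a decimal number. The function names are hypothetical and the example rows are made up.

```python
import io


def read_predictions(handle):
    """Parse bvfilter-style tabbed output into a list of dicts.

    Assumed columns: label, k-mers found, fraction found, T/F prediction.
    """
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        label, found, fraction, call = fields[:4]
        rows.append({
            "label": label,
            "kmers_found": int(found),
            "fraction": float(fraction),
            "is_cov": call == "T",
        })
    return rows


def summarize(rows):
    """Return (predicted Cov, total, fraction predicted Cov)."""
    total = len(rows)
    n_cov = sum(r["is_cov"] for r in rows)
    return n_cov, total, (n_cov / total if total else 0.0)


# Made-up rows for illustration only; not taken from all_contigs.tsv.
example = io.StringIO("contig_1\t120\t0.25\tT\ncontig_2\t3\t0.01\tF\n")
print(summarize(read_predictions(example)))
```

On a real file, replace the StringIO object with open("results.tsv").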

Speed

<1 sec to process 10k contigs (30 MB of FASTA).

rcedgar commented 4 years ago

Further test results: bvfilter has 0 FPs on flom2 (our non-Cov virus reference) and a 99.93% positive rate on covref (our comprehensive Cov mapping reference). The remaining 7 apparent FNs are patents and very short sequences.

rchikhi commented 4 years ago

what was it trained on? (i.e. where do the 16mers come from)

rcedgar commented 4 years ago

Yes, the 16-mers come from covref, so we need to be careful that it is not over-trained and will generalize to new Covs. I've done some sanity checks and I'm confident that it is robust down to around 70% identity with known Cov sequences. For example, the bvfilter score distribution on the 1k assembly motifs is strongly bimodal, with a large gap between the predicted-Cov and predicted-non-Cov peaks. This is what I would expect if the predictions are correct, and quite different from what we would see if bvfilter had a sensitivity problem and most of the contigs were in fact Cov. Also, I spot-checked predicted false-positive contigs by BLASTing them and found no Cov; they were mostly host polymerase.
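One rough way to eyeball the bimodality described above is to bin the k-mer fractions (field 3 of the tabbed output) and print a text histogram. This is only a sketch: the function name, the bin count, and the 0.5 upper bound are assumptions chosen for illustration, and the scores below are made up.

```python
def fraction_histogram(fractions, n_bins=10, upper=0.5):
    """Count scores in n_bins equal-width bins over [0, upper].

    Scores above upper are counted in the last bin.
    """
    counts = [0] * n_bins
    for f in fractions:
        idx = min(int(f / upper * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up scores for illustration only.
scores = [0.02, 0.03, 0.01, 0.22, 0.27, 0.24]
for i, count in enumerate(fraction_histogram(scores)):
    print(f"{i * 0.05:.2f}  {'#' * count}")
```

A wide run of empty bins between a low-score peak and a high-score peak is the kind of gap the sanity check relies on.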

rcedgar commented 4 years ago

Closing issue, checkv seems to be good enough.