Closed rcedgar closed 4 years ago
Further test results: bvfilter has 0 FPs on flom2 (our non-Cov virus reference) and a 99.93% positive rate on covref (our comprehensive Cov mapping reference). The remaining 7 apparent FNs are patents and very short sequences.
What was it trained on? (i.e., where do the 16mers come from?)
Yes, the 16mers come from covref, so we need to be careful that it is not over-trained and will generalize to new Covs. I've done some sanity checks and I'm confident that it is robust down to around 70% identity with known Cov sequences. For example, the bvfilter score distribution on the 1k assembly motifs is strongly bimodal between predicted Cov and predicted non-Cov, with a large gap between the two peaks. This is what I would expect if the predictions are correct, and quite different from what we would see if bvfilter had a sensitivity problem and most of the contigs were in fact Cov. Also, I spot-checked predicted false-positive contigs by BLASTing them and found no Cov; they were mostly host polymerase.
Closing issue, checkv seems to be good enough.
RCE test on 1k assemblies
I tested bvfilter with a 16mer index (0.5 GB RAM) on contigs from the 1k test. The filter thinks only 5,549 out of 10,021 total contigs are good Cov (45%). Edit: it turns out most of these are coronaspades gene cluster contigs, which are expected to contain some non-Cov.
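The 0.5 GB RAM figure (reading "Gb" as gigabytes) is what you would get from a direct-addressed bit vector with one bit per possible 16mer. That layout is an assumption on my part, suggested by the `-search_bitvec` option and the `.bv` file extension rather than stated anywhere in this thread. A minimal sketch of the arithmetic and of a 2-bit-per-base encoding, where `kmer_to_index` is a hypothetical helper and not part of bvfilter:

```python
# Assumption: the index holds one bit for every possible 16-mer, addressed by
# packing each base into 2 bits. This is not confirmed as bvfilter's layout.
K = 16
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_index(kmer):
    """Pack a k-mer into an integer, 2 bits per base.

    Returns None if the k-mer contains a letter other than A, C, G or T.
    """
    index = 0
    for base in kmer.upper():
        code = CODE.get(base)
        if code is None:
            return None
        index = (index << 2) | code
    return index


n_bits = 4 ** K          # number of distinct 16-mers
n_bytes = n_bits // 8    # one bit each
print(n_bits, n_bytes, n_bytes / 2 ** 30)
# prints: 4294967296 536870912 0.5
```

Under this assumption a lookup is a single bit test at position `kmer_to_index(kmer)`, which would also be consistent with the very fast search times reported here.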
Downloads
Files posted to S3:
Usage
bvfilter -search_bitvec contigs.fasta -ref cov.k16.bv -tabbedout results.tsv
Output
TSV file with these fields:
-trunclabels option.
The prediction is based on a minimum of 10% Cov k-mers. The maximum is somewhere around 30%, not 100%, for various technical reasons too tedious to explain here.
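To make the 10% rule concrete, here is a rough sketch of classifying a contig by the fraction of its 16mers found in a reference set. It is not bvfilter's implementation: a Python `set` stands in for the bit vector, the function names are hypothetical, and details such as strand handling, ambiguous bases, and whether the threshold is inclusive are my assumptions. It also does not model whatever keeps the real score at around 30% rather than 100%.

```python
K = 16
MIN_FRACTION = 0.10  # the 10% minimum quoted above


def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq (forward strand only)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def cov_fraction(seq, ref_kmers, k=K):
    """Fraction of the k-mers in seq that are present in ref_kmers."""
    total = 0
    hits = 0
    for kmer in kmers(seq, k):
        total += 1
        if kmer in ref_kmers:
            hits += 1
    # Sequences shorter than k have no k-mers and score 0.
    return hits / total if total else 0.0


def is_cov(seq, ref_kmers, k=K, min_fraction=MIN_FRACTION):
    """Predict Cov when at least min_fraction of the k-mers hit the reference."""
    return cov_fraction(seq, ref_kmers, k) >= min_fraction


# Toy reference sequence, made up purely for illustration.
reference = "ACGTACGTACGTACGTACGTAGCTAGCTAGGATC"
ref_kmers = set(kmers(reference))

print(is_cov(reference, ref_kmers))   # every k-mer is in the reference
print(is_cov("T" * 40, ref_kmers))    # no k-mer is in the reference
```

Sequences shorter than 16 bases contain no 16mers at all in this sketch, so they can never be called Cov, which is one plausible way very short sequences could end up as false negatives.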
Speed
<1 sec to process 10k contigs = 30 Mb FASTA.