ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Use blastn_vdb instead of bowtie2? #158

Closed: rcedgar closed this issue 3 years ago

rcedgar commented 4 years ago

Moving this suggestion by @taltman from slack to here for discussion:

"I’ve written up an analysis of using the SRA toolkit’s blastn_vdb program, which allows for performing alignments against one or more SRA files without the need to dump the contents to FASTQ first. The sensitivity is superior to Bowtie2 based on the benchmarks that @Artem Babaian performed. Runtime to process a “real” SRA file is one minute, using the SARS-CoV-2 reference genome as a “query”." Full analysis here: https://github.com/ababaian/serratus/blob/taltman-dev/notebook/200518_ta_blastx-reads/README.md

rcedgar commented 4 years ago

Thanks, very interesting. A few initial thoughts, with apologies in advance if I missed stuff by skimming.

- With identity as low as 60%, false positive alignments may be a problem. With an E-value of 0.001 and -max_target_seqs 100000, blast itself is predicting up to 100000 * 0.001 = 100 false positives. With this in mind, how would we select datasets in practice? In a dataset like SRR1168901 (~2.6K reads at low coverage to KF600647.1), Cov might be totally buried in the noise.
- With bowtie2, the cost of increasing the mapping reference size is negligible, while blast search time scales linearly with the size of the database. With the ~1k full-length Cov genomes we use in the current bowtie2 mapping reference, I would therefore expect this approach to run 1000x slower, and there are ~50 other virus families in the current bowtie2 reference in addition to Cov.
- The notebook raises the possibility of using a smaller Cov reference given the increased sensitivity of blastn. Clustering full-length Cov genomes at 90% nt identity reduces the number of genomes to 135, and I doubt we would be comfortable clustering more aggressively than that. In addition, there are many Cov species known only from single-gene amplicon sequences.
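
For the clustering step, something along these lines would produce the 90%-identity reduction (a sketch assuming usearch; input and output names are illustrative):

    # Cluster full-length Cov genomes at 90% nucleotide identity and keep
    # one centroid per cluster as the reduced reference (~135 genomes).
    usearch -cluster_fast cov_genomes.fa -id 0.90 -centroids cov90.fa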

taltman commented 4 years ago

Hi @rcedgar , thanks for converting this to an issue.

Couple of responses:

ababaian commented 4 years ago

In the end, bowtie2's specificity is what led us to select it over bwa, which was more sensitive. Robert ran a series of tests on real data looking at CoV- libraries. Do you know where those notes are, Robert? It'd be good to have them in the repo; I think it was an email.

Given this sensitivity, specificity is the only really critical parameter.

rcedgar commented 4 years ago

The relevant benchmark is a classifier test, not an alignment test. We can tolerate false positive alignments if, but only if, the classifier can handle them.

It is not clear that blastn_vdb has better sensitivity unless we use a comparable mapping reference and align all reads. In the initial test of blastn_vdb, there was only one Cov genome and only a subset of reads was aligned. Using a comparable mapping reference will increase the cost by two orders of magnitude, and aligning all reads will also increase the time.

Finding valid Cov alignments is not sufficient validation either. To be practically useful, we need a classifier which gives higher scores to datasets with Cov (or some other virus family). We need to know with high confidence, very quickly, that a new nt alignment method has a decisive advantage, because we have three weeks left. We would also need to re-run host screening and black-region analysis, because blastn surely has different characteristics than bowtie2, and the summarizer would need to be modified: blastn does not generate SAM, and the summarizer is primarily based on coverage, which may be very low with this approach even if there is a complete virus in the data.

rcedgar commented 4 years ago

To @taltman's point about the E-value threshold, it is true that there may be many FPs. But note a couple of points:

1. The method is diamond+summarizer, not diamond alone.
2. With diamond, all reads from a dataset are aligned to a viral mega-proteome with dense coverage of Cov and moderate coverage of 50 other families. This enables an assessment of gene coverage which can tolerate false positives.

I don't see how to accomplish this with blastn_vdb.
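
For comparison, the diamond step described here looks roughly like this (a sketch; the database and file names are illustrative):

    # Build the viral mega-proteome database once, then align all reads.
    diamond makedb --in viral_proteome.faa -d viral_proteome
    diamond blastx -d viral_proteome -q reads.fq -o hits.tsv \
        --evalue 0.001 --outfmt 6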

rcedgar commented 4 years ago

To find Cov-'s, you could use any set of summarizer files and select those with Cov scores of 0. Of course, this assumes that the current summarizer didn't miss anything, but I would say this is a better standard than the previous Cov-'s. If blastn_vdb finds a convincing hit in a dataset which bowtie2+summarizer scores at 0 this would be very interesting.
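
A sketch of that selection, under the purely hypothetical assumption that each summarizer file carries a tab-separated family/score line (the real summarizer format may differ):

    # Hypothetical: print the names of summarizer files whose
    # Coronaviridae score is exactly 0.
    awk -F'\t' '$1 == "Coronaviridae" && $2 == 0 { print FILENAME }' summary/*.summary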

rcedgar commented 4 years ago

For Cov+'s, a good starting point is the assembler benchmark set because PEDV and IBV are far from Cov-2. And of course Frank and Ginger. Don't want to miss these guys...

rcedgar commented 4 years ago

I think we should clarify the motivation for this exercise before we put more work into it. Diamond is MUCH more sensitive than blastn. Bowtie2+covm+summarizer works very well. So given where we are, even if we could implement a new nt aligner and new summarizer and curate the nt reference, is this the best use of our time?

taltman commented 4 years ago

align all reads

Just to clarify, blastn_vdb does align all reads against the query. It just limits the number of matches in the output (which only truncated the output for the SARS-CoV-2 lung lavage sample, which was super-saturated with CoV reads).

taltman commented 4 years ago

because blastn does not generate SAM

Actually, both blastn and blastn_vdb now support SAM output, which can be piped to SAMtools for conversion to BAM. See the docs below:

 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = Pairwise,
     1 = Query-anchored showing identities,
     2 = Query-anchored no identities,
     3 = Flat query-anchored showing identities,
     4 = Flat query-anchored no identities,
     5 = BLAST XML,
     6 = Tabular,
     7 = Tabular with comment lines,
     8 = Seqalign (Text ASN.1),
     9 = Seqalign (Binary ASN.1),
    10 = Comma-separated values,
    11 = BLAST archive (ASN.1),
    12 = Seqalign (JSON),
    13 = Multiple-file BLAST JSON,
    14 = Multiple-file BLAST XML2,
    15 = Single-file BLAST JSON,
    16 = Single-file BLAST XML2,
    17 = Sequence Alignment/Map (SAM),
    18 = Organism Report
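
So a conversion pipeline could look like this (a sketch; the run accession and file names are illustrative):

    # Emit SAM (-outfmt 17) on stdout and convert to BAM with samtools.
    blastn_vdb -db "SRR1168901" -query cov2_ref.fa -outfmt 17 \
        | samtools view -b -o hits.bam -
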
taltman commented 4 years ago

I don't see how to accomplish this with blastn_vdb.

I plan on testing a database of CoV gene nucleotide sequences, which should provide a similar tolerance to false positives. I can also back-translate the protein sequences to nucleotide space and encode the codons using ambiguous nucleotide codes, to allow for sensitivity comparable to blastx.

taltman commented 4 years ago

One update: it takes ~1 minute to run on SRA samples with CoV reads, and ~20 seconds on a sample without CoV reads. I need to screen a larger set of CoV- samples, to see if this holds true.
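
That screen could be scripted along these lines (a sketch; the accession list file is illustrative):

    # Time blastn_vdb over a list of presumed CoV-negative runs.
    while read acc; do
        echo "== $acc"
        time blastn_vdb -db "$acc" -query cov2_ref.fa -evalue 0.001 -out /dev/null
    done < cov_negative_accessions.txt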

taltman commented 4 years ago

Diamond is MUCH more sensitive than blastn.

It would be good if we could have a shared set of simulated reads from which to demonstrate the sensitivity of these tools. I would propose:

  1. Download all of the genes (nucleotides) from the NCBI reference genome for SARS-CoV-2
  2. Introduce the various levels of mutation while generating reads, as in @ababaian's experiments

Then, we could gauge the sensitivity of diamond vs. blastn_vdb.
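
For step 2, one option for the read generation is wgsim (a sketch; the parameters are illustrative and would need to match @ababaian's settings):

    # Simulate paired 100 bp reads from the SARS-CoV-2 genes at ~40% divergence
    # (-r mutation rate), with no additional base-calling error (-e 0).
    wgsim -e 0.0 -r 0.40 -N 6000 -1 100 -2 100 \
        cov2_genes.fna sim_1.fq sim_2.fq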

rcedgar commented 4 years ago

I am certain that diamond is more sensitive, I recommend against spending time on verifying that.

rcedgar commented 4 years ago

The best test to verify this is the cross-genus hold-out that I did with diamond: take all of the betacovs out of the reference and see if you can detect Cov-2 with only alphacovs in the reference. This test must include a classifier with positive and negative datasets to show that you can distinguish signal from noise. The hold-out shows that bowtie2 probably cannot detect a new genus, while diamond probably can. There may be no undiscovered Cov genus, but if there is, we now have a shot at finding it.
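
Concretely, the hold-out amounts to something like this (a sketch assuming seqkit for the ID filtering; the ID list and file names are illustrative):

    # Drop Betacoronavirus entries from the protein reference, rebuild the
    # diamond database, and check whether Cov-2 reads are still detected.
    seqkit grep -v -f betacov_ids.txt cov_ref.faa > alphacov_only.faa
    diamond makedb --in alphacov_only.faa -d alphacov_only
    diamond blastx -d alphacov_only -q cov2_reads.fq -o holdout_hits.tsv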

taltman commented 4 years ago

Diamond's own benchmarking results show it is not as accurate as blastn and blastx, depending on what mode it is running in. But it's close, and it's definitely more performant for sifting through large piles of reads.

rcedgar commented 4 years ago

Sounds like we have converged, closing issue.

taltman commented 4 years ago

No, I definitely think that blastn_vdb may have a role to play. I am running a benchmark against Bowtie2. Based on that, it might be able to act as a quick 'predicate' to see whether more computational resources should be thrown at a given sample.
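
The predicate could be as simple as this (a sketch; the threshold, file names, and downstream step are illustrative):

    # Cheap screen: only hand the run to the full pipeline if blastn_vdb
    # reports any hits at all.
    hits=$(blastn_vdb -db "$acc" -query cov2_ref.fa -evalue 0.001 -outfmt 6 | wc -l)
    if [ "$hits" -gt 0 ]; then
        run_full_pipeline "$acc"   # hypothetical downstream step
    fi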

taltman commented 4 years ago

I just ran Bowtie2 against the 40% divergence read set from Artem's simulations, with the cov3ma database as the reference. The alignment rate was just 2.13%, compared with ~63% for blastn. In other words, blastn with a single genome as a reference is doing better than Bowtie2 using a ginormous pile of coronavirus genomes.

Here is the Bowtie2 output:

 time bowtie2 --very-sensitive-local \
      -x cov.index -1 benchmark/fq/sim.cov.12000_1.fq -2 benchmark/fq/sim.cov.12000_2.fq > sim.cov.12000.bt2.sam
7475 reads; of these:
  7475 (100.00%) were paired; of these:
    7412 (99.16%) aligned concordantly 0 times
    53 (0.71%) aligned concordantly exactly 1 time
    10 (0.13%) aligned concordantly >1 times
    ----
    7412 pairs aligned concordantly 0 times; of these:
      0 (0.00%) aligned discordantly 1 time
    ----
    7412 pairs aligned 0 times concordantly or discordantly; of these:
      14824 mates make up the pairs; of these:
        14632 (98.70%) aligned 0 times
        66 (0.45%) aligned exactly 1 time
        126 (0.85%) aligned >1 times
2.13% overall alignment rate

real    0m0.660s
user    0m0.554s
sys 0m0.085s
brietaylor commented 4 years ago

Having a hard time following all the biology talk. I appreciate that you've taken performance into account. In the future it would be super useful if you could also include user/sys time when measuring compute.

For single-threaded, CPU-bound programs these will be comparable (user ~= real), but for multi-thread CPU-bound or non-CPU-bound programs these will be very different. Knowing which class these programs fall into would be really useful from a scaling standpoint (coming from the point of view of a non-expert).

At the moment with bowtie2 we're highly CPU-bound, so as a result we've aimed to optimize threading parameters for low total user/sys time. This might change in the future if we change the alignment program but for now that's the main quantity of interest, so it would be useful to at least know how these algorithms compare on that front.
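
For example, with a CPU-bound aligner like bowtie2, thread count makes the distinction visible (a sketch reusing the benchmark files from above; with -p 4 on a CPU-bound run, user time should come out roughly 4x real time):

    # Compare wall-clock (real) vs. total CPU (user+sys) across thread counts.
    time bowtie2 -p 4 --very-sensitive-local -x cov.index \
        -1 benchmark/fq/sim.cov.12000_1.fq -2 benchmark/fq/sim.cov.12000_2.fq \
        > /dev/null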

brietaylor commented 4 years ago

Also, maybe you did this, but I didn't see it mentioned. When gathering timing information, you should aim to run each test multiple times. The first run will almost always be the slowest, since the data is cold and needs to be pulled from disk.
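
For example (a sketch): one untimed warm-up pass, then several timed repeats:

    # The first pass warms the page cache; subsequent timings reflect
    # steady-state performance rather than cold-disk reads.
    blastn_vdb -db "SRR1168901" -query cov2_ref.fa -out /dev/null
    for i in 1 2 3; do
        time blastn_vdb -db "SRR1168901" -query cov2_ref.fa -out /dev/null
    done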