bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0

diamond slow, compared to blastp ? #413

Open k--r opened 3 years ago

k--r commented 3 years ago

I am trying to determine whether to add sequence searching with diamond to our website https://biocyc.org/ , in addition to the existing BLAST-based searches. BioCyc currently contains about 18,000 genomes (mostly bacterial). The use case is to support single-sequence searches by users across all the genomes. The query could be either an amino acid sequence for a protein or the corresponding nucleotide sequence.

I did a simple test on one of our servers after installing diamond v2.0.4.142. I built the binary .dmnd DB from all 18,000 genomes; the file size was almost 30 GB. The query sequence, given as an amino acid FASTA file, was E. coli trpA, a protein of 268 aa.
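(For reference, the database was built in a single step of the form sketched below; the input file name is a placeholder, since the actual build command is not reproduced in this post.)

diamond makedb --in biocyc-all.fsa -d biocyc-all --threads 6    # writes biocyc-all.dmnd (~30 GB in this case)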

When I ran this on the server with 6 cores, it took 9 min 53 s.

I ran the corresponding search with blastp (Protein-Protein BLAST 2.3.0+). The corresponding binary BLAST DBs are larger, but are split up into multiple files. The same query sequence, on the same server with 6 cores, took 3 min 31 s.

Below are the detailed invocations and the detailed info for one of the cores.

I was surprised to see that diamond was so much SLOWER than BLAST, given that diamond is advertised as being faster.

I wonder whether the difference is that BLAST uses multiple files for its binary DB, whereas diamond uses one monolithic, gigantic file. BLAST might therefore be able to read just the relevant parts of the DB, whereas diamond is forced to read everything. Indeed, looking at "top" and the messages printed in the terminal, it appears that the 6 cores were only fully busy for relatively short bursts of time, interrupted by lots of disk-reading activity.

I tried the experiment of loading the entire .dmnd file into a tmpfs filesystem (on the Ubuntu 16.04 Linux system). The entire 30 GB .dmnd file should then reside in RAM, eliminating disk reads. That run took 6 min 23 s. So it is faster, but still SLOWER than BLAST reading from disk.

I am aware that diamond is intended for large query files, so that the time spent per sequence in the query file becomes small overall. Our use case is the opposite: we would like this to be fast for individual, small query inputs.
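One conceivable way to amortize the per-run indexing cost, sketched below purely for illustration (the spool path, file names and splitting step are made up; only the diamond flags are standard), would be to collect the pending queries of several users into one FASTA file, run diamond once over the batch, and then split the tabular results by query ID:

cat /var/spool/diamond-queries/*.fsa > batch.fsa                  # collect pending user queries
diamond blastp --threads 6 --sensitive -d biocyc-all.dmnd -q batch.fsa -o batch.tsv --outfmt 6
awk '{ print > ("result_" $1 ".tsv") }' batch.tsv                 # split tabular output by query ID (column 1)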

This could probably be made to work if diamond had a mode in which it runs as a "daemon", loading its .dmnd file once and then staying alive for a long time, accepting short query inputs and directly returning the results. I did not find any mention of this kind of operating mode in the documentation I saw. Did I miss something?

Or, if diamond is missing such a daemon mode, is this a feature that you would be willing to implement? I think it could be tremendously useful, and it is probably not very difficult to implement. Please let me know what you think of this suggestion.

Thanks in advance, Markus Krummenacker.

pecocyc:biocyc15:~ > time ./diamond blastp --threads 6 --sensitive --outfmt 5 -d ./biocyc-all.dmnd -q trpA.fsa -o trpA-diamond-p-biocyc.xml
diamond v2.0.4.142 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org
...
Total time = 593.291s
Reported 25 pairwise alignments, 25 HSPs.
1 queries aligned.
1839.440u 77.432s 9:53.44 323.0% 0+0k 58640208+1680io 46pf+0w

pecocyc:biocyc15:~ > time ./diamond blastp --threads 6 --sensitive --outfmt 5 -d /tmpfs/biocyc-all.dmnd -q trpA.fsa -o trpA-diamond-p-biocyc.xml
diamond v2.0.4.142 (C) Max Planck Society for the Advancement of Science
...
Total time = 383.433s
Reported 25 pairwise alignments, 25 HSPs.
1 queries aligned.
1820.968u 69.736s 6:23.57 492.9% 0+0k 8032+1680io 36pf+0w

pecocyc:biocyc15:~ > time blastp -db /export/home/biocyc152/pecocyc/aic-export/BlastDB/biocyc-all.fsa -query trpA.fsa -evalue 10 -html -seg no -matrix BLOSUM62 -gapopen 11 -gapextend 1 -num_threads 6 -out trpA.blastp-biocyc.html
530.992u 10.000s 3:31.01 256.3% 0+0k 40333496+816io 1560pf+0w

processor       : 23
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
stepping        : 4
microcode       : 0x42e
cpu MHz         : 1381.488
cache size      : 15360 KB
physical id     : 1
siblings        : 12
core id         : 5
cpu cores       : 6
apicid          : 43
initial apicid  : 43
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips        : 4202.57
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

bbuchfink commented 3 years ago

Hi Markus, as you have noted correctly, Diamond is optimized to be used with large query files. If you use 1,000,000 proteins as input, you will surely get a big speedup. The Diamond algorithm does not work well for small queries, mostly due to the use of multiple spaced seeds and runtime generation of indices. When using the sensitive mode with 16 shapes, the index that Diamond generates at runtime for a 30 GB database is about 4.3 TB in size, so it is too big to save on disk or keep in memory using a daemon process.
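As a rough order-of-magnitude check (the residue count and the bytes-per-entry figure below are assumptions for illustration; only the 30 GB database size and the ~4.3 TB index size come from the numbers above):

~30 GB .dmnd database        ->  on the order of 2-3 x 10^10 amino acid residues
x 16 spaced seed shapes      ->  roughly 4 x 10^11 seed positions to index
x ~10 bytes per index entry  ->  about 4 TB, consistent with the ~4.3 TB figure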

That said, I have of course recognized this shortcoming and plan to build a mode into Diamond that is better suited to this small-query problem. It is not a simple modification, however, and the timeframe will probably be 2-3 months from now.

I'd also like to understand your application better. For example, do you essentially need the full sensitivity of BLAST, or is a reduced sensitivity in the range of <40% identity (as when running Diamond in sensitive mode) also sufficient for you?

oschwengers commented 3 years ago

Hi @bbuchfink , thanks for your very interesting comment on your plans! May I add another scenario that I'd be very much interested in and which, for sure, will also be of relevance to others? In bacterial genome annotation, e.g. Bakta, we often have rather small protein query files (~5k entries) and run them against a larger DB, e.g. UniRef90 (~80,000,000 entries). Here, sensitivity is not an issue (at least for me), as I run diamond with --query-cover 80, --subject-cover 80 and --id 90 anyway.
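For concreteness, a typical invocation for this scenario would look roughly like the sketch below (database name, query file and thread count are placeholders; the coverage and identity cut-offs are the ones mentioned above):

diamond blastp --threads 8 -d uniref90.dmnd -q bakta_proteins.faa -o hits.tsv --outfmt 6 --query-cover 80 --subject-cover 80 --id 90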

Is there a possibility that you could also tweak the algorithm towards these use cases? That would be very useful and very much appreciated!

bbuchfink commented 3 years ago

Hi Oliver, Diamond was not designed for this use case of >90% identity hits only, so I'm pretty sure that substantial speedups are possible there. Simply building a faster mode that uses longer seeds would be pretty easy; I can look into that in the next few days. Further improving performance for small query numbers will take some more time, as I explained.

k--r commented 3 years ago

> That said, I have of course recognized this shortcoming and plan to build a mode into Diamond that is better suited to this small-query problem. It is not a simple modification, however, and the timeframe will probably be 2-3 months from now.

> I'd also like to understand your application better. For example, do you essentially need the full sensitivity of BLAST, or is a reduced sensitivity in the range of <40% identity (as when running Diamond in sensitive mode) also sufficient for you?

Hi Benjamin,

Thanks for your reply and detailed explanation. Also thanks for looking into implementing a mode for the small queries.

Our main goal would be to provide better interactivity to our users when they launch small queries, so they receive results fairly quickly instead of having to wait for several minutes. So I think the Diamond sensitive mode is probably totally adequate.

Of course, if you were to succeed in providing BLAST sensitivity at a speed that is noticeably faster than BLAST, this could not possibly hurt... :-)

Thanks again for all your efforts, Markus Krummenacker.

oschwengers commented 3 years ago

@bbuchfink , FYI: to keep things more traceable and manageable, I opened a separate issue to address potential speedups for these high-identity use cases -> #419. Thanks a lot for taking care of this!