bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0
1.03k stars, 183 forks

Big difference for different versions #457

Open 473021677 opened 3 years ago

473021677 commented 3 years ago

Hi, I am using diamond to run an all-vs-all blastp analysis on a 1 GB protein file, but I have run into a problem. With diamond v0.9.8.109, the resulting daa and m8 files are 97.85 GB and 37.82 GB, respectively. However, with diamond v2.0.6, the resulting daa and m8 files are 97.85 GB and 37.82 GB, respectively. There is a big difference between v0.9.8.109 and v2.0.6, and I don't know what is wrong. Could you help me? I would really appreciate it. Thanks very much.

Best regards

bbuchfink commented 3 years ago

You listed the same file sizes for both runs, I assume that is an error? You can try to reproduce this problem on a smaller sequence set so I can take a look (please also include command lines). Also, much has changed about the algorithm between these 2 versions, so I would not expect them to produce identical results.

473021677 commented 3 years ago

Sorry. With diamond v0.9.8.109, the resulting daa and m8 files are 97.85 GB and 37.82 GB, respectively. However, with diamond v2.0.6, the resulting daa and m8 files are 143.63 GB and 5.40 GB, respectively. That is a big difference between v0.9.8.109 and v2.0.6. The command for creating the database is "diamond makedb --in Combined_archaea.fasta -d Combined_archaea" and the command for blastp is "diamond blastp -d Combined_archaea -q Combined_archaea.fasta -a Combined_archaea_diamond -p 20 -e 1e-10 --id 25 -k 250". The view command is "diamond view -a Combined_archaea -o Combined_archaea.m8". When I use a smaller sequence dataset (12.78M), the problem does not appear and the resulting file sizes are similar. Thanks.

Best regards,

bbuchfink commented 3 years ago

Try to run diamond view also with -k 250 when using v2.0.6, that may explain the difference in m8 file size.
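As a rough sketch of the suggestion above, the two steps could be scripted so that blastp and view always receive the same -k value. The commands and file names are the ones quoted in this thread; the `DRY_RUN` switch, the `run` helper, and the `DAA` variable are hypothetical additions for illustration, and whether `view -a` expects the archive name with or without a `.daa` suffix is an assumption to check against your installed version.

```shell
#!/bin/sh
# Sketch only. DRY_RUN, run() and the variable names are illustrative
# additions; the diamond options are the ones quoted in this thread.
DRY_RUN="${DRY_RUN:-1}"        # 1 = print the commands, 0 = execute them
K=250                          # one target limit shared by blastp and view
DAA="Combined_archaea_diamond" # assumed: the archive written by blastp -a

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

align_and_view() {
    run diamond blastp -d Combined_archaea -q Combined_archaea.fasta \
        -a "$DAA" -p 20 -e 1e-10 --id 25 -k "$K"
    # Pass the same -k to view, as suggested for v2.0.6.
    run diamond view -a "$DAA" -o Combined_archaea.m8 -k "$K"
}

align_and_view
```

With the default `DRY_RUN=1` the script only prints both command lines, which makes it easy to confirm that -k appears on each before starting a long run.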

473021677 commented 3 years ago

I ran diamond v2.0.6 view with -k 250, and the resulting file sizes are now similar to those from diamond v0.9.8.109. I will always run diamond v2.0.6 with -k 250 from now on. I would also like to know why the resulting file sizes were similar for the smaller dataset (12.78M). Thanks very much.

Best regards
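One possible explanation for the small dataset, not confirmed in this thread, is that few of its queries have more hits than whatever target limit view applies when -k is omitted, so little gets trimmed. A quick way to check this is to look at the largest number of alignment lines any single query has in each m8 file. The sketch below assumes the usual tabular layout with the query ID in column 1; the function name is made up for illustration, and a target with several HSPs is counted once per line.

```shell
#!/bin/sh
# Sketch only: max_lines_per_query is a hypothetical helper name.
# Prints the largest number of alignment lines that any single query
# has in a tab-separated (m8) file whose first column is the query ID.
max_lines_per_query() {
    awk -F '\t' '
        { n[$1]++ }
        END {
            m = 0
            for (q in n) if (n[q] > m) m = n[q]
            print m
        }
    ' "$1"
}

# Report the value for every file given on the command line.
for f in "$@"; do
    printf '%s\t' "$f"
    max_lines_per_query "$f"
done
```

If the m8 file written without -k never exceeds some small number of lines per query while the file written with -k 250 does, the size difference comes from the target limit rather than from the alignment step.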

starling13 commented 3 years ago

I also observe a significant speed loss between 0.9 and 2.0. Example runs of diamond 0.9.14 and 2.0.9:

Query: 518 AA sequences
DB: NCBI NR (.dmnd files of 128G, generated by the corresponding version)

Running on a 20-core HT server with 128G RAM

0.9.14:
diamond-0.9.14 blastp --db NR-0.9.14 -q ./query.fasta --out ./report-0.9.txt -p 40 1>./log.0.9.txt 2>&1
Time 12m, 12617 HSPs, 513 query sequences aligned

2.0.9:
diamond-2.0.9 blastp --db NR-2.0.9 -q ./query.fasta --out ./report-2.0.txt -p 40 -b5 -c1 1>log.2.0.txt 2>&1
Time 25m, 12711 HSPs, 516 query sequences aligned

bbuchfink commented 3 years ago

I would guess this is due to the runtime repeat masking, so try running with --masking 0. Diamond is not very efficient for such small query files, but improvements in this regard are upcoming.

starling13 commented 3 years ago

Thank you for the reply. With --masking 0, the time decreases from 25 to 20 minutes for version 2.0.9 and stays unchanged (about 10-12 minutes) for 0.9.14.

bbuchfink commented 3 years ago

I'm not sure what else could be causing this difference. Optimizations for small query files are available but still in beta stage, as described here: https://github.com/bbuchfink/diamond/issues/419#issuecomment-831154792 It will probably be a couple of weeks until I release this officially.

bbuchfink commented 3 years ago

v2.0.11 now contains some optimizations for small query files. You can also get the old behaviour back using the option --algo ctg, which may or may not improve performance depending on the file size. Note that this option should only be used for small query files.
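To see whether these options help for a particular small query file, the variants can be timed side by side. The following is only a sketch: the `timed` helper and `DRY_RUN` switch are hypothetical, the database name `NR-2.0.11` and the report file names merely follow the naming used earlier in this thread, and combining --algo ctg with --masking 0 is an assumption rather than a recommendation made here.

```shell
#!/bin/sh
# Sketch only. timed() and DRY_RUN are illustrative additions; the
# database and file names are assumed, modelled on those in this thread.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = run and time them

timed() {
    label="$1"; shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "$label: $*"
        return 0
    fi
    start=$(date +%s)
    "$@"
    end=$(date +%s)
    echo "$label: $((end - start)) s"
}

# Default algorithm in the newer version.
timed default diamond blastp --db NR-2.0.11 -q ./query.fasta \
    --out ./report-default.txt -p 40
# Old behaviour, intended for small query files only.
timed ctg diamond blastp --db NR-2.0.11 -q ./query.fasta \
    --out ./report-ctg.txt -p 40 --algo ctg
# Old behaviour with runtime repeat masking switched off as well.
timed ctg-nomask diamond blastp --db NR-2.0.11 -q ./query.fasta \
    --out ./report-ctg-nomask.txt -p 40 --algo ctg --masking 0
```

Wall-clock seconds from `date +%s` are coarse, but that is enough for runs that take minutes, like the ones reported above.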