bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0

DIAMOND linclust taking longer than expected #709

Closed: snayfach closed this issue 1 year ago

snayfach commented 1 year ago

Hi Ben, I'm running diamond linclust on a massive protein set, and it's taking longer than expected based on the data shown in Figure S1 of your manuscript.

In that figure, you reported 0.93 hours to run diamond linclust on NCBI NR (0.465 billion proteins) using 64 CPUs. I'm trying to recapitulate the database described in your paper (~18 billion proteins, 5 TB of FASTA) using 128 CPUs, 800 GB of RAM, and diamond v2.1.6.160, with the command: diamond linclust --approx-id 100 --member-cover 90.0 --memory-limit 800G. Scaling linearly by protein count (0.93 h × 18/0.465 ≈ 36 h), I expected a running time of ~36 hours, but it has been over 2 weeks (336 hours) and only 34 of 42 superblocks have been processed.

Do you think you could explain why it's taking so long? Should I expect it's 80% complete given that 34/42 superblocks are completed?

Many thanks

bbuchfink commented 1 year ago

If you use --approx-id 100, only identical sequences are clustered together. That means you end up with a huge representative set, and mapping the input back to it takes a long time. We used --approx-id 30 for our clustering.
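For reference, a sketch of the invocation at 30% approximate identity, assuming diamond's usual -d/-o conventions for input and output; the file names and thread count are hypothetical, and the remaining options mirror the ones reported above:

```sh
# Hypothetical paths; --approx-id 30 matches the setting used in the paper.
diamond linclust -d proteins.faa -o clusters.tsv \
    --approx-id 30 --member-cover 90.0 \
    --memory-limit 800G --threads 128
```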

> Should I expect it's 80% complete given that 34/42 superblocks are completed?

Yes.

snayfach commented 1 year ago

That makes sense, thanks for your response.

snayfach commented 1 year ago

DIAMOND was killed at the last step, "Generating output...". I assume it was an out-of-memory kill, but my machine has over 800 GB of RAM. Do you understand what happened, and would I hit the same problem if clustering at 30% identity instead of 100%?

bbuchfink commented 1 year ago

The output generation has to load all accessions into memory at once, so it can fail for huge datasets like this one. I still need to fix this, but it's not easy and may take some time. One workaround is to replace the accessions with integer numbers to save space. Another is to manually length-sort and partition the input file as we did in the paper (the details are in the supplement). The memory use should not depend on your --approx-id setting.
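To illustrate the first workaround, a minimal sketch (file names are hypothetical) that rewrites each FASTA header as a sequential integer and writes a TSV mapping so the original accessions can be restored after clustering:

```sh
# Hypothetical file names. Replaces each FASTA header with a sequential
# integer and records integer -> original accession in accession_map.tsv.
awk '/^>/ { n++; print n "\t" substr($0, 2) > "accession_map.tsv"; print ">" n; next }
     { print }' proteins.faa > proteins.renamed.faa
```

After clustering, the integer IDs in the output can be joined against accession_map.tsv to recover the original headers.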

snayfach commented 1 year ago

Those are good suggestions, thank you.

snayfach commented 1 year ago

> The output generation has to load all accessions into memory at once, so it can fail for huge datasets like this one. I still need to fix this, but it's not easy and may take some time.

Just checking in to see whether there are any plans to fix this in the near term.

bbuchfink commented 1 year ago

I do want to fix it, but it may still take some time.

snayfach commented 1 year ago

Let's say my accessions require 100 GB of RAM and I have a machine with 1000 GB. Would I be in the clear if I ran diamond with --memory-limit 800G, leaving 200 GB for the accessions plus a buffer?

bbuchfink commented 1 year ago

Yes, that should work.
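As a rough sanity check on the 100 GB figure, the header bytes can be totaled directly (file name hypothetical); the in-memory footprint will be somewhat larger than the raw byte count:

```sh
# Lower-bound estimate of accession memory: total bytes of FASTA header lines.
grep '^>' proteins.faa | wc -c
```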