Closed snayfach closed 1 year ago
If you use --approx-id 100, only identical sequences are clustered together. That means you have a huge representative set, and mapping back to it will take a long time. We used --approx-id 30 for our clustering.
"Should I expect it's 80% complete given that 34/42 superblocks are completed?"
Yes.
That makes sense, thanks for your response.
DIAMOND was killed at the last step "Generating output...". I assume it's an out-of-memory issue, but my machine has over 800G of RAM. Do you understand what happened, and would I get the same result if clustering at 30% identity instead of 100%?
The output generation has to load all accessions into memory at once, so that may fail for huge datasets like this one. I still need to fix this, but it's not that easy and may take some time. One workaround would be to replace accessions with integer numbers to save space. Another is to manually length sort and partition the input file as we did in the paper (the info is in the supplement). The memory use should not change depending on your seqid% setting.
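The first workaround above (replacing accessions with integers) could be sketched roughly as follows. This is a minimal illustration, not part of DIAMOND: the function name renumber_fasta and the tab-separated mapping layout are assumptions made for the example. It streams the FASTA line by line, so the file itself is never held in memory.

```python
import io


def renumber_fasta(src, dst, mapping):
    """Copy FASTA records from src to dst, replacing each header with a running integer.

    src, dst and mapping are open text-mode file objects (hypothetical interface).
    For every record, one line '<integer>\t<original header>' is written to mapping
    so the original accessions can be restored after clustering.
    Returns the number of records seen.
    """
    count = 0
    for line in src:
        if line.startswith(">"):
            count += 1
            # Keep the full original header (accession plus description) in the table.
            mapping.write(f"{count}\t{line[1:].rstrip()}\n")
            dst.write(f">{count}\n")
        else:
            # Sequence lines are passed through unchanged.
            dst.write(line)
    return count


if __name__ == "__main__":
    # Tiny in-memory demo; for real data, pass files opened with open(...).
    demo = io.StringIO(">WP_000001.1 hypothetical protein\nMKV\n>WP_000002.1\nMAA\n")
    out, table = io.StringIO(), io.StringIO()
    n = renumber_fasta(demo, out, table)
    print(n, "records renumbered")
    print(out.getvalue(), end="")
```

After clustering, the integer IDs in the output would have to be translated back with the mapping table, which can be done in a streaming pass if the table is kept on disk or loaded on a separate machine.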
Those are good suggestions, thank you.
"The output generation has to load all accessions into memory at once, so that may fail for huge datasets like this one. I still need to fix this, but it's not that easy and may take some time."
Just checking to see if there are any plans to fix this in the near term.
I do want to fix it but it may still take some time.
Let's say my accessions require 100G of RAM and I have a machine with 1000G of RAM. Would I be in the clear if I ran diamond using a 800G memory limit, leaving 200G for the accessions plus a buffer?
Yes that should work.
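The budget in this exchange is simple arithmetic, sketched below with the numbers from the question. The helper name memory_headroom_gb is made up for the example, and the 100G accession footprint is the poster's hypothetical figure, not a measured one.

```python
def memory_headroom_gb(total_gb, memory_limit_gb, accessions_gb):
    """Return the RAM left over after the clustering limit and the accession table."""
    return total_gb - memory_limit_gb - accessions_gb


# Numbers from the question: 1000G machine, 800G memory limit, 100G of accessions.
headroom = memory_headroom_gb(total_gb=1000, memory_limit_gb=800, accessions_gb=100)
print(f"buffer left for everything else: {headroom} GB")
```

A negative result would mean the accession table alone no longer fits next to the chosen memory limit.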
Hi Ben - I'm running diamond linclust on a massive protein set and it's taking longer than expected based on the data shown in figure S1 of your manuscript.
In that figure, you reported 0.93 hours to run diamond linclust using 64 CPUs on NCBI NR (0.465 billion proteins). I'm trying to recapitulate the database described in your paper (~18B proteins, 5 TB FASTA) using 128 CPUs, 800 GB of RAM, and diamond v2.1.6.160 with the command:
diamond linclust --approx-id 100 --member-cover 90.0 --memory-limit 800G
I expected a running time of ~36 hours, but it's been over 2 weeks (336 hours) and it has processed only 34/42 superblocks. Do you think you could explain why it's taking so long? Should I expect it's 80% complete given that 34/42 superblocks are completed?
Many thanks
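For reference, the estimates in this post can be reproduced with the short sketch below. It rests on two assumptions that the thread does not confirm in general: that running time scales linearly with the number of proteins, and that every superblock costs about the same. The function names are invented for the example.

```python
def linear_runtime_hours(ref_hours, ref_proteins, target_proteins):
    """Scale a reference running time linearly with input size (assumed model)."""
    return ref_hours * target_proteins / ref_proteins


def remaining_hours(elapsed_hours, done, total):
    """Estimate the time left, assuming each superblock takes equally long."""
    return elapsed_hours / done * (total - done)


# Figure S1 reference point: 0.93 h for 0.465 billion proteins; target: ~18 billion.
expected = linear_runtime_hours(0.93, 0.465e9, 18e9)
# Progress so far: 34 of 42 superblocks after 336 hours.
fraction_done = 34 / 42
left = remaining_hours(336, 34, 42)

print(f"expected total: {expected:.0f} h")
print(f"fraction done:  {fraction_done:.0%}")
print(f"estimated left: {left:.0f} h")
```

Under these assumptions the linear extrapolation gives about 36 hours, the run is about 81% complete, and roughly 79 more hours remain at the observed pace. The extrapolation ignores the difference in CPU count (128 versus 64) and the effect of the identity threshold on the size of the representative set.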