chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads

Memory consumption #541

Open diego-rt opened 10 months ago

diego-rt commented 10 months ago

Hi again,

Apologies for opening a new issue, but this time it is on a different topic.

We are trying to assemble a giant genome of ~30 Gb using 70x HiFi coverage and are running into substantial memory consumption of roughly 2 TB. These are the flags we've used:

hifiasm -o Assembly.asm -t 176 -l 3 -f 40 -D 5 -k 63 -w 63 Revio.fq.gz

These are the resources used:

[M::main] Version: 0.19.6-r595

[M::main] Real time: 810896.293 sec; CPU: 67823081.546 sec; Peak RSS: 1837.470 GB

Notably, the first two error-correction rounds are reasonable memory-wise, but the last one nearly doubles peak memory. Is this expected?

[M::ha_assemble::232705.176*86.95@944.972GB] ==> corrected reads for round 1
[M::ha_assemble] # bases: 1987058683818; # corrected bases: 3830287191; # recorrected bases: 3478967
[M::ha_assemble] size of buffer: 27.072GB

[M::ha_assemble::347940.855*104.23@960.163GB] ==> corrected reads for round 2
[M::ha_assemble] # bases: 1987230338200; # corrected bases: 107238566; # recorrected bases: 460786
[M::ha_assemble] size of buffer: 24.663GB

[M::ha_assemble::655301.727*94.94@1837.470GB] ==> corrected reads for round 3
[M::ha_assemble] # bases: 1987234212699; # corrected bases: 6497264; # recorrected bases: 541425
[M::ha_assemble] size of buffer: 23.630GB

[M::ha_pt_gen::685751.842*92.90] ==> indexed 45341859427 positions, counted 695474658 distinct minimizer k-mers
[M::ha_assemble::706174.286*95.25@1837.470GB] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 13092338655
[M::ha_print_ovlp_stat] # strong overlaps: 4166602012
[M::ha_print_ovlp_stat] # weak overlaps: 8925736643
[M::ha_print_ovlp_stat] # exact overlaps: 12818368372
[M::ha_print_ovlp_stat] # inexact overlaps: 273970283
[M::ha_print_ovlp_stat] # overlaps without large indels: 13072073906
[M::ha_print_ovlp_stat] # reverse overlaps: 3181263971

Some questions listed as points for brevity:

  1. I understand that -k and -w raise memory consumption, but given the speed boost and theoretical assembly benefits I would rather keep using them. Is there anything else I could tweak to reduce memory consumption without sacrificing assembly quality?
  2. Do you think my bloom filter is too large? Should I reduce it? What is the proper way to estimate the -f parameter?
  3. What does size of buffer mean? Is this the size of the bloom filter?
  4. It seems like the memory consumption nearly doubles from round 2 to round 3. Is there anything that could still be deallocated at that point but perhaps is not? Sorry for the annoying question, but our node only has 1900 GB of RAM, and memory limits are getting in the way of experimenting with higher -D values or adding additional reads (e.g. duplex reads) to the mix.

Thanks a lot once again!

chhylp123 commented 10 months ago

@diego-rt Sorry for the late reply:

  1. Unfortunately, no.
  2. It is hard to say. Why would you like to increase it to 40?
  3. It is the buffer size used during error correction, not the size of the bloom filter.
  4. Not really right now.

But we are aware that certain sections of the code can be optimized to reduce hifiasm's memory requirements. Do you need to run it urgently? If so, I can cut a release that reduces memory usage.

diego-rt commented 10 months ago

Hi @chhylp123

No worries, thanks a lot for your reply as always!

  1. Ok
  2. I assumed that, since you wrote that 37 was ideal for human, in the worst case a genome ~10x the size (32 Gb) would also need a ~10x larger bloom filter (see the quick calculation below). But to be honest, it was unclear to me how to choose this value, how I would know if it is suboptimal, or what the implications would be. Would it help to upload the full log?
  3. Ok
  4. Ok
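
For reference, this is the back-of-the-envelope check behind my assumption in point 2. It treats -f as the log2 of the bloom filter size in bits (so each +1 doubles the filter), which is only my reading of the docs, and it only accounts for the filter itself, not any knock-on effect on k-mer counting memory:

# rough bloom filter footprint for a few -f values (my assumption, not confirmed)
for f in 37 38 39 40; do
  echo "-f $f -> $(( (1 << f) / 8 / 1024 / 1024 / 1024 )) GiB bloom filter"
done

By that reading, -f 40 would be 8x the 16 GiB of -f 37 (i.e. 128 GiB), not exactly 10x.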

Actually, that would be incredible! At the moment it's taking roughly 10 days to run (even with 88 cores), only to run out of memory during the last correction round, so we can't really run it on our full dataset yet.

Thanks a lot for your help once again!

chhylp123 commented 10 months ago

Thanks. I will cut a new release soon.

chhylp123 commented 10 months ago

But for now, you may also consider using fewer CPUs and a smaller value for -f. That may reduce the memory requirement.
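
As a sketch of what I mean (the numbers are only illustrative, not something I have tested on your data), halving the thread count and dropping -f from 40 to 38 may already lower the peak:

hifiasm -o Assembly.asm -t 88 -l 3 -f 38 -D 5 -k 63 -w 63 Revio.fq.gz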

diego-rt commented 10 months ago

Hi, does this week's release already implement the change?

chhylp123 commented 10 months ago

Not yet. I will push it to the GitHub HEAD this weekend.

diego-rt commented 10 months ago

Ok, super! Thanks a lot!

diego-rt commented 9 months ago

Hey there,

Sorry to bug you again 😅 Do you have any news on the memory consumption update?

diego-rt commented 7 months ago

Hi @chhylp123

Hope you are doing well and had a good start to the year! Do you by any chance have any updates on the memory consumption fix?