chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

high memory usage when using hi-c phasing but not with trio binning #602

Open ethan-baldwin opened 9 months ago

ethan-baldwin commented 9 months ago

I am assembling a ~3.5 Gb genome with ~100x HiFi read coverage. I want to compare the phasing results between Hi-C (Omni-C in this case) and trio binning. The trio-binning run completes using 400 GB of memory, but Hi-C phasing on the same error-corrected reads runs out of memory even when I give it 950 GB (almost the maximum on my university's cluster). There is 170 GB of Omni-C data.

Here is the log:

Reads has been loaded.
Loading ma_hit_ts from disk... 
ma_hit_ts has been read.
Loading ma_hit_ts from disk... 
ma_hit_ts has been read.
[M::ha_assemble::398.941*0.97] ==> loaded corrected reads and overlaps from disk
[M::ha_opt_update_cov_min] updated max_n_chain to 505
[M::purge_dups] homozygous read coverage threshold: 100
[M::purge_dups] purge duplication coverage threshold: 126
[M::ug_ext_gfa::] # tips::627
Writing raw unitig GFA to disk... 
Writing processed unitig GFA to disk... 
[M::purge_dups] homozygous read coverage threshold: 100
[M::purge_dups] purge duplication coverage threshold: 126
[M::mc_solve:: # edges: 3236]
[M::mc_solve_core_adv::0.232] ==> Partition
[M::adjust_utg_by_primary] primary contig coverage range: [85, infinity]
Writing sarracenia.asm.hic.p_ctg.gfa to disk... 
[M::ha_opt_update_cov] updated max_n_chain to 505
/var/lib/slurmd/job26800127/slurm_script: line 29: 2240832 Killed                  hifiasm -o s.asm -t32 --h1 ../reads/s_OmniC_I1371_L1_R2_P_R1.fastq.gz --h2 ../reads/s_OmniC_I1371_L1_R2_P_R2.fastq.gz ../reads/m84053_231129_210740_s2.hifi_reads.bc2012.fastq.gz ../reads/m84053_231129_213847_s3.hifi_reads.bc2012.fastq.gz ../reads/m84053_231129_220953_s4.hifi_reads.bc2012.fastq.gz ../reads/m84053_231204_221754_s3.hifi_reads.bc2012.fastq.gz
slurmstepd: error: Detected 1 oom_kill event in StepId=26800127.batch. Some of the step tasks have been OOM Killed.

BrianSmart commented 8 months ago

Hey Ethan! I just came across this post because I am doing a similar project with sunflower. Did you end up figuring this out? I'm wondering if you could increase the bloom filter setting (-f) to 38 or 39 to reduce memory usage. Or perhaps you could reduce the maximum k-mer occurrence threshold (--max-kocc)? Maybe you could even split your inputs and run hifiasm on two halves of your data? I'm new to this program, so I don't feel confident and am just brainstorming! Let me know if you have had any luck. I'm worried I will face the same issue when I get my Omni-C data back, since my trio-binning run is also using about 400 GB of memory!

chhylp123 commented 7 months ago

Another option is to use fewer CPUs, which may also help reduce memory usage. We plan to release a new version that uses less memory soon.
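To make the "fewer CPUs" suggestion concrete, here is a minimal sketch that prints (rather than runs) a Hi-C command line at several thread counts. It uses only the options that appear in the commands in this thread (-o, -t, --h1, --h2). The output prefix and read file names are placeholders, and wrapping the run in GNU `/usr/bin/time -v` to read "Maximum resident set size" is an assumption about the cluster environment, not something hifiasm requires.

```shell
#!/usr/bin/env bash
# Sketch only: build a hifiasm Hi-C command line with a chosen thread count
# and print it instead of running it. All file names are placeholders.

hifiasm_cmd() {
    local threads="$1"
    # GNU time (-v) reports "Maximum resident set size", which makes it
    # easy to compare peak memory between runs with different -t values.
    echo "/usr/bin/time -v hifiasm -o asm.hic -t ${threads}" \
         "--h1 hic_R1.fastq.gz --h2 hic_R2.fastq.gz hifi.fastq.gz"
}

# Print the command for a few thread counts, highest first.
for t in 32 16 8; do
    hifiasm_cmd "$t"
done
```

Copying one of the printed lines into the job script and comparing the reported peak memory across runs would show whether lowering -t actually helps for a given dataset.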

ethan-baldwin commented 7 months ago

Thanks for the helpful replies! I tried with fewer CPUs (from 64 down to 8) and got past the stage where I normally ran out of memory, but now I am getting a segmentation fault. hifiasm.hifiasm_27625222.txt

chhylp123 commented 7 months ago

Hi @ethan-baldwin, could you share the bin files with me? Then I could run a quick test to fix this issue. This looks like a bug, and fixing it would be very helpful for us.

ethan-baldwin commented 7 months ago

I would love to, but the bin files add up to ~300gb. What is the best way for me to share them with you?

chhylp123 commented 7 months ago

Thank you so much @ethan-baldwin! Could you please show me a screenshot for each bin file? Some bin files are not necessary for me to debug.

This is likely a small bug in the latest version of hifiasm; several other users have reported it as well. It would be very helpful if I could get the data and run a quick test to fix it. In the meantime, there is another option: running an older version with the current bin files (see: https://github.com/chhylp123/hifiasm/issues/613).

ethan-baldwin commented 7 months ago

Here is the directory: image

Do you want a screenshot of part of the bin files like this? image

When I have time I will try installing an older version of hifiasm.

chhylp123 commented 7 months ago

@ethan-baldwin Could you please share sarracenia.hic.ec.bin, sarracenia.hic.ovlp.source.bin, sarracenia.hic.ovlp.reverse.bin, and sarracenia.hic.hic.lk.bin with me? It would also help if you could share the command line and hifiasm version you were using. Thank you so much for your great help!
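Before arranging a large transfer, it can help to confirm that the four requested files exist and to see how big each one is. The following is a small sketch under stated assumptions: the `report_bins` helper is hypothetical, the directory and prefix are passed in by the caller, and the suffix list is taken from the file names in the request above.

```shell
#!/usr/bin/env bash
# Sketch: report which of the requested bin files exist for a given prefix,
# with their sizes in bytes. report_bins is a hypothetical helper name.

report_bins() {
    local dir="$1" prefix="$2" suffix f
    for suffix in ec.bin ovlp.source.bin ovlp.reverse.bin hic.lk.bin; do
        f="${dir}/${prefix}.${suffix}"
        if [ -f "$f" ]; then
            # wc -c prints the byte count; tr strips padding on some systems.
            printf '%s\t%s bytes\n' "${prefix}.${suffix}" "$(wc -c < "$f" | tr -d ' ')"
        else
            printf '%s\tMISSING\n' "${prefix}.${suffix}"
        fi
    done
}

# Example call for the prefix used in this thread, in the current directory.
report_bins "." "sarracenia.hic"
```

The output is one tab-separated line per file, which is easy to paste into a reply in place of a screenshot.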

ethan-baldwin commented 7 months ago

@chhylp123 I sent you an email to discuss how to transfer these large files. Here is the command:

hifiasm -o sarracenia.hic -t 8 \
--h1 ../reads/KXRJ_OmniC_NA_NA_TGAGCTAG_Sarracenia_baldwin_OmniC-Sarracenia_baldwin_OmniC_I1371_L1_R1.fastq.bz2 \
--h2 ../reads/KXRJ_OmniC_NA_NA_TGAGCTAG_Sarracenia_baldwin_OmniC-Sarracenia_baldwin_OmniC_I1371_L1_R2.fastq.bz2 \
../reads/m84053_231129_210740_s2.hifi_reads.bc2012.fastq.gz \
../reads/m84053_231129_213847_s3.hifi_reads.bc2012.fastq.gz \
../reads/m84053_231129_220953_s4.hifi_reads.bc2012.fastq.gz \
../reads/m84053_231204_221754_s3.hifi_reads.bc2012.fastq.gz

And the hifiasm version is 0.19.6

ethan-baldwin commented 6 months ago

@chhylp123 If you are still interested in troubleshooting this issue, I can share these files with you via Globus, unless you prefer another file-sharing solution. Thanks!