chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

hifiasm needing too much memory #239

Open dcopetti opened 2 years ago

dcopetti commented 2 years ago

Hello,

I have a problem similar to issue 222, but actually it is the opposite.

I am assembling a plant genome from two SMRT cells of HiFi data. The k-mer distribution shows a peak close to 90x, and I assume there is a heterozygous peak around 44x, but it is buried among low-frequency k-mers: [image: k-mers]. In total, there are more than 38 Gb of data. With the homozygous peak at ~87x, the genome should be approximately 438 Mb, as a rough estimate.
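For readers following the arithmetic, here is a minimal sketch of the estimate above: total bases divided by the depth of the homozygous peak. The function names are hypothetical, the 38 Gb input is the rounded lower bound from the post (which is why the result lands slightly under 438 Mb), and the even split of data between the two cells is an assumption, not something stated in the post. It also treats k-mer depth as if it were read depth, which is only approximately true.

```python
def estimate_genome_size_mb(total_bases_gb, homozygous_peak_depth):
    """Rough haploid genome size in Mb: total bases / homozygous peak depth."""
    return total_bases_gb * 1000.0 / homozygous_peak_depth


def depth_per_haplotype(total_bases_gb, genome_size_mb):
    """Approximate depth per haplotype for a diploid sample."""
    total_depth = total_bases_gb * 1000.0 / genome_size_mb
    return total_depth / 2.0


# Values taken from the post: >38 Gb of data, homozygous peak at ~87x.
size_mb = estimate_genome_size_mb(38.0, 87.0)

# ASSUMPTION: each SMRT cell contributes about half of the data (19 Gb).
one_cell_depth = depth_per_haplotype(19.0, size_mb)

print(f"estimated genome size: {size_mb:.0f} Mb")
print(f"one cell, per haplotype: {one_cell_depth:.1f}x")
```

Under that assumption, a single cell gives a little over 20x per haplotype, which is the figure to keep in mind when reading the single-cell results.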

When I run both cells with hifiasm:

~/bin/hifiasm_0.16.1-r375/hifiasm -t 200 -o Cgil_hifasm_l3 -l3 cell1.hifi_reads.fastq.gz cell2.hifi_reads.fastq.gz 2>Cgil_hifasm_l3_stdout

the job gets killed for using more than the 1 TB of memory that is available. This is the log file: Cgil_hifasm_l3_stdout.txt

I ran one cell on its own, and it completed, though it used almost 800 GB of memory. With other plant genomes, even much larger ones, I never ran out of memory like this. The assembly looks very fragmented: total sizes of 835 and 734 Mb, 12,000 and 9,000 contigs, and N50s of 82 kb and 100 kb for hap1 and hap2, respectively. I am now running the second cell by itself. Are there any parameters that could be tweaked to reduce the memory needed?

Summarizing: from the k-mer curve, it looks like I have >40x per allele (too much coverage?). Could that be a reason for the high memory demand? But when assembling only half of the data, I get a much larger assembly with very low contiguity, so it could actually be too little coverage, which is odd. I am even wondering whether the large number of k-mers below 60x could be contamination of some sort. Can you help me figure out what is happening? Thanks, Dario

chhylp123 commented 2 years ago

See the FAQ here: https://hifiasm.readthedocs.io/en/latest/faq.html#why-does-hifiasm-stuck-or-crash. And an example: https://github.com/chhylp123/hifiasm/issues/93#issuecomment-863916776. Probably not enough coverage, or contamination.

dcopetti commented 2 years ago

OK, but then how do I solve the memory issue? I don't think it is normal that an 800 Mb genome should need more than 1 TB of memory.

chhylp123 commented 2 years ago

It looks like this is caused by data quality issues, such as insufficient coverage or contamination. Even if the assembly runs to completion, the resulting assemblies will still be very fragmented. I have no idea how to fix data quality issues; probably getting more coverage, or finding a way to remove the contaminants?

dcopetti commented 2 years ago

What do you mean by data quality issues? This is what the HiFi data looks like; the median QV is 32: [image: Capture]. What else could I do about it?

chhylp123 commented 2 years ago

A good HiFi dataset should have a k-mer plot like issue10 or issue49. The k-mer plot of your dataset is very bad, i.e., there are large numbers of k-mers only occurring a few times.
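One way to put a number on "large numbers of k-mers only occurring a few times" is to measure how much of the total k-mer mass sits below the main peak. The sketch below assumes a two-column histogram (multiplicity, number of distinct k-mers) of the kind produced by common k-mer counters; the helper names are hypothetical and the histogram is made-up toy data for illustration, not the dataset from this issue. The 60x cutoff is taken from the question earlier in the thread.

```python
def parse_histogram(text):
    """Parse 'multiplicity count' lines into a dict {multiplicity: count}."""
    hist = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines
        hist[int(parts[0])] = int(parts[1])
    return hist


def main_peak(hist, min_depth):
    """Multiplicity with the most distinct k-mers at or above min_depth."""
    candidates = {d: n for d, n in hist.items() if d >= min_depth}
    return max(candidates, key=candidates.get)


def mass_fraction(hist, lo, hi):
    """Fraction of all k-mer occurrences with lo <= multiplicity < hi."""
    total = sum(d * n for d, n in hist.items())
    part = sum(d * n for d, n in hist.items() if lo <= d < hi)
    return part / total


# TOY DATA, invented for illustration only.
toy = """
1 1000
2 400
5 300
10 200
40 50
87 100
90 80
"""

hist = parse_histogram(toy)
peak = main_peak(hist, min_depth=20)       # ignore the error-dominated low end
low_mass = mass_fraction(hist, lo=2, hi=60)
print(f"main peak at {peak}x; {low_mass:.1%} of k-mer mass lies in 2x-59x")
```

A dataset with a clean plot would show only a small fraction in that range; a large fraction is what the plot in this issue suggests, though this sketch is a rough diagnostic and not how hifiasm itself evaluates the data.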

dcopetti commented 2 years ago

Yep, I agree that the low-frequency k-mers are the issue. But if we add coverage and move some of those k-mers to the right, there will be even more data, which a 1 TB machine can't assemble. Do you see the conundrum? How do we get out of this? Thanks

chhylp123 commented 2 years ago

For weird k-mer plots like yours, hifiasm cannot correctly determine the right threshold for error correction, which leads to the large memory requirement. If it gets a nice k-mer plot, memory won't be a problem, as hifiasm is able to identify the right threshold. But I'd recommend that you first check why the k-mer plot is weird.