chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Hifiasm never stops running after creating bin-files #341

Open casparbein opened 1 year ago

casparbein commented 1 year ago

I have been using hifiasm on a simulated dataset of reads generated from an assembly that was itself produced by hifiasm. To test assembly performance, I simulated read sets with different read lengths (CCS reads between 1000 and 10000 bp) at an average coverage of 30X.

For some reason, with CCS read lengths longer than 2000 bp, hifiasm reaches a point where it keeps running (>24 h) after creating the bin files without producing any further output. I have observed this with several settings, allocating between 50 and 100 cores and 400 to 1000 GB of RAM. Any idea how to solve this? For shorter read lengths and/or lower coverage, I get an assembly from the same type of simulated read data in 3-4 hours.

Attached is the log file of a run with an average CCS read length of 2000 bp. The genome size is approximately 2.7 Gb, and my input FASTQ file contains around 33 million reads. I use hifiasm version 0.16.1-r375. The command I used was: hifiasm -o assembly/all_asm -t 50 fastq_files/all.fastq.gz

hifiasm_assembly--13509453.txt
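As a rough sanity check on the figures quoted above, expected coverage can be estimated as read count times mean read length divided by genome size. The sketch below uses only the rounded numbers from this post (about 33 million reads, 2000 bp average length, about 2.7 Gb genome); the function name `estimated_coverage` is made up for illustration.

```python
def estimated_coverage(n_reads, mean_read_len_bp, genome_size_bp):
    """Approximate sequencing depth: total sequenced bases / genome size."""
    return n_reads * mean_read_len_bp / genome_size_bp


# Rounded values taken from the post; the real inputs may differ slightly.
cov = estimated_coverage(33_000_000, 2_000, 2_700_000_000)
print(f"approximate coverage: {cov:.1f}X")
```

With these rounded inputs the estimate comes out near 24X rather than 30X, which may simply reflect the rounding of the read count and mean length.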

Note: this log file contains a stretch of error messages. These also occur in runs with simulated reads where I do get GFA output, so they are probably (?) not the cause of this issue. Also, my k-mer histogram does not look like it was sampled from a diploid assembly, although I simulated reads from two different haplotypes. Again, I do not think this is the cause of the problem, as I get very similar logs for assemblies that worked.

Thanks in advance

chhylp123 commented 1 year ago

Sorry for the late reply. I suspect your simulated dataset contains N bases, which may have confused hifiasm. It would be better if you could share the simulated dataset with us for debugging. Thank you in advance.
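One way to test the N hypothesis before sharing the data is to count how many reads in the FASTQ contain an N. The following is a minimal sketch, not part of hifiasm: it assumes the usual four-lines-per-record FASTQ layout (no wrapped sequences), and the function name `count_n_reads` is hypothetical.

```python
import gzip
import sys


def count_n_reads(lines):
    """Return (total_reads, reads_with_n) for FASTQ text given as lines.

    Assumes four lines per record, with the sequence on the second line.
    """
    total = 0
    with_n = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # sequence line of the current record
            total += 1
            if "N" in line.upper():
                with_n += 1
    return total, with_n


if __name__ == "__main__" and len(sys.argv) > 1:
    # Path given on the command line, e.g. fastq_files/all.fastq.gz
    path = sys.argv[1]
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        total, with_n = count_n_reads(handle)
    print(f"{with_n} of {total} reads contain at least one N")
```

If the count is non-zero, the reference used for simulation probably contains N stretches (for example scaffold gaps) that were carried into the simulated reads.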