chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

ONT read integration #483

Closed by nickgladman 1 year ago

nickgladman commented 1 year ago

Hello, I am having a problem with ONT reads; this might be related to Issue #474. The program runs fine when using only the HiFi reads. My script is basically identical to the one in #474:

hifiasm \
    -o tx3362.asm \
    -t 32 \
    -l 0 \
    -z 20 \
    --ul /path/ont_merged.fastq.gz \
    /path/hifi_reads.fastq.gz

I'm allocating 96 GB of memory, but the program is currently not using more than 70 GB. It stalls after the indexing step and has been running for ~19 hours (wall clock) since index creation.

Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
[M::ul_load::] ==> UL
[M::ha_opt_update_cov] updated max_n_chain to 335
[M::append_inexact_edges] # inserted inexact edges: 215112
[M::gen_cov_track::] # bases: 3147092976
[M::dedup_HiFis::] # unitigs: 171331, # edges: 556848, # cc_num: 1836557
[M::ha_ct_shrink::9624.602*28.15] ==> counted 541557 distinct minimizer k-mers
[M::ha_ft_ul_gen::9625.549*28.14@51.275GB] ==> filtered out 541557 k-mers occurring 335 or more times
[M::yak_count] collected 337287439 minimizers
[M::ha_pt_ul_gen::9645.611*28.13] ==> counted 39985092 distinct minimizer k-mers
[M::ha_ct_shrink::9645.650*28.13] ==> counted 39985092 distinct minimizer k-mers
[M::yak_count] collected 337287439 minimizers
[M::ha_pt_ul_gen::9669.360*28.11] ==> indexed 337287439 positions
[M::uidx_l_build] Index has been built.
[M::uidx_write] Index has been written.
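For anyone trying to judge from a log like this how far a run has progressed, here is a minimal Python sketch that pulls the numbers out of the bracketed prefixes. It assumes (this is not stated in the thread) that a prefix such as `9625.549*28.14@51.275GB` means elapsed wall-clock seconds, the CPU-to-wall-time ratio, and peak memory. The names `parse_stamp` and `Stamp` are made up for this illustration and are not part of hifiasm.

```python
import re
from typing import NamedTuple, Optional

# Matches prefixes like [M::ha_ft_ul_gen::9625.549*28.14@51.275GB];
# the @...GB part is optional and absent on most lines.
STAMP = re.compile(r"\[M::(\w+)::(\d+\.\d+)\*(\d+\.\d+)(?:@(\d+\.\d+)GB)?\]")


class Stamp(NamedTuple):
    step: str                 # name of the reporting step
    wall_seconds: float       # ASSUMED: elapsed wall-clock time in seconds
    cpu_ratio: float          # ASSUMED: CPU time divided by wall-clock time
    peak_gb: Optional[float]  # ASSUMED: peak memory in GB, if reported


def parse_stamp(line: str) -> Optional[Stamp]:
    """Return the timing fields of one log line, or None if it has none."""
    m = STAMP.search(line)
    if m is None:
        return None
    step, wall, ratio, mem = m.groups()
    return Stamp(step, float(wall), float(ratio), float(mem) if mem else None)


line = "[M::ha_pt_ul_gen::9669.360*28.11] ==> indexed 337287439 positions"
stamp = parse_stamp(line)
if stamp is not None:
    print(f"{stamp.step}: {stamp.wall_seconds / 3600:.2f} h elapsed, "
          f"CPU ratio {stamp.cpu_ratio}")
```

Under that reading, the last stamped line above falls a little under three hours into the run, and lines without a timing prefix (for example `[M::yak_count]`) are simply skipped.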

icemduru commented 1 year ago

In my experience, the --ul option significantly increases the run time. For example, I ran a job on a relatively small genome (450 Mb) with the --ul option, and it took about 27 hours to finish. Without the --ul option, the same job took only 5 hours.

nickgladman commented 1 year ago

Ah thanks. I will let it spool and update once the job ends.

nickgladman commented 1 year ago

Hello. Just updating that the run completed without issue. It took ~150 compute days. Genome was ~ 700 Mb.

chhylp123 commented 1 year ago

> Hello. Just updating that the run completed without issue. It took ~150 compute days. Genome was ~ 700 Mb.

Thanks a lot. But 150 compute days seems too slow. How much coverage do you have?

nickgladman commented 1 year ago

Using averaged read length:
Estimated coverage for PacBio = ~28x
Estimated coverage for ONT = >100x (but using median length is ~70x)
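The gap between the mean-based and median-based ONT estimates can be illustrated with a small Python sketch. It assumes coverage is approximated as read count times a summary read length divided by genome size. The helper name `estimate_coverage` and the toy read lengths are hypothetical and are not taken from this dataset.

```python
from statistics import mean, median


def estimate_coverage(read_lengths, genome_size, summary=mean):
    """Approximate depth as read count x summary read length / genome size."""
    return len(read_lengths) * summary(read_lengths) / genome_size


# HYPOTHETICAL toy data: a few short reads plus one very long read,
# mimicking the long right tail typical of ONT length distributions.
read_lengths = [5_000, 8_000, 10_000, 12_000, 65_000]
genome_size = 10_000  # toy genome size in bases, not the ~700 Mb genome here

by_mean = estimate_coverage(read_lengths, genome_size)
by_median = estimate_coverage(read_lengths, genome_size, summary=median)
print(f"mean-based: {by_mean:.1f}x, median-based: {by_median:.1f}x")
```

Read count times mean length is exactly the total number of sequenced bases, so the mean-based figure is the one that matches total bases divided by genome size. The median-based figure comes out lower whenever a minority of very long reads carries a large share of the bases.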

chhylp123 commented 1 year ago

So the total CPU hours are 3600?

nickgladman commented 1 year ago

Yes. I am also assuming that I did something wrong, but so far the outputs look like what I'd expect based on QC and compared to other assemblies.
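The conversion confirmed here is simple arithmetic, sketched below in Python. It assumes "compute days" means CPU time summed over all threads, and the wall-clock figure further assumes all 32 threads from the `-t 32` setting stay fully busy, so it is only a lower bound. The function names are hypothetical.

```python
HOURS_PER_DAY = 24


def cpu_hours(compute_days: float) -> float:
    """Convert CPU days (summed over all threads) to CPU hours."""
    return compute_days * HOURS_PER_DAY


def min_wall_hours(total_cpu_hours: float, threads: int) -> float:
    """Lower bound on wall-clock hours, ASSUMING every thread is always busy."""
    return total_cpu_hours / threads


total = cpu_hours(150)            # ~150 compute days reported in the thread
wall = min_wall_hours(total, 32)  # -t 32 from the command above
print(f"{total:.0f} CPU hours, at least {wall:.1f} wall-clock hours")
```

The actual wall-clock time would be longer than this bound whenever fewer than 32 threads are active on average.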

chhylp123 commented 1 year ago

Thanks a lot. The run was probably slow because the UL step takes a long time at such high coverage.