chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

results without haplotypes #450

Open ChaeheeLee opened 1 year ago

ChaeheeLee commented 1 year ago

Hello!

First of all, thank you for the development and continuous maintenance of this essential tool.

I am assembling plant genomes with ~65% repeat content using ~40× coverage HiFi data. I would like some help with runs that produce assembly results without haplotypes after I changed some parameters. Here is an example of the command I used:

hifiasm -o sample1.v19_5_l3 -t 32 -l 3 --h-cov=200 --b-cov=2 --m-rate=0.75 sample1.ccs.bam

I used --h-cov and --b-cov to minimize misjoins, which I think might be beneficial for (reference-guided) scaffolding later.

The issue is that I get only the primary assembly for most samples, although for a few samples I get the primary, hap1, and hap2 assemblies.

Here is the end of the log file for a run without any haplotypes:

Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
[M::purge_dups] homozygous read coverage threshold: 40
[M::purge_dups] purge duplication coverage threshold: 51
[M::ug_ext_gfa::] # tips::30
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] homozygous read coverage threshold: 40
[M::purge_dups] purge duplication coverage threshold: 51
[M::mc_solve:: # edges: 356]
[M::mc_solve_core_adv::0.025] ==> Partition
[M::adjust_utg_by_primary] primary contig coverage range: [34, infinity]
[M::break_ug_contig] break potential misassemblies with <20-fold coverage
[M::break_ug_contig] break potential misassemblies with >200-fold coverage
Writing sample1.v19_5_l3.bp.p_ctg.gfa to disk...

And here is the one for a run that produced both haplotypes:

Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
[M::purge_dups] homozygous read coverage threshold: 28
[M::purge_dups] purge duplication coverage threshold: 36
[M::ug_ext_gfa::] # tips::64
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] homozygous read coverage threshold: 28
[M::purge_dups] purge duplication coverage threshold: 36
[M::mc_solve:: # edges: 730]
[M::mc_solve_core_adv::0.063] ==> Partition
[M::adjust_utg_by_primary] primary contig coverage range: [23, infinity]
[M::break_ug_contig] break potential misassemblies with <2-fold coverage
[M::break_ug_contig] break potential misassemblies with >200-fold coverage
Writing sample2.v19_5_l3.bp.p_ctg.gfa to disk...
[M::reduce_hamming_error_adv::0.259] # inserted edges: 4666, # fixed bubbles: 82
[M::adjust_utg_by_trio] primary contig coverage range: [23, infinity]
[M::recall_arcs] # transitive arcs::106
[M::recall_arcs] # new arcs::40376, # old arcs::24834
[M::clean_trio_untig_graph] # adjusted arcs::0
[M::break_ug_contig] break potential misassemblies with <2-fold coverage
[M::break_ug_contig] break potential misassemblies with >200-fold coverage
[M::adjust_utg_by_trio] primary contig coverage range: [23, infinity]
[M::recall_arcs] # transitive arcs::76
[M::recall_arcs] # new arcs::40334, # old arcs::24626
[M::clean_trio_untig_graph] # adjusted arcs::0
[M::break_ug_contig] break potential misassemblies with <2-fold coverage
[M::break_ug_contig] break potential misassemblies with >200-fold coverage
[M::output_trio_graph_joint] dedup_base::3603330, miss_base::0
Writing sample2.v19_5_l3.bp.hap1.p_ctg.gfa to disk...
Writing sample2.v19_5_l3.bp.hap2.p_ctg.gfa to disk...
Inconsistency threshold for low-quality regions in BED files: 70%
[M::main] Version: 0.19.5-r587
[M::main] CMD: hifiasm -o sample2.v19_5_l3 -t 32 -l 3 --h-cov=200 --b-cov=2 --m-rate=0.75 ../sample2.ccs.fq.gz
[M::main] Real time: 4222.222 sec; CPU: 118467.902 sec; Peak RSS: 20.844 GB

Could you help me understand why I do not get both haplotypes in many cases?

Thank you so much in advance,

chhylp123 commented 1 year ago

Could you please show me the whole log files? Thanks in advance.

ChaeheeLee commented 1 year ago

Thank you so much for the prompt response! I have attached the log file.

sample1.v19_5_l3_hap.log

chhylp123 commented 1 year ago

Thanks. Looking at this log file, it seems hifiasm did not run to completion. The last line should look like this:

[M::main] Real time: 4222.222 sec; CPU: 118467.902 sec; Peak RSS: 20.844 GB
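When several samples are affected, this check can be automated. The following is a minimal Python sketch, assuming only what is stated above: a finished run ends with a `[M::main] Real time:` line. The function name `hifiasm_run_completed` is hypothetical and is not part of hifiasm.

```python
def hifiasm_run_completed(log_text: str) -> bool:
    """Return True if the last non-empty log line is the final summary line.

    Assumption (from this thread): a finished hifiasm run ends with a line
    starting with "[M::main] Real time:". Anything else is treated as a run
    that stopped early.
    """
    # Ignore blank lines so a trailing newline does not hide the last message.
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    return bool(lines) and lines[-1].startswith("[M::main] Real time:")


# Example usage with a log file on disk (the path is a placeholder):
# with open("sample1.v19_5_l3_hap.log") as fh:
#     print(hifiasm_run_completed(fh.read()))
```

A log that ends at `Writing sample1.v19_5_l3.bp.p_ctg.gfa to disk...` would be reported as not completed.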

ChaeheeLee commented 1 year ago

I agree with you! Do you have any idea why it stopped at that stage? It stops at the same point for all the other samples with this issue.

In one case, when I changed --b-cov to 2, the end of the log file looked like this:

free(): invalid next size (normal)

chhylp123 commented 1 year ago

I see. As you have the bin files, could you please rerun hifiasm with the existing bin files but without --h-cov=200 --b-cov=2 --m-rate=0.75? This will help determine whether these 3 options matter. I can investigate it if you could share the bin files of one sample with me.
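For anyone scripting this rerun across many samples, here is a small Python sketch that derives the rerun command from the original one by dropping the three options. It assumes the options are written in the `--name=value` form used in this thread, and that keeping the same `-o` prefix lets hifiasm find the existing bin files. The helper name `drop_long_options` is hypothetical.

```python
import shlex
from typing import Iterable


def drop_long_options(cmd: str, names: Iterable[str]) -> str:
    """Remove `--name=value` options from a shell command string.

    Only the `--name=value` spelling is handled; with `--name value`
    the value token would be left behind.
    """
    to_drop = set(names)
    kept = []
    for token in shlex.split(cmd):
        # "--h-cov=200" -> "--h-cov"; tokens without "=" are unchanged.
        option_name = token.split("=", 1)[0]
        if option_name not in to_drop:
            kept.append(token)
    # Re-quote each token so the result is safe to paste into a shell.
    return " ".join(shlex.quote(t) for t in kept)


original = ("hifiasm -o sample1.v19_5_l3 -t 32 -l 3 "
            "--h-cov=200 --b-cov=2 --m-rate=0.75 sample1.ccs.bam")
print(drop_long_options(original, ["--h-cov", "--b-cov", "--m-rate"]))
```

The output prefix and input file are left untouched, so only the three options under suspicion change between the two runs.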

ChaeheeLee commented 1 year ago

I reran it following your suggestion, and now I see all the files we expect, including hap1 and hap2.

Below is the log file.

Reads has been loaded.
Loading ma_hit_ts from disk...
ma_hit_ts has been read.
Loading ma_hit_ts from disk...
ma_hit_ts has been read.
[M::ha_assemble::27.915*0.79] ==> loaded corrected reads and overlaps from disk
[M::ha_opt_update_cov_min] updated max_n_chain to 205
[M::purge_dups] homozygous read coverage threshold: 40
[M::purge_dups] purge duplication coverage threshold: 51
[M::ug_ext_gfa::] # tips::30
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] homozygous read coverage threshold: 40
[M::purge_dups] purge duplication coverage threshold: 51
[M::mc_solve:: # edges: 356]
[M::mc_solve_core_adv::0.031] ==> Partition
[M::adjust_utg_by_primary] primary contig coverage range: [34, infinity]
Writing sample1.v19_5_l3.bp.p_ctg.gfa to disk...
[M::reduce_hamming_error_adv::0.413] # inserted edges: 4702, # fixed bubbles: 76
[M::adjust_utg_by_trio] primary contig coverage range: [34, infinity]
[M::recall_arcs] # transitive arcs::92
[M::recall_arcs] # new arcs::35346, # old arcs::19750
[M::clean_trio_untig_graph] # adjusted arcs::0
[M::adjust_utg_by_trio] primary contig coverage range: [34, infinity]
[M::recall_arcs] # transitive arcs::102
[M::recall_arcs] # new arcs::35542, # old arcs::19880
[M::clean_trio_untig_graph] # adjusted arcs::0
[M::output_trio_graph_joint] dedup_base::37079973, miss_base::0
Writing sample1.v19_5_l3.bp.hap1.p_ctg.gfa to disk...
Writing sample1.v19_5_l3.bp.hap2.p_ctg.gfa to disk...
Inconsistency threshold for low-quality regions in BED files: 70%
[M::main] Version: 0.19.5-r587
[M::main] CMD: hifiasm -o sample1.v19_5_l3 -t 32 -l 3 ../sample1.ccs.fq.gz
[M::main] Real time: 291.824 sec; CPU: 640.591 sec; Peak RSS: 11.693 GB

If you think the bin files are helpful to figure it out, please let me know your email, so I can share. They are large.

Thanks again!

chhylp123 commented 1 year ago

I see. This looks like a bug in these 3 options. Could you please share the bin files with me? Thanks in advance.

ChaeheeLee commented 1 year ago

Sure. Would you like all three bin files, or just a particular one? Thanks

chhylp123 commented 1 year ago

It would be better for me to get all three bin files.

ChaeheeLee commented 1 year ago

Okay, I will share them with you tonight! I guess I can share through your email address, hcheng@jimmy.harvard.edu? Thanks!

chhylp123 commented 1 year ago

Thanks. hcheng@ds.dfci.harvard.edu also works for me~