chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Ultra Long integration failed: no output for UL kmer counting #643

Closed (mvolar closed 4 months ago)

mvolar commented 5 months ago

Hello,

using hifiasm:

./hifiasm/hifiasm -o confusum_ont_assembly.asm -t8 -l0 --ul confusum_ont_20k_filt.fasta confusum_hifi.fasta 2> logging.log

I get the attached log. We have previously run hifiasm successfully on several PacBio-only datasets for genomes of similar size. For this species we decided to use UL integration, since we have the data. However, the assembly fails at the UL integration step.

The worrying part is the end of the log, where it looks like the UL reads were never read into the assembly process: hifiasm finds the correct number of reads (~270k) but no bases:

[M::ul_realignment::] ==> starting UL
[M::ha_opt_update_cov] updated max_n_chain to 250
[M::gen_cov_track::] # bases: 0
[M::ha_ct_shrink::13963.061*7.57] ==> counted 0 distinct minimizer k-mers
[M::ha_ft_ul_gen::13963.279*7.57@30.716GB] ==> filtered out 0 k-mers occurring 32 or more times
[M::yak_count] collected 0 minimizers
[M::ha_pt_ul_gen::13963.280*7.57] ==> counted 0 distinct minimizer k-mers
[M::ha_ct_shrink::13963.282*7.57] ==> counted 0 distinct minimizer k-mers
[M::yak_count] collected 0 minimizers
[M::ha_pt_ul_gen::13963.283*7.57] ==> indexed 0 positions
[M::uidx_l_build] Index has been built.
[M::uidx_write] Index has been written.
[M::ha_opt_update_cov] updated max_n_chain to 250
[M::cal_graph_ovlp_binning::0.003] ==> Qualification
[M::write_emask_t] Index has been written.
[M::rescall_ul_pipeline::99.804] ==> Qualification
[M::rescall_ul_pipeline::] ==> # reads: 273036, # bases: 0, # fully corrected reads: 0
[M::write_all_ul_t] Index has been written.
[M::print_integert_ovlp_stat::] # UL reads::0, # UL ovlps::0
[M::print_integert_ovlp_stat::] # UL reads::0, # UL ovlps::0
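
The symptom in this log (a non-zero read count with zero bases) can be checked for automatically. The following is a minimal sketch, not part of hifiasm: the function name `ul_reads_without_bases` is hypothetical, and the regular expression assumes the summary line keeps the exact wording shown in the log above.

```python
import re

# Assumed format, copied from the log above:
# [M::rescall_ul_pipeline::] ==> # reads: 273036, # bases: 0, # fully corrected reads: 0
SUMMARY = re.compile(r"# reads: (\d+), # bases: (\d+)")


def ul_reads_without_bases(log_text):
    """Return (reads, bases) for the first summary line with reads > 0 and bases == 0.

    Returns None if no such line is found.
    """
    for line in log_text.splitlines():
        match = SUMMARY.search(line)
        if match:
            reads, bases = int(match.group(1)), int(match.group(2))
            if reads > 0 and bases == 0:
                return reads, bases
    return None
```

Calling it on the contents of `logging.log` would flag this run before the empty UL index is used any further.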

However, the reads appear to be normal FASTA reads, and the output of seqtk comp is here:

seqtk comp confusum_ont_20k_filt.fasta  | head -n 20
02de5154-8d45-4439-ad98-c844f90f13f0    26601   5941    7754    6031    6875    0       0       0       4028    0      0                  0
03411eee-471c-4df5-8e9e-00fb18efaa22    23627   5620    6925    5220    5862    0       0       0       3356    0      0                  0
0414efcc-a0f9-4b1e-918a-c164abdf06ad    21522   5166    6220    4537    5599    0       0       0       2816    0      0                  0
06ad0f02-2edd-479b-8710-9311be09988c    31979   8114    8971    6872    8022    0       0       0       4034    0      0                  0
00d2703f-3cda-4570-a867-89571861f9dc    21054   4138    6250    5195    5471    0       0       0       3382    0      0                  0
010365e0-021c-4baf-8adf-12783dac010c    20623   4329    5839    5201    5254    0       0       0       3214    0      0                  0
05198ecc-f4db-42b0-9b22-86aca5ab3fbb    22125   5776    6155    4862    5332    0       0       0       3046    0      0                  0
07a9ca65-0065-4121-ac31-b3c099a57172    29076   6258    7725    7254    7839    0       0       0       3772    0      0                  0
02b70eaa-ce92-4c22-993c-e0f2ebcba99d    35233   8174    10491   7832    8736    0       0       0       5360    0      0                  0
042d2fe8-60fe-4dde-ab5c-2fc0334e6a33    26699   5787    7210    6281    7421    0       0       0       3162    0      0                  0
042ebcd8-e193-498b-b410-2223aa90b5b3    23449   5363    6587    5355    6144    0       0       0       3246    0      0                  0
02c0588d-6656-440b-a953-1b0672a234b8    32398   7193    9410    7381    8414    0       0       0       4746    0      0                  0
0645f4d3-6d26-4850-9ad2-935b4e75da50    21607   4940    6345    4818    5504    0       0       0       3344    0      0                  0
0b5d35a4-ff29-425c-9464-11fac55272ef    44703   10039   13318   10030   11316   0       0    0       7062    0      0                  0
11dec1f4-627b-4cb2-aefa-f675f23e46c3    25511   6299    7903    4159    7150    0       0       0       1226    0      0                  0
1430b9c5-309b-4a13-8406-bcf6e616f8ef    25135   5454    7368    5665    6648    0       0       0       3580    0      0                  0
15612f69-5a01-41d1-b632-874654b371cb    24738   5892    6547    5217    7082    0       0       0       2030    0      0                  0
159f327b-0eaa-4b48-84d7-d66153ef8a5b    20282   4463    5568    4943    5308    0       0       0       2460    0      0                  0
159633c9-169e-40b3-a865-e5b179b5179c    34996   9152    9565    5823    10456   0       0       0       1716    0      0                  0
143c8c50-cbf9-4740-9e93-bd3f23bd0d90    20903   4853    4986    4750    6314    0       0       0       1490    0      0                  0

Logging of the assembly process:

logging.log

chhylp123 commented 4 months ago

@mvolar Sorry for the late reply. Something might be wrong with confusum_ont_20k_filt.fasta, since hifiasm cannot see any bases in it. In addition, judging by the k-mer plot, the HiFi reads do not look good either. It would be better to double-check both of them.
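
One way to double-check a FASTA file is to stream through all of it and count records and bases, rather than inspecting only the first few reads. The sketch below is an illustration under assumptions: the thread does not say what was actually wrong with the file, so the red flags it reports (empty records, characters other than A/C/G/T/N, Windows line endings) are generic guesses, and `fasta_stats` is a hypothetical helper name.

```python
def fasta_stats(path):
    """Stream a FASTA file and return record/base counts plus simple red flags."""
    stats = {"records": 0, "bases": 0, "empty_records": 0, "non_acgtn": 0, "cr_lines": 0}
    current_len = None  # length of the record being read; None before the first header
    with open(path, "rb") as fh:
        for raw in fh:
            if raw.endswith(b"\r\n"):
                stats["cr_lines"] += 1  # Windows-style line ending
            line = raw.strip()
            if not line:
                continue
            if line.startswith(b">"):
                if current_len == 0:
                    stats["empty_records"] += 1  # previous header had no sequence
                stats["records"] += 1
                current_len = 0
            elif current_len is not None:
                current_len += len(line)
                stats["bases"] += len(line)
                # Whatever remains after deleting A/C/G/T/N is unexpected.
                stats["non_acgtn"] += len(line.upper().translate(None, b"ACGTN"))
    if current_len == 0:
        stats["empty_records"] += 1  # last header had no sequence
    return stats
```

If `records` is close to the read count hifiasm reports but `bases` is zero or far too small, the file itself is the likely culprit.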

mvolar commented 4 months ago

Yeah, it was the FASTA file. Frankly, I don't know what happened to it; I redownloaded it and everything worked out fine, producing a decent N50 in the assembly. UL integration did join two chromosomal arms across the centromere into a single contig, producing a chromosome 90 Mb in size (our whole genome is about 300 Mb), but everything else worked fine.
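
Since a fresh download fixed the problem, comparing a checksum of the local file against one published by the data source can catch a damaged copy before a long assembly run. This is a minimal sketch that assumes the source provides an MD5 checksum, which the thread does not confirm; `file_md5` is a hypothetical helper name.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned digest can then be compared with the published value; a mismatch means the file should be downloaded again.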

chhylp123 commented 4 months ago

That's great!