chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

How to improve the QV of the hifiasm assembly? #513

Open zmz1988 opened 1 year ago

zmz1988 commented 1 year ago

Dear prof. Heng Li,

Thank you so much for the tool! I have been using hifiasm since 2021 and am a big fan of it. I recently ran into different results from different versions of hifiasm, and would like to ask your opinion on solving the problem, if you don't mind. My situation is as follows:

Recently I assembled all of my old data with the new hifiasm v0.19.6. The data include HiFi reads (~25x) plus Nanopore reads (50k or 30k length, >50x). Then I compared the resulting assemblies with my old assemblies produced by hifiasm v0.16.5 (HiFi reads only).

I found that the QVs (k-mer-based QV calculated by Merqury, with the k-mer database built from the HiFi reads) are much lower in the new assemblies (e.g. 60 in the old vs. 45 in the new), despite the superb NG50 values of the new assemblies. The completeness (k-mer-based, calculated by Merqury) is also relatively low (e.g. 98% in the old vs. 96% in the new; some new ones even have a completeness of 75%).
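For reference, the gap between QV 60 and QV 45 can be made concrete with a minimal sketch of the k-mer-based QV formula as described in the Merqury publication: the per-base probability of being correct is estimated as (1 - assembly-only k-mers / total assembly k-mers)^(1/k), and QV = -10 * log10(1 - P). The function names, the k-mer size of 21, and the k-mer counts below are assumptions for illustration only; none of them come from this thread.

```python
import math


def merqury_style_qv(asm_only_kmers: int, total_asm_kmers: int, k: int = 21) -> float:
    """Estimate a Phred-scaled consensus QV from k-mer counts.

    asm_only_kmers: k-mers found in the assembly but not in the read set.
    total_asm_kmers: all k-mers found in the assembly.
    k: k-mer size (21 is an assumed default, not taken from the thread).
    """
    if total_asm_kmers <= 0:
        raise ValueError("total_asm_kmers must be positive")
    if not 0 <= asm_only_kmers <= total_asm_kmers:
        raise ValueError("asm_only_kmers must be between 0 and total_asm_kmers")
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers, so no detectable errors
    p_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - p_correct
    return -10 * math.log10(error_rate)


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled QV back to a per-base error rate."""
    return 10 ** (-qv / 10)


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the scale of the values.
    print(round(merqury_style_qv(2_100, 100_000_000), 2))
    # QV 60 vs. QV 45, the two example values quoted above.
    print(qv_to_error_rate(60), qv_to_error_rate(45))
```

Read this way, QV 60 corresponds to roughly one error per million bases and QV 45 to roughly one error per 32,000 bases. Because the k-mer database here is built from HiFi reads only, any assembly k-mer supported solely by Nanopore reads counts as an error in this estimate.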

I'm not sure why this happens, as the data for both trials are the same. I mean, some parts of the new assemblies come from Nanopore reads, which may cause inconsistency with the HiFi reads and lower the QV. But I don't know how to explain the lower completeness (especially the 70% ones) in the new assemblies.

I would prefer to use the new assemblies from hifiasm v0.19.6 for our publication, because they are gapless from telomere to telomere. So my question is: is there a way to improve the QV? Would you recommend two rounds of racon with HiFi reads to polish the assemblies, even though I know that polishing the final assemblies from hifiasm is not recommended?

I'm very sorry for writing such a long post. Thank you very much in advance!

Some more info of the assemblies:
Species: Arabidopsis thaliana
Code: hifiasm -o ${sample} -t 10 --ul ${sample}_ont_50k_reads.fq.gz --ul-rate 0.2 --primary ${hifi_fastq_file} &> ${sample}_hifiasm.log

chhylp123 commented 1 year ago

As the ONT reads have higher coverage, the lower QV values of the new hybrid assemblies might be caused by some ONT-only regions. Could you please try polishing these new assemblies and see whether the completeness improves?

zmz1988 commented 1 year ago

Thanks a lot for replying so fast! I found an error in my completeness calculation: the 70% completeness was due to the wrong file being used to build the k-mer database, and everything has now been corrected.

So most of the assemblies produced by hifiasm have 94-98% completeness and a QV of around 40-53, as calculated by Merqury. After one round of HiFi polishing with racon, the completeness increased by at most 0.1, and some assemblies even have slightly lower completeness. The situation is more or less the same for QV. For BUSCO, most of the complete values are above 99%, except for two assemblies with complete values of around 92% or 97%. The polishing step doesn't improve the BUSCO values either.

I checked the coverage of the assemblies with relatively low BUSCO values, and noticed that the HiFi coverage (16x and 22x) is pretty low, while the Nanopore coverage is OK (31x and 60x). So I guess the very low HiFi coverage for these two assemblies is probably the reason for the low completeness?
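Coverage figures like these are usually obtained by dividing the total number of sequenced bases by the genome size. A minimal sketch of that arithmetic follows; the function name, the assumed genome size of about 135 Mb for Arabidopsis thaliana, and the read totals are illustrative assumptions, not values reported in this thread.

```python
def estimate_coverage(total_read_bases: int, genome_size_bp: int) -> float:
    """Return the average sequencing depth as total bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    if total_read_bases < 0:
        raise ValueError("total_read_bases must be non-negative")
    return total_read_bases / genome_size_bp


# Assumption: an approximate genome size for Arabidopsis thaliana.
ASSUMED_GENOME_SIZE_BP = 135_000_000

if __name__ == "__main__":
    # Hypothetical total of HiFi bases for one sample.
    hifi_bases = 2_160_000_000
    print(f"{estimate_coverage(hifi_bases, ASSUMED_GENOME_SIZE_BP):.1f}x")  # 16.0x
```

Under these assumptions, a sample with about 2.16 Gb of HiFi data lands at 16x, which is the lower of the two HiFi coverages mentioned above.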

chhylp123 commented 1 year ago

Yes, I guess so.