chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Is a partial contig with half-reduced CCS read coverage misassembled? No, on the contrary, hifiasm has assembled very long and novel tandem repeat sequences #50

Open mydongan opened 3 years ago

mydongan commented 3 years ago

Hi chhylp123, I have sequenced a diploid genome (repeat content >70%) with 25X coverage of HiFi reads. Luckily, I got wonderful contigs with an N50 of 44 Mb using hifiasm 0.12.

Then I anchored the contigs to chromosomes with ALLMAPS using ~1500 high-quality genetic markers. Finally, I obtained 10 pseudomolecules. However, I found a 20-Mb region in chr7 that is supported neither by genetic markers nor by synteny with related species.

Furthermore, I mapped the CCS reads to the final assembly and found that this 20-Mb region has about half the read coverage of the rest of the genome.
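A minimal sketch of how such a coverage check could be scripted, assuming per-base depth in the three-column, tab-separated format written by `samtools depth -a` (chromosome, 1-based position, depth). The function names, the window size and the 0.35 to 0.65 band used to call "about half" are illustrative assumptions, not part of hifiasm or samtools.

```python
from collections import defaultdict
from statistics import median


def window_depth(lines, window=1_000_000):
    """Mean depth per (chrom, window index) from `samtools depth -a`-style lines."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        chrom, pos, depth = line.rstrip("\n").split("\t")
        key = (chrom, (int(pos) - 1) // window)  # positions are 1-based
        sums[key] += int(depth)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def half_depth_windows(means, low=0.35, high=0.65):
    """Windows whose mean depth is roughly half of the median window depth.

    The low/high band is an arbitrary illustrative choice.
    """
    reference = median(means.values())
    return sorted(
        key for key, mean in means.items()
        if low * reference <= mean <= high * reference
    )


if __name__ == "__main__":
    # Toy input: 40 positions, the last 10 at roughly half depth.
    toy = [f"chr7\t{i}\t{25 if i <= 30 else 12}" for i in range(1, 41)]
    print(half_depth_windows(window_depth(toy, window=10)))
```

On real data the lines would come from a depth file rather than the toy list, and windows in the flagged band would then be compared against the 20-Mb interval on chr7.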

Meanwhile, I also mapped RNA-seq reads to the genome, and no reads covered this region. So I think this 20-Mb region may be misassembled.

However, this 20-Mb region is located in a single contig (108 Mb) that was constructed from several utgs (the two terminal utgs, utg000064l and utg000017l, are 29 Mb and 47 Mb long, respectively), and there is no obvious evidence to support breaking this contig.

Therefore, I am wondering whether there are other possible explanations for this assembly. Have you ever seen assembly regions covered at half the read depth before? Could it be high heterozygosity across the 20 Mb?

Thanks!

Dong An

chhylp123 commented 3 years ago

Could you please zoom in on the utg graph around this 20-Mb region? I'd like to see what the subgraph looks like. Also, could you please show the following numbers from the hifiasm log?

[M::ha_pt_gen] peak_hom: []; peak_het: []
[M::purge_dups] purge duplication coverage threshold: []

mydongan commented 3 years ago

Thanks! I aligned a 5-Mb sequence from the 20-Mb region to all utg FASTA sequences and found that it mapped to utg000017l (47 Mb).

The relevant lines from the hifiasm log are:
[M::ha_pt_gen] peak_hom: 25; peak_het: -1
[M::purge_dups] purge duplication coverage threshold: 31
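For anyone collecting these numbers from many runs, here is a small sketch that pulls them out of a log. It assumes only the two line formats quoted in this thread; the function name and the dictionary keys are made up for illustration. A `peak_het` of -1 presumably means that no separate heterozygous peak was found, which would fit a nearly homozygous sample, but that reading is an assumption here.

```python
import re

# Patterns follow the two log lines quoted in this thread.
PEAK_RE = re.compile(r"\[M::ha_pt_gen\] peak_hom: (-?\d+); peak_het: (-?\d+)")
PURGE_RE = re.compile(
    r"\[M::purge_dups\] purge duplication coverage threshold: (-?\d+)"
)


def parse_hifiasm_peaks(log_text):
    """Return the coverage peaks and purge threshold found in a hifiasm log."""
    result = {}
    peaks = PEAK_RE.search(log_text)
    if peaks:
        result["peak_hom"] = int(peaks.group(1))
        result["peak_het"] = int(peaks.group(2))
    purge = PURGE_RE.search(log_text)
    if purge:
        result["purge_threshold"] = int(purge.group(1))
    return result


if __name__ == "__main__":
    log = (
        "[M::ha_pt_gen] peak_hom: 25; peak_het: -1\n"
        "[M::purge_dups] purge duplication coverage threshold: 31\n"
    )
    print(parse_hifiasm_peaks(log))
```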

lh3 commented 3 years ago

Based on the mapping of genetic markers, can you assign this 20Mb to other chromosomes?

mydongan commented 3 years ago

Thank you, Dr. Li! Strangely, this 20-Mb region does not contain any genetic markers.

lh3 commented 3 years ago

A few more things to try:

mydongan commented 3 years ago

Thank you very much for your suggestions!

lh3 commented 3 years ago
chhylp123 commented 3 years ago
mydongan commented 3 years ago

Thanks all!

Yes, it is an inbred haploid; heterozygosity was 0.232% in my survey analysis, and I assembled the genome using "-l0".

After repeat annotation, 85% of this region was annotated as the 180-bp knob repeat, a specific tandem repeat in plants. This region had therefore not been assembled in previous studies, which shows that HiFi reads and hifiasm are very efficient and accurate for assembling long tandem repeats. Thank you all again! Furthermore, I ran a nucmer alignment of utg000017l against itself, and we can also see that the terminal 11 Mb are tandem repeats.
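As a rough cross-check of the repeat unit length that needs no annotation library, the period of a tandem array can be estimated from the spacing between repeated k-mers. This is a simplified sketch, not what nucmer or the repeat annotation tools do; the function name and the choice of k=12 are assumptions, and real knob arrays with diverged copies would give a noisier distance histogram.

```python
import random
from collections import Counter


def estimate_period(seq, k=12):
    """Most common distance between consecutive occurrences of the same k-mer.

    In a clean tandem array this distance equals the repeat unit length.
    Returns None if no k-mer occurs twice.
    """
    last_seen = {}
    distances = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            distances[i - last_seen[kmer]] += 1
        last_seen[kmer] = i
    if not distances:
        return None
    return distances.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy array: a random 180-bp unit repeated 20 times (not a real knob sequence).
    rng = random.Random(0)
    unit = "".join(rng.choices("ACGT", k=180))
    print(estimate_period(unit * 20))
```

A sequence window taken from the terminal 11 Mb of utg000017l could be passed in the same way.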

However, I still do not understand why the CCS read coverage is reduced by half in this region.

lh3 commented 3 years ago

As someone was referring to this issue, I have reread the thread. I am seeing:

If this description is right, this is not a contig misassembly. You have an inbred diploid genome. One possibility is that this region is diverged between the two haplotypes although the rest of the genome is nearly homozygous. The solution is to remove the diverged copy from the primary assembly. By the way, when you scaffolded the contigs, have you discarded prefix.a_ctg.gfa?

mydongan commented 3 years ago

Maybe you are right: this half-coverage repeat region may have diverged rapidly between the two haplotypes. Yes, I only used prefix.p_ctg.gfa for further assembly.