chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Interchromosomal misjoin #651

Open Han-Cao opened 6 months ago

Han-Cao commented 6 months ago

Hi,

I am using hifiasm 0.19.8-r603 to assemble a human genome from HiFi-only data. After the assembly, I found an interchromosomal misjoin reported by paftools.js misjoin -c centromeres.bed -e align_to_chm13.paf.
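For readers who want a quick sanity check on the PAF file itself, here is a minimal Python sketch. It is not what paftools.js misjoin does internally; it only lists contigs whose long, confidently mapped alignments land on more than one reference chromosome. It assumes the standard 12 mandatory PAF columns, and the function name and the thresholds (min_aln_len, min_mapq) are hypothetical choices for illustration.

```python
from collections import defaultdict

def contigs_spanning_chromosomes(paf_lines, min_aln_len=1_000_000, min_mapq=5):
    """Return {contig: [targets]} for contigs with long alignments to >1 target.

    Assumes standard PAF columns: 1=query name, 3/4=query start/end,
    6=target name, 12=mapping quality. Thresholds are illustrative only.
    """
    targets = defaultdict(set)
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # skip malformed or truncated lines
        qname, tname = f[0], f[5]
        aln_len = int(f[3]) - int(f[2])  # aligned span on the contig
        mapq = int(f[11])
        if aln_len >= min_aln_len and mapq >= min_mapq:
            targets[qname].add(tname)
    # Keep only contigs whose long alignments hit more than one chromosome
    return {q: sorted(t) for q, t in targets.items() if len(t) > 1}
```

With a real file this could be called as contigs_spanning_chromosomes(open("align_to_chm13.paf")). A hit is only a candidate: short repeats are filtered by the length threshold, but true structural variants would also show up here.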

To troubleshoot the assembly, I aligned the HiFi reads to the assembly and checked the misjoin position in IGV. In the IGV view, the left part of the assembled sequence aligns to chr10 and the right part aligns to chrX.

Moreover, the region within the blue window has many mismatches and double the read depth. Does this mean the region is a fusion of two chromosomes? If I want to fix this manually, can I split the assembly into two pieces, one ending at the left blue line and the other starting at the right blue line, and drop the region between the blue lines?
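The manual split described above could be sketched in Python as follows. This is only an illustration of the proposed edit, not a recommendation from the hifiasm authors. The function name, the _1/_2 suffixes, and the 0-based half-open coordinates (left_end is the first dropped base, right_start is the first kept base of the second piece) are assumptions.

```python
def split_contig(name, seq, left_end, right_start):
    """Split one contig into two, dropping seq[left_end:right_start].

    Coordinates are 0-based and half-open (an assumption for this sketch):
    piece 1 is seq[:left_end], piece 2 is seq[right_start:].
    """
    if not 0 < left_end <= right_start < len(seq):
        raise ValueError("breakpoints must satisfy 0 < left_end <= right_start < len(seq)")
    return [
        (f"{name}_1", seq[:left_end]),     # part ending at the left boundary
        (f"{name}_2", seq[right_start:]),  # part starting at the right boundary
    ]
```

The returned (name, sequence) pairs can then be written back out as FASTA records in place of the original contig.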

Thank you!

image

Han-Cao commented 6 months ago

Hi,

I did more research on this case and realized that the interchromosomal misjoin could occur within the black window. I randomly selected one position within the black window and ran BLAT on the flanking 10 kb of sequence against CHM13. The BLAT result shows that the left 5 kb maps to chr10, while the right 5 kb maps to chrX.

Do you think this is the correct way to fix the misjoin? Also, would it be possible to avoid such errors during assembly?

Thank you!

image

kevfengler227 commented 4 months ago

Perhaps the read that was used to build this contig is not being shown. There should be a read that perfectly matches the assembly. Are you filtering secondary/supplementary alignments in IGV? Are you showing alignments with mapping quality 0?
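To make the filtering question concrete, here is a small Python sketch that reports why a given alignment record might be hidden. The flag bits come from the SAM specification (0x100 is secondary, 0x800 is supplementary). The function name and the default filter settings are assumptions and do not necessarily match any particular viewer configuration.

```python
SECONDARY = 0x100       # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment

def hidden_by_filters(flag, mapq, hide_secondary=True,
                      hide_supplementary=True, min_mapq=1):
    """Return the list of reasons an alignment would be filtered out.

    An empty list means the alignment passes all filters. The defaults
    are illustrative assumptions, not the settings of a specific tool.
    """
    reasons = []
    if hide_secondary and flag & SECONDARY:
        reasons.append("secondary")
    if hide_supplementary and flag & SUPPLEMENTARY:
        reasons.append("supplementary")
    if mapq < min_mapq:
        reasons.append("low_mapq")
    return reasons
```

Applied to the FLAG and MAPQ columns of the reads near the breakpoint, this shows whether the supporting read exists in the BAM but is simply not displayed.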

Another way to troubleshoot is to look into the asm.bp.p_ctg.noseq.gfa file, find the read that was used in the tiling path at this location, and see why it is not mapping here. I am guessing there is a bogus or repetitive read that was used to create this contig and is not being shown in the visualization.
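A rough Python sketch of that lookup is below. It assumes the A-line layout described in the hifiasm documentation (A, contig name, contig start of the read, strand, read name, read start, read end, then tags), and it approximates each read's span on the contig as contig start plus the used read length. The function name is hypothetical, and the span estimate is an approximation rather than an exact placement.

```python
def reads_at_position(gfa_lines, contig, pos):
    """List (read_name, contig_start, strand) for A-lines covering pos.

    Assumed A-line fields (tab-separated):
    A, contig, contig_start, strand, read_name, read_start, read_end, tags...
    The read is assumed to span contig_start .. contig_start + used length.
    """
    hits = []
    for line in gfa_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 7 or f[0] != "A" or f[1] != contig:
            continue  # not an A-line for the contig of interest
        ctg_start = int(f[2])
        used_len = int(f[6]) - int(f[5])  # portion of the read in the path
        if ctg_start <= pos < ctg_start + used_len:
            hits.append((f[4], ctg_start, f[3]))
    return hits
```

For example, reads_at_position(open("asm.bp.p_ctg.noseq.gfa"), contig_name, breakpoint) would return the candidate reads to inspect, where contig_name and breakpoint stand for the contig and coordinate seen in IGV.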