chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

chimeras in long SSRs #674

Open kevfengler227 opened 2 months ago

kevfengler227 commented 2 months ago

I am trying to figure out the best way to handle chimeric assembly errors that can occur in long SSRs in HiFi-only assemblies for high-throughput applications.

In the example below, there is a (TAC)n repeat that is ~11 kb in length in the contig assembly. However, it is a chimeric assembly with another long (TAC)n repeat elsewhere in the genome. I have extracted the reads from the noseq.gfa file that were used to build this contig and aligned the corresponding corrected reads to the assembly. The (TAC)n region is denoted in blue. None of the selected reads from the tiling path span the repeat. Also, there are a few consensus errors in the reads.
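For anyone who wants to reproduce the read extraction step, here is a minimal sketch of pulling the tiling-path reads for one contig out of a noseq.gfa file. It assumes the A-lines are tab-separated with the layout `A`, contig name, start coordinate on the contig, strand, read name, followed by further fields; that layout is an assumption and should be checked against the hifiasm output documentation. The function name `reads_for_contig` and the contig name in the usage note are hypothetical.

```python
def reads_for_contig(gfa_lines, contig):
    """Return sorted (contig_start, strand, read_name) tuples for one contig.

    Assumes tab-separated A-lines laid out as:
    A, contig name, start on contig, strand, read name, ...
    """
    reads = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip S-lines, L-lines, and A-lines that belong to other contigs.
        if len(fields) >= 5 and fields[0] == "A" and fields[1] == contig:
            reads.append((int(fields[2]), fields[3], fields[4]))
    return sorted(reads)
```

With a file on disk this could be called as `reads_for_contig(open("asm.bp.p_ctg.noseq.gfa"), "ptg000001l")`, where both the file name and the contig name are placeholders.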

I am guessing that increasing --b-cov, --h-cov, and --m-rate would have helped break this contig? If so, it would probably also have broken many other contigs that were OK.

I am wondering if there is a way to specifically address this type of issue? Either by ensuring that SSRs or homopolymers contain spanning reads, or by breaking them to prevent chimeras (e.g., break (TAC)n repeats >10 kb).
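To make the "span or break" idea concrete, here is a rough post-assembly sketch, not something hifiasm does. It finds perfect runs of a motif above a length threshold and reports the ones that no read alignment spans with some flanking sequence on both sides. The function names, the `flank` parameter, and the representation of alignments as `(start, end)` contig coordinates are all assumptions for illustration. A real check would also need to tolerate interruptions in the repeat (for example by using a dedicated tandem repeat finder) and to handle motif rotations and the reverse complement.

```python
import re


def find_long_ssrs(seq, motif="TAC", min_len=10_000):
    """Return half-open (start, end) intervals of perfect motif runs >= min_len bp."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    return [m.span() for m in pattern.finditer(seq.upper())
            if m.end() - m.start() >= min_len]


def unspanned(ssrs, alignments, flank=100):
    """Return SSR intervals that no alignment covers with `flank` bp on both sides.

    `alignments` is a list of (start, end) contig coordinates, one per read.
    """
    result = []
    for start, end in ssrs:
        spanned = any(a_start <= start - flank and a_end >= end + flank
                      for a_start, a_end in alignments)
        if not spanned:
            result.append((start, end))
    return result
```

Intervals returned by `unspanned` would be the candidates to either break or flag as being of indeterminate length.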

On a side note, I think resolving this type of issue (break or span) is an important consideration for T2T assemblies. I think it is misleading to call a contig or assembly T2T when it contains unspanned repeats, whose length is indeterminate. Plant genomes can contain very long SSRs (>300 kb) that can easily be collapsed or chimeric and nevertheless be called T2T.

image