chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Understanding the complex topologies of unitigs #717

OceanLyu closed this issue 3 weeks ago

OceanLyu commented 1 month ago

Hi, thanks for developing hifiasm!

I'm assembling a diploid genome from CCS (HiFi) reads only, using hifiasm with default parameters. The assembly has a contig N50 of ~15 Mb, and the longest primary unitigs look like this:

image image

As you can see, there are bizarre structures such as tips and 'multi-layer bubbles' inside the bubbles. Since the genome is diploid, I suspect they arise from repeated sequences, but I do not know if I am correct.

Meanwhile, there are a large number of short contigs in the primary assembly (see the QUAST contig plot below). image

As you mentioned in the paper, during bubble popping and generation of the primary assembly, tips that do not come from different haplotypes are preserved. I used RepeatModeler and RepeatMasker to predict repeat sequences in the primary assembly and found that dispersed repeats cover the short contigs over their entire length.

So I wonder whether those short contigs, which appear to be cut tips derived from repeated sequences, can be removed (e.g. filtered by length) before a downstream step such as scaffolding.
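To make the length filter concrete, here is a minimal sketch in Python using only the standard library. It assumes the primary contigs have already been written out as FASTA; the function names and the cutoff are hypothetical and are not part of hifiasm. Whether dropping short contigs is appropriate for this genome is a separate question that the code does not answer.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_len):
    """Keep only records whose sequence is at least min_len bases long."""
    for name, seq in records:
        if len(seq) >= min_len:
            yield name, seq


# Toy input; in practice, open the contig FASTA file instead.
# The cutoff of 5 is illustrative only, not a recommended threshold.
toy = io.StringIO(">ctg1\nACGT\nACGT\n>ctg2\nAC\n")
for name, seq in filter_by_length(read_fasta(toy), min_len=5):
    print(f">{name}\n{seq}")
```

A sensible cutoff depends on the data, so it is worth looking at the contig length distribution before choosing one.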

My final question is whether this contig N50 is long enough, given an estimated genome size of ~2.6 Gb.
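For reference, N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. NG50 uses half of the estimated genome size instead, which makes it easier to compare assemblies of the same genome. The sketch below uses hypothetical function names and toy lengths, not values from this assembly.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    return _nx(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """Return the NG50 for an estimated genome size, or None if the
    assembly does not reach half of that size."""
    return _nx(lengths, genome_size)


def _nx(lengths, reference_total):
    # Walk contigs from longest to shortest until half of the
    # reference total is covered.
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= reference_total:
            return length
    return None


# Toy contig lengths for illustration only.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
print(ng50(toy_lengths, genome_size=60))
```

Because many short contigs add little to the total length, removing them usually changes N50 only slightly, while NG50 against a fixed genome size does not depend on the assembly total at all.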

Ploidy was inferred with Smudgeplot (shown below) and supported by preliminary karyotype experiments. image

The k-mer distribution of the CCS reads from GenomeScope2 is shown below. image

Thanks in advance for your help!

chhylp123 commented 3 weeks ago

It is hard to say. Some genomes are easy to assemble, while others are not. Read coverage also affects the final assembly. All of these factors influence the N50.

These small tips might be repeats or local polyploid regions. If they are too short, there is probably nothing hifiasm can do at present.

OceanLyu commented 3 weeks ago

Thank you very much for your reply!