Numerous spurious contig overlaps in assemblies

dmacguigan commented 5 months ago

Hello,

Our lab has been using hifiasm for a number of genome assembly projects. We are using Nanopore reads (kit 14 chemistry, R10.4.1 flow cells) that have been error-corrected with PECAT or HERRO. Most of our hifiasm analyses have been with default settings. Overall, we're very impressed with the quality and contiguity of the hifiasm assemblies!

However, we have noticed a concerning trend. In all of our assemblies, there are numerous cases where the ends of contigs have large regions of (nearly) identical sequence.

Here's one example of a nucmer sequence alignment dotplot. Our hifiasm assembly is on the y-axis, and a chromosome-level assembly of a close relative is on the x-axis.

When we zoom in, we can see it's a large region of near perfect overlap, about 2 Mb in length.

image (1)

I did some local sequence alignments in this overlap region. We can see that the two contigs have nearly identical sequence in the region of overlap. The only regions where the sequences differ are in STRs.

image (2)

Although I'm only showing one example, these contig overlaps are frequent in our assemblies. We observe them in assemblies for several different species. And we observe them regardless of the read correction method (PECAT or HERRO).

Also, we have used HiC data to phase several genomes with hifiasm. But we still find these large contig overlaps in both hap1 and hap2 assemblies.

Furthermore, when we assemble the same error-corrected reads using other software (such as Flye), we do not observe the same contig overlap phenomenon.

We're concerned that these contig overlaps will be carried forward into our final scaffolded assemblies. Spurious duplicated regions could mislead our downstream analyses, such as gene family expansion.

Do you have any recommendations for how to address this contig overlap problem? Is this a known issue with using error-corrected Nanopore reads?

Thanks for your help, Dan MacGuigan

chhylp123 commented 5 months ago

Sorry for the late reply since I was quite busy this month. Hifiasm might be more sensitive to very small differences better reads, while traditional tools like Flye may decide to collapse them. The corrected ONT reads may still have recurrent sequencing errors, making hifiasm confusing. Is it possible that you can share some corrected datasets with me? Thanks so much!

kevfengler227 commented 4 months ago

These could be real, nearly identical, large tandem repeats. Unequal cross overs can create such events. The best way to check is to align the reads to the assembly and interrogate the coverage at the ends. For 30x HiFi, if the "overlapping" ends have 30x, it is a real duplication. As long as there a few differences between the copies, hifiasm is correctly teasing them apart, which is good.

kevfengler227 commented 4 months ago

alternatively, you could align your perch reads to the walleye reference and see if this 2Mb region has 60x coverage (which suggests a tandem repeat in perch) or 30x. If 30x, but heterozygous it is some phasing/purging issue.

kevfengler227 commented 4 months ago

if it is not a large, near identical tandem repeat. then it could just be phasing issue. if the haplotypes are very similar and they cannot be fully resolved you can end up with incomplete or mixed-up haplotype assignments like this where A and B overlap by 2 Mb and C and D have a 2Mb gap.

Here, moving contig B to hap2 and contig D to hap1 fixes the overlap between A and B and gap issue between C and D. I have to move contigs between haps all the time. Hopefully, your HiC data can validate that A and D are in the same haplotype and C and B are in the same haplotype, but I am guessing the HiC data is non informative here because the haplotypes are so similar, especially in the 2 Mb region. So moving these contigs will create two balanced and complete haplotypes, but may include a phase switch. Ideally, the ONT data would help here, but if the 2 Mb is very similar between both haplotypes even that may not resolve the phasing.

zy66-su commented 4 months ago

Hello,

Our lab has been using hifiasm for a number of genome assembly projects. We are using Nanopore reads (kit 14 chemistry, R10.4.1 flow cells) that have been error-corrected with PECAT or HERRO. Most of our hifiasm analyses have been with default settings. Overall, we're very impressed with the quality and contiguity of the hifiasm assemblies!

However, we have noticed a concerning trend. In all of our assemblies, there are numerous cases where the ends of contigs have large regions of (nearly) identical sequence.

Here's one example of a nucmer sequence alignment dotplot. Our hifiasm assembly is on the y-axis, and a chromosome-level assembly of a close relative is on the x-axis.

When we zoom in, we can see it's a large region of near perfect overlap, about 2 Mb in length.

I did some local sequence alignments in this overlap region. We can see that the two contigs have nearly identical sequence in the region of overlap. The only regions where the sequences differ are in STRs.

Although I'm only showing one example, these contig overlaps are frequent in our assemblies. We observe them in assemblies for several different species. And we observe them regardless of the read correction method (PECAT or HERRO).

Also, we have used HiC data to phase several genomes with hifiasm. But we still find these large contig overlaps in both hap1 and hap2 assemblies.

Furthermore, when we assemble the same error-corrected reads using other software (such as Flye), we do not observe the same contig overlap phenomenon.

We're concerned that these contig overlaps will be carried forward into our final scaffolded assemblies. Spurious duplicated regions could mislead our downstream analyses, such as gene family expansion.

Do you have any recommendations for how to address this contig overlap problem? Is this a known issue with using error-corrected Nanopore reads?

Thanks for your help, Dan MacGuigan

My situation may be similar to what you said, the plant genome I assembled had a chromosome that was itself an independent contig, and I found that there were 2-3mb tandem repeats at the contig tail (the part of the HIC heat map that has no signal). I detected a concentrated presence of telomere sequences (TTTAGGG) at both ends of this section (Chr1 blue arrow), and I have no way of telling if this is the result of an assembly error or a real presence.

kevfengler227 commented 4 months ago

The challenging part of this solving this mystery is that genomes do contain large duplications like this. We have validated many using BioNano maps. We are adding ONT to our assemblies to resolve and phase these duplications which often are associated with key traits. So in each case the question is whether it is real or an assembly artifact. Looking at the coverage is probably the easiest way. Aligning the HiFi or ONT reads to an assembly with just one copy of the duplicated/overlapping should reveal whether the region has 1 or 2x coverage. Moreover, the ONT data should reveal the novel junction reads for each copy. HiC data is not helpful because it relies on uniquely mapping reads, so having both copies in the assembly will be non-informative. Actually, the flye assembly graph should also indicate the 1 or 2x relative coverage in the gfa file, and also additional connections for the second copy. Some plant genomes are very well-behaved and do not exhibit this type of large duplications, but for others it is very common.

kevfengler227 commented 4 months ago

Also, several of our validated examples of large duplications like this are indeed on the ends of homeologous chromosomes. So you can validate with HiC by putting one copy in the assembly and see if you have HiC links to the ends of both chromosomes. But coverage is the indisputable way to confirm one way or the other.

zy66-su commented 4 months ago

Also, several of our validated examples of large duplications like this are indeed on the ends of homeologous chromosomes. So you can validate with HiC by putting one copy in the assembly and see if you have HiC links to the ends of both chromosomes. But coverage is the indisputable way to confirm one way or the other.

Thank you very much for reply! I may understand what you mean, you think that the large fragments of tandem repetition at the end may really exist. I only assembled HIFI and HIC data, and the HIFI sequence mapping results showed that the 2-3mb large tandem repeat region at the tail had 95% coverage. I assembled a forest genome with a size of 470mb (2n=30). Half of the chromosomes in this genome have 2-3mb large tandem repeats at the tail, which makes me confused. I agree that there may be tandem repeat sequences at the tail, but I am more skeptical about such a long sequence. I am thinking that there is a large segment of tandem duplication at the end of the chromosome, which causes "Sample. hic. pc_ctg.gfa" to not terminate in time at the end of the chromosome, resulting in the assembly of the tandem duplication segment being longer than it actually is? Is this also the reason why telomere sequences appear at both ends of 2-3mb tandem repeats? I plan to add '--b-cov 15 --h-cov 120 --m-rate 0.75' to finely divide the complex area, and merge hap1 and hap2 for further confirmation of hic mounting.

kevfengler227 commented 4 months ago

I was talking about X coverage, not % Coverage. If you did a 30x HiFi assembly, then if you align the reads to an assembly containing only 1 copy of the repeat you can check the X coverage. 30x = 1 copy, 60x = 2 copies. The GFA file from flye would show the X coverage too. Both copies may contain telomeric repeats on the ends if they both reside on the ends of different chromosomes. All HiFi assemblers will collapse large nearly identical repeats into a single copy.

zy66-su commented 4 months ago

I was talking about X coverage, not % Coverage. If you did a 30x HiFi assembly, then if you align the reads to an assembly containing only 1 copy of the repeat you can check the X coverage. 30x = 1 copy, 60x = 2 copies. The GFA file from flye would show the X coverage too. Both copies may contain telomeric repeats on the ends if they both reside on the ends of different chromosomes. All HiFi assemblers will collapse large nearly identical repeats into a single copy.

I'm sorry for the late reply. The result you said is correct. I assembled the two haplotype genomes of hifiasm and assembled it by flye, both of which support this result. The coverage depth of tandem repeats in the tail of the other two haplotype genomes is similar, thank you!

chhylp123 / hifiasm

Numerous spurious contig overlaps in assemblies #660