Larger assembly size than expected

jacopoM28 commented 1 year ago

Hello Hifiasm community and developers,

I am working with a diploid insect genome that was sequenced on a Sequel II platform (no Hi-C data available for now, but we are working on obtaining it). According to Genomescope results, the estimated haploid genome size should be around 380Mb, and congeneric species are expected to have a genome size ranging from 270 to 350Mb.

Interestingly, the primary contigs from a default Hifiasm run add up to much more than expected (530Mb). The phased contigs are also larger than the estimate (499Mb and 437Mb). However, both the contig N50 and the Busco results for the primary contigs look promising (N50: 9Mb, Busco: C:96.6%[S:95.7%,D:0.9%],F:0.5%,M:2.9%,n:5991). Additionally, the k-mer spectra obtained with KAT also appear satisfactory to me:

[Image: Fpar_KmerSpectra-main mx spectra-cn]

I suspect that Genomescope might be underestimating the genome size, possibly due to a recent transposable element expansion. Indeed, hifiasm already correctly identifies the heterozygous and homozygous peaks (27 and 54), and there is a high repetitive content of around 60% as well as a 'shoulder' of highly repeated k-mers, both in the k-mer histogram and in the k-mer spectra plot. Am I interpreting these results correctly? If so, even though I believe the results are already good, do you have any suggestions for parameter tuning in Hifiasm, considering the absence of Hi-C data?
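The reasoning above (highly repeated k-mers pulling a k-mer-based estimate down) can be illustrated with a minimal sketch. It assumes the common approximation "genome size is roughly the total number of k-mer occurrences divided by the homozygous peak multiplicity" and that the estimator ignores k-mers above some maximum multiplicity. The function name, the cutoff parameter and the toy histogram are hypothetical and are not taken from Genomescope or from this dataset.

```python
def estimate_genome_size(histogram, hom_peak, max_multiplicity=None):
    """Rough haploid genome size from a k-mer histogram.

    histogram:        dict {k-mer multiplicity: number of distinct k-mers}
    hom_peak:         multiplicity of the homozygous peak
    max_multiplicity: optional cutoff; k-mers seen more often are ignored
                      (assumed behaviour, for illustration only)
    """
    total_occurrences = 0
    for multiplicity, n_distinct in histogram.items():
        if max_multiplicity is not None and multiplicity > max_multiplicity:
            continue  # high-copy repeats are dropped from the estimate
        total_occurrences += multiplicity * n_distinct
    return total_occurrences / hom_peak


# Hypothetical toy histogram (NOT real data): a het peak at 27,
# a hom peak at 54, and a few very high-copy repeat k-mers.
toy_hist = {27: 100, 54: 200, 5000: 2}

full = estimate_genome_size(toy_hist, hom_peak=54)
truncated = estimate_genome_size(toy_hist, hom_peak=54, max_multiplicity=1000)
print(f"all k-mers: {full:.1f}  with cutoff: {truncated:.1f}")
```

In this toy case the cutoff removes the repeat k-mers and the estimate shrinks, which is the direction of bias suspected in the post. Whether it explains the full gap between 380Mb and 530Mb cannot be read off from this sketch.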

Thanks in advance!

Jacopo

chhylp123 commented 1 year ago

Yes, I also feel like hifiasm might be correct and Genomescope might underestimate the genome size. If you really want to make the assembly smaller, please see: https://hifiasm.readthedocs.io/en/latest/faq.html#how-can-i-tweak-parameters-to-improve-hi-c-integrated-assembly.

qdu-beep commented 11 months ago

Hello, dear developers,

I encountered a similar yet somewhat different problem. I assembled a diploid genome using HiFi and Hi-C data, and the assembly is much larger than the k-mer-based genome size estimate (about 2.3Gb). According to the genome survey results, the heterozygosity rate of the genome is very low (0.127%) and the duplication rate is 52.5%. The primary assembly (3.7Gb) is similar in size to the two haplotype assemblies, and the results with the default parameters and with "-s 0.45" are similar.

Busco results of primary contigs: C:98.6%[S:94.3%,D:4.3%],F:0.1%,M:1.3%,n:5950

Assembly information of primary contigs:
Number of contigs: 939
Total bases: 3,706,033,829 bp
Max length: 252,188,764 bp
Average: 3,946,787 bp
Contig N50: 96,520,916 bp

However, the KAT results were less than ideal and there were many duplicate k-mers in the assembly.

[Image: kat_test-main mx spectra-cn]

Importantly, based on some similar issues, I checked some information in the log files and I think hifiasm correctly identified the peak positions. I appreciate your patience and assistance in providing me with any suggestions or ideas! @chhylp123

Some key information is as follows:

peak_hom: 20; peak_het: -1
[M::purge_dups] homozygous read coverage threshold: 20
[M::purge_dups] purge duplication coverage threshold: 25
# heterozygous bases: 510816440; # homozygous bases: 3571296092

[Attachment: nohup.txt]
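For anyone comparing several runs, the values quoted above can be pulled out of a log with a few regular expressions. This is only a sketch: the excerpt is limited to the lines quoted in this comment, the full nohup.txt may be formatted differently, and the function name and dictionary keys are hypothetical. A peak_het of -1 presumably means that no separate heterozygous peak was detected, which would fit the low heterozygosity reported, but that reading is an assumption.

```python
import re

# Log lines as quoted in the comment above (not the full nohup.txt).
log_excerpt = """peak_hom: 20; peak_het: -1
[M::purge_dups] homozygous read coverage threshold: 20
[M::purge_dups] purge duplication coverage threshold: 25
# heterozygous bases: 510816440; # homozygous bases: 3571296092"""

# Hypothetical key names mapped to patterns for the quoted fields.
PATTERNS = {
    "peak_hom": r"peak_hom:\s*(-?\d+)",
    "peak_het": r"peak_het:\s*(-?\d+)",
    "hom_cov_threshold": r"homozygous read coverage threshold:\s*(-?\d+)",
    "purge_cov_threshold": r"purge duplication coverage threshold:\s*(-?\d+)",
    "het_bases": r"# heterozygous bases:\s*(\d+)",
    "hom_bases": r"# homozygous bases:\s*(\d+)",
}


def parse_hifiasm_log(text):
    """Return the first integer match for each known field, or None if absent."""
    values = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        values[key] = int(match.group(1)) if match else None
    return values


parsed = parse_hifiasm_log(log_excerpt)
for key, value in parsed.items():
    print(f"{key}: {value}")
```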

jacopoM28 commented 7 months ago

Dear @chhylp123,

I'm writing to provide some updates about the assembly. Upon closer inspection, I discovered that a significant portion of the genome (40%) is composed of a single satellite family organized in extremely long tandem arrays (up to 10Mb). However, based on the raw reads, the same satellite DNA appears to cover a significantly smaller proportion of the genome (24%). Upon examining the mapping of the HiFi reads back to the genome, it seems that these tandem arrays have lower coverage than the flanking regions.

[Image: Tandem_Repeats]

While I anticipate low mappability across tandem arrays of satellite DNA, considering the larger assembly size than expected, is it possible that Hifiasm is artificially extending ONLY these regions due to high haplotypic variability (i.e., including both haplotypes in a tandem-like fashion)?

Thank you again for your assistance.

Best regards, Jacopo

chhylp123 commented 7 months ago

Hi @jacopoM28, are you using haplotype-resolved assemblies or the primary assembly? What is the average coverage for these problematic regions?

jacopoM28 commented 7 months ago

I am using the primary contigs. The haplotype-resolved assemblies are smaller than the primary assembly, as you can read in the first post, but their Busco scores are about 2-5% lower. Because we don't have Hi-C data from the same individual used for the PacBio sequencing to improve the phasing, we decided to rely only on the primary contigs.

The median coverage is 42.82 across the entire genome, 28.67 across the tandem repeat regions, and 52.94 across the whole genome after excluding the tandem repeats.
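As a rough consistency check, these coverage figures can be combined with the numbers from the earlier posts (530Mb primary assembly, 40% satellite). The sketch below assumes that read coverage scales inversely with how much a region has been over-represented in the assembly, so the satellite size can be rescaled by the coverage ratio. That is a strong simplification, since low mappability in tandem arrays can also depress coverage, and the variable and function names are hypothetical.

```python
# Figures quoted in this thread.
assembly_mb = 530.0        # primary assembly size (first post)
satellite_fraction = 0.40  # satellite share of the assembly
cov_satellite = 28.67      # median coverage across tandem repeats
cov_rest = 52.94           # median coverage outside tandem repeats


def coverage_corrected(assembly_mb, satellite_fraction, cov_satellite, cov_rest):
    """Rescale the satellite portion by its coverage relative to the rest.

    Assumption (not a hifiasm result): if the satellite were assembled at its
    true size, it would have the same coverage as the rest of the genome.
    """
    satellite_mb = assembly_mb * satellite_fraction
    rest_mb = assembly_mb - satellite_mb
    corrected_satellite_mb = satellite_mb * cov_satellite / cov_rest
    corrected_total_mb = rest_mb + corrected_satellite_mb
    corrected_fraction = corrected_satellite_mb / corrected_total_mb
    return corrected_satellite_mb, corrected_total_mb, corrected_fraction


sat_mb, total_mb, sat_frac = coverage_corrected(
    assembly_mb, satellite_fraction, cov_satellite, cov_rest
)
print(f"satellite: {sat_mb:.1f} Mb, total: {total_mb:.1f} Mb, "
      f"satellite share: {sat_frac:.1%}")
```

Under this assumption the satellite shrinks to roughly 115Mb and makes up about 26.5% of a roughly 433Mb genome, which is in the neighbourhood of the 24% seen in the raw reads. It is compatible with the arrays being over-represented in the primary contigs, but it does not prove it.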

Thank you again for your time!

Jiseon623 commented 6 months ago

Dear @jacopoM28

Hello, I'm experiencing a similar issue. Repetitive regions seem to be assembled longer than their actual length. Have you found a solution for this extension problem? I'm also curious which options you used.

Thanks in advance. Jiseon

chhylp123 / hifiasm

Larger assembly size than expected #548