chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Trio binning - high error rates and significant assembly size differences #709

Open BrianSmart opened 1 month ago

Hello!

I recently ran hifiasm in trio-binning mode on sunflower samples. We used two lines that are thought to be largely isogenic apart from an anthocyanin/male fertility locus, and crossed them to get a "heterozygous" line. We did Illumina short-read "parental" sequencing on the two parents (~20x coverage) and PacBio Revio long-read sequencing on the heterozygous line (~50x coverage). I then used the following hifiasm command:

hifiasm -o Sunflower_1_Hetero_Red_trioBinning.asm -t 128 -1 Sunflower_2_Homo_Red.yak -2 Sunflower_3_Homo_Green.yak ../MutagenesisPacBioLongReadsMerged.fastq.gz

The resulting hap1 and hap2 p_ctg outputs were then used for assembly. Gfastats, BUSCO and Merqury show the following:

Haplotype 1:
Total length: 718.08 Mb
Scaffold N50: 1.12 Mb
Largest scaffold: 32.20 Mb
Number of scaffolds: 4,129
GC content: 39.72%
BUSCO: 23.2% complete
Merqury: 25.5% complete, QV score 58.2

Haplotype 2:
Total length: 2.97 Gb
Scaffold N50: 92.20 Mb
Largest scaffold: 202.28 Mb
Number of scaffolds: 727
GC content: 38.70%
BUSCO: 95.7% complete
Merqury: 96.5% complete, QV score 63.7

My main concerns about these results before proceeding with publication are the size difference between the haplotypes and the switch, Hamming, and error rates, which are (in that order):

Trio Hap1: 11.84% 14.89% 18.02%
Trio Hap2: 21.74% 26.82% 27.82%
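
For readers less familiar with the two phasing metrics, here is a toy Python sketch of one common way to define them over an ordered list of parent-specific markers on a single contig. It is a simplified illustration, not the implementation of any particular tool, and the function name and the "P"/"M" labels are made up for the example.

```python
def switch_and_hamming(labels):
    """Toy phasing metrics for one contig.

    labels: string of 'P' / 'M' characters, one per parent-specific marker,
            in the order the markers occur along the contig (hypothetical input).
    Returns (switch_rate, hamming_rate).
    """
    if len(labels) < 2:
        raise ValueError("need at least two markers")
    # Switch rate: fraction of adjacent marker pairs whose parental origin differs.
    switches = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    switch_rate = switches / (len(labels) - 1)
    # Hamming rate: fraction of markers that come from the minority parent.
    minority = min(labels.count("P"), labels.count("M"))
    hamming_rate = minority / len(labels)
    return switch_rate, hamming_rate


# One stray maternal marker inside an otherwise paternal contig.
s1, h1 = switch_and_hamming("PPPPMPPPPP")
# A single long-range switch halfway along the contig.
s2, h2 = switch_and_hamming("PPPPPMMMMM")
print(f"stray marker: switch={s1:.3f} hamming={h1:.3f}")
print(f"half switch:  switch={s2:.3f} hamming={h2:.3f}")
```

When the parents are nearly identical, few parent-specific markers exist, so rates like these rest on a small sample. That is a general caveat, not a diagnosis of the numbers above.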

Is this level of size difference between haplotypes normal for highly similar parents? How can I determine if the high error rates are due to similarity or actual errors? Are there any additional analyses you'd recommend to validate these assemblies?

The main reason I'm not too worried is that these two haplotypes should be largely identical, so size differences and high error rates might be expected. Perhaps the high error rates just reflect the high sequence similarity?

Thanks for this fantastic program! The resulting scaffolds from YaHS using the Omni-C data look fantastic regardless of these concerns.


For reference, the GenomeScope2 summary for the HiFi reads is:

GenomeScope version 2.0
input file = meryl_hifi_kmers_k21_histogram.txt
output directory = .
p = 2
k = 21

property                      min                 max
Homozygous (aa)               89.6889%            93.8683%
Heterozygous (ab)             6.13174%            10.3111%
Genome Haploid Length         1,538,516,725 bp    1,543,677,918 bp
Genome Repeat Length          1,217,987,977 bp    1,222,073,907 bp
Genome Unique Length          320,528,748 bp      321,604,011 bp
Model Fit                     28.0299%            82.9%
Read Error Rate               0.105375%           0.105375%
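
As a rough sanity check, the short Python sketch below puts the haplotype sizes next to the GenomeScope2 haploid estimate. It only reuses figures quoted in this post and the variable names are illustrative. Given the low Model Fit values, the GenomeScope2 length is itself an uncertain yardstick.

```python
# All values are copied from the post; nothing here is measured independently.
MB = 1_000_000

hap1_len = 718.08 * MB            # gfastats total length, haplotype 1
hap2_len = 2970.0 * MB            # gfastats total length, haplotype 2 (2.97 Gb)
gs_haploid_max = 1_543_677_918    # GenomeScope2 "Genome Haploid Length", max column

# How much larger is hap2 than hap1?
hap_ratio = hap2_len / hap1_len
# Total sequence placed across both haplotypes.
combined = hap1_len + hap2_len
# hap2 relative to one GenomeScope2 haploid genome.
hap2_vs_gs = hap2_len / gs_haploid_max
# Both haplotypes relative to two GenomeScope2 haploid genomes.
combined_vs_diploid = combined / (2 * gs_haploid_max)

print(f"hap2 / hap1:                 {hap_ratio:.2f}x")
print(f"hap1 + hap2:                 {combined / MB:.2f} Mb")
print(f"hap2 / GenomeScope haploid:  {hap2_vs_gs:.2f}x")
print(f"(hap1 + hap2) / (2 x haploid): {combined_vs_diploid:.2f}x")
```

If the two haplotypes were evenly partitioned, the first ratio would be close to 1. Comparing each haplotype against an independent genome size estimate is one way to see which of the two departs from expectation.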