chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

assembling a highly heterozygous plant #330

Open badplantgeek opened 1 year ago

badplantgeek commented 1 year ago

Hi. I have been trying to assemble the genome of a plant species with a high rate of heterozygosity. The genome size is around 2.8 Gb, and I have about 40X coverage of HiFi reads plus Hi-C data.

I have tried over 30 combinations of hifiasm parameters, but I can't get an assembly that combines high contiguity, a low number of contigs, and the correct genome size. Here is the k-mer plot output from hifiasm: l3_run.txt.

The table below shows the parameters I tested so far. Is there anything that I am missing? Any suggestions on how I could improve this assembly? Any help would be much appreciated!

| Parameters | N50 (bp) | # contigs | hap1 assembly size (bp) |
| --- | --- | --- | --- |
| `--primary --l2 --n-hap 2` | 11208698 | 7810 | 4492480398 |
| `--primary --l2 --n-hap 4` | 11444665 | 7969 | 4257999532 |
| `--primary --l2 --n-hap 3` | 11582156 | 7973 | 4432899259 |
| `--primary --l2 --n-hap 2 --hg-size 2.8g` | 2369348 | 7176 | 1061205025 |
| `--primary --l2 --n-hap 2 -D 10` | 11208698 | 7809 | 4492528825 |
| `--primary --l2 --n-hap 4 --hg-size 2.8g` | 11208698 | 12337 | 4492528825 |
| `--primary --l2 --hom-cov 32 --hg-size 3g` | 11208698 | 7810 | 4492480398 |
| `--primary --hom-cov 33 -s 0.1` | 9064087 | 9178 | 4627444839 |
| `--primary --l2 -s 0.75` | 11208698 | 7810 | 4492480398 |
| `--primary --l2 --purge-cov` | 5469586 | 9327 | 2961317496 |
| `--primary --l2 -s 0.1` | 5604130 | 9179 | 2967512071 |
| `--primary --l2 -s 0.2` | 5597995 | 8745 | 2778259190 |
| `--primary -l3` | 6638195 | 7822 | 2164743314 |
| `--primary -l3 --hg-size 2.8` | 2584664 | 7938 | 1320354165 |
| `--primary -l3 --n-hap 4` | 8952511 | 11264 | 6618029629 |
| `--primary -l3 -s 0.1` | 10466999 | 9118 | 5581127764 |
| `--primary -l3 -D 10` | 6638195 | 7822 | 2164743314 |
| `--primary -l3 --primary -l3 --hg-size 3g --hom-cov 32 --n-hap 2` | 8989502 | 12006 | 6918919875 |
| `--primary -l3 --primary -l3 --hg-size 3g --hom-cov 32 -D 10` | 8989502 | 12006 | 6918843924 |
| `--primary -l3 --primary -l3 --n-hap 3` | 8839068 | 11275 | 6607018238 |
| `--primary -l3 --primary -l3 --hg-size 2.8g -s 0.1` | 10479801 | 8928 | 5216611048 |
| `--primary -l3 -s 0.2` | 5310273 | 8997 | 2868864197 |
| `--primary -l2 -s 0.75` | 7398995 | 7243 | 1983228607 |
| `--primary -l3 --hom-cov 32 -s 0.4` | 8989503 | 11493 | 6836955766 |
| `--primary -l3 -s 0.3 -hom-cov 32` | 10078153 | 8660 | 4504118024 |
| `--primary -l1 --hg-size 2.8g` | 2478164 | 6794 | 1042039281 |
| `--primary -l1 --n-hap 4` | 11444665 | 7571 | 4241695685 |
| `--primary -l1 -s 0.1` | 6262460 | 6751 | 2814782205 |
| `--primary -l1 -D 10` | 11219601 | 7558 | 4483659495 |

chhylp123 commented 1 year ago

Could you please also show the size of hap2?

badplantgeek commented 1 year ago

Hi Haoyu

Thanks for your quick reply. I don't have the hap2 size for all, but I have for a few. See below.

| Parameters | hap1 size (bp) | hap2 size (bp) |
| --- | --- | --- |
| primary l1 s 0.2 | 2658241726 | 5493102674 |
| primary l1 n-hap 4 | 4241695685 | 6832817032 |
| primary l1 s 0.1 | 2814782205 | 5254045602 |
| primary l1 D 10 | 4483659495 | 6838537694 |
| primary l2 s 0.1 | 6422772254 | 2152964394 |
| primary l3 s 0.1, hg-size 2.8g | 5216611048 | 2415842962 |
| primary l3 s 0.2, hg-size 2.8g | 5382089003 | 2243043410 |
| primary l3 s 0.3, hg-size 2.8g | 2117668233 | 5543717930 |

chhylp123 commented 1 year ago

The assemblies of both haplotypes together are much larger than 2.8g*2. How did you estimate the genome size? I am wondering if there are contaminants, as the k-mer plot of your data is also a little bit weird. A good plot should look like this: https://github.com/chhylp123/hifiasm/issues/10#issuecomment-616213684.
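
To make the size comparison concrete, here is a minimal sketch that sums the hap1 and hap2 sizes reported earlier in the thread and compares each total with twice the haploid estimate. It assumes the 2.8 Gb flow-cytometry figure is the haploid size; the function name `diploid_excess` is made up for this illustration.

```python
# Haploid genome size in bp (2.8 Gb), assumed from the flow-cytometry estimate.
HAPLOID_SIZE = 2_800_000_000
EXPECTED_DIPLOID = 2 * HAPLOID_SIZE

# (parameters, hap1 size in bp, hap2 size in bp), copied from the thread.
RUNS = [
    ("primary l1 s 0.2", 2658241726, 5493102674),
    ("primary l1 n-hap 4", 4241695685, 6832817032),
    ("primary l1 s 0.1", 2814782205, 5254045602),
    ("primary l1 D 10", 4483659495, 6838537694),
    ("primary l2 s 0.1", 6422772254, 2152964394),
    ("primary l3 s 0.1, hg-size 2.8g", 5216611048, 2415842962),
    ("primary l3 s 0.2, hg-size 2.8g", 5382089003, 2243043410),
    ("primary l3 s 0.3, hg-size 2.8g", 2117668233, 5543717930),
]


def diploid_excess(hap1, hap2, expected=EXPECTED_DIPLOID):
    """Return (hap1 + hap2, that total divided by the expected diploid size)."""
    total = hap1 + hap2
    return total, total / expected


if __name__ == "__main__":
    for params, hap1, hap2 in RUNS:
        total, ratio = diploid_excess(hap1, hap2)
        print(f"{params:32s} total={total / 1e9:6.2f} Gb  ({ratio:.2f}x expected)")
```

Every one of the eight runs totals more than 5.6 Gb, which is what prompts the question about the size estimate and possible contamination.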

badplantgeek commented 1 year ago

I agree that the k-mer plot is not ideal. We estimated the genome size with flow cytometry, so I am pretty certain it is around 2.8g.

If there is contamination, I would imagine it is from the same species, perhaps from two different individuals. Since the heterozygosity is high in this species, hifiasm is assembling contigs separately that are in reality the same region of the genome. Is there a parameter that I can use to "relax this merging"? I looked into purge_dups and thought about relaxing the -a parameter. Any thoughts? Any suggestion would be much appreciated!

chhylp123 commented 1 year ago

Well, we haven't seen this case before. You could probably run purge_dups to get a clean reference, and then map hap1/hap2 to it for debugging? It is not easy for hifiasm itself to handle this issue automatically.
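
As a rough illustration of that suggestion, the sketch below only prints the shell commands for a typical purge_dups run followed by mapping both haplotype assemblies to the purged result; it does not execute anything. The command sequence follows the pipeline described in the purge_dups README as I understand it, so please check each flag against the tool's own help output. All file names (`hap1.fa`, `hap2.fa`, `hifi_reads.fq.gz`) are hypothetical, and hifiasm's GFA output would first need converting to FASTA. The `-a` value is the alignment-score threshold mentioned above; 70 is assumed to be its default.

```python
def purge_dups_commands(asm="hap1.fa", reads="hifi_reads.fq.gz", min_align_score=70):
    """Build (not run) the shell commands for one purge_dups pass.

    File names are placeholders; min_align_score is passed to purge_dups -a.
    """
    split = f"{asm}.split"
    return [
        # Map HiFi reads to the assembly to get read depth.
        f"minimap2 -x map-hifi {asm} {reads} | gzip -c - > reads.paf.gz",
        # Compute coverage statistics (writes PB.stat and PB.base.cov).
        "pbcstat reads.paf.gz",
        # Derive coverage cutoffs from the depth histogram.
        "calcuts PB.stat > cutoffs 2> calcuts.log",
        # Split the assembly and align it against itself.
        f"split_fa {asm} > {split}",
        f"minimap2 -x asm5 -DP {split} {split} | gzip -c - > self.paf.gz",
        # Flag duplicated haplotigs and overlaps.
        f"purge_dups -2 -a {min_align_score} -T cutoffs -c PB.base.cov "
        "self.paf.gz > dups.bed 2> purge_dups.log",
        # Write purged.fa (kept sequences) and hap.fa (removed sequences).
        f"get_seqs -e dups.bed {asm}",
    ]


def debug_mapping_commands(reference="purged.fa", haps=("hap1.fa", "hap2.fa")):
    """Build commands that map each haplotype assembly to the purged reference."""
    return [f"minimap2 -x asm5 {reference} {hap} > {hap}.vs_purged.paf" for hap in haps]


if __name__ == "__main__":
    for cmd in purge_dups_commands() + debug_mapping_commands():
        print(cmd)
```

Regions of the purged reference that are covered by many contigs from the same haplotype assembly in the resulting PAF files would be the first places to look for over-split or contaminating sequence.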