assembling a highly heterozygous plant

badplantgeek commented 1 year ago

Hi. I have been trying to assemble the genome of a plant species that has a high rate of heterozygosity. The genome size is around 2.8g, and I have about 40X coverage of HiFi reads, plus Hi-C data.

I have tried over 30 combinations of hifiasm parameters, but I can't get the contiguity up and the number of contigs down while keeping the correct genome size. Here is the k-mer plot output from hifiasm: l3_run.txt.

The table below shows the parameters I tested so far. Is there anything that I am missing? Any suggestions on how I could improve this assembly? Any help would be much appreciated!

Parameters	N50 (bp)	# contigs	hap1 size (bp)
--primary --l2 --n-hap 2	11208698	7810	4492480398
--primary --l2 --n-hap 4	11444665	7969	4257999532
--primary --l2 --n-hap 3	11582156	7973	4432899259
--primary --l2 --n-hap 2 --hg-size 2.8g	2369348	7176	1061205025
--primary --l2 --n-hap 2 -D 10	11208698	7809	4492528825
--primary --l2 --n-hap 4 --hg-size 2.8g	11208698	12337	4492528825
--primary --l2 --hom-cov 32 --hg-size 3g	11208698	7810	4492480398
--primary --hom-cov 33 -s 0.1	9064087	9178	4627444839
--primary --l2 -s 0.75	11208698	7810	4492480398
--primary --l2 --purge-cov	5469586	9327	2961317496
--primary --l2 -s 0.1	5604130	9179	2967512071
--primary --l2 -s 0.2	5597995	8745	2778259190
--primary -l3	6638195	7822	2164743314
--primary -l3 --hg-size 2.8	2584664	7938	1320354165
--primary -l3 --n-hap 4	8952511	11264	6618029629
--primary -l3 -s 0.1	10466999	9118	5581127764
--primary -l3 -D 10	6638195	7822	2164743314
--primary -l3 --hg-size 3g --hom-cov 32 --n-hap 2	8989502	12006	6918919875
--primary -l3 --hg-size 3g --hom-cov 32 -D 10	8989502	12006	6918843924
--primary -l3 --n-hap 3	8839068	11275	6607018238
--primary -l3 --hg-size 2.8g -s 0.1	10479801	8928	5216611048
--primary -l3 -s 0.2	5310273	8997	2868864197
--primary -l2 -s 0.75	7398995	7243	1983228607
--primary -l3 --hom-cov 32 -s 0.4	8989503	11493	6836955766
--primary -l3 -s 0.3 --hom-cov 32	10078153	8660	4504118024
--primary -l1 --hg-size 2.8g	2478164	6794	1042039281
--primary -l1 --n-hap 4	11444665	7571	4241695685
--primary -l1 -s 0.1	6262460	6751	2814782205
--primary -l1 -D 10	11219601	7558	4483659495
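
One way to read the table above is to rank runs by how far the hap1 size lands from the expected 2.8 Gb. The sketch below does this for a handful of rows copied from the table. The `rank_by_size` helper and the choice of rows are illustrative assumptions and are not part of hifiasm.

```python
# Expected haploid genome size quoted in the post (about 2.8 Gb).
EXPECTED_SIZE = 2_800_000_000

# (parameters, N50, number of contigs, hap1 size) copied from a few table rows.
RUNS = [
    ("--primary --l2 --n-hap 2", 11208698, 7810, 4492480398),
    ("--primary --l2 --purge-cov", 5469586, 9327, 2961317496),
    ("--primary --l2 -s 0.1", 5604130, 9179, 2967512071),
    ("--primary --l2 -s 0.2", 5597995, 8745, 2778259190),
    ("--primary -l3", 6638195, 7822, 2164743314),
    ("--primary -l3 -s 0.2", 5310273, 8997, 2868864197),
]


def rank_by_size(runs, expected=EXPECTED_SIZE):
    """Sort runs by absolute relative deviation from the expected size."""
    scored = []
    for params, n50, contigs, size in runs:
        deviation = (size - expected) / expected  # signed, e.g. +0.60 = 60% too large
        scored.append((abs(deviation), deviation, params, n50, contigs))
    scored.sort()
    return [(params, dev, n50, contigs) for _, dev, params, n50, contigs in scored]


if __name__ == "__main__":
    for params, dev, n50, contigs in rank_by_size(RUNS):
        print(f"{dev:+7.1%}  N50={n50:>10,}  contigs={contigs:>6,}  {params}")
```

On these rows, the two `-s 0.2` runs land within a few percent of 2.8 Gb, but at roughly half the N50 of the oversized assemblies.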

chhylp123 commented 1 year ago

Could you please also show the size of hap2?

badplantgeek commented 1 year ago

Hi Haoyu

Thanks for your quick reply. I don't have the hap2 size for all, but I have for a few. See below.

Parameters			hap1 size (bp)	hap2 size (bp)
primary	l1	s 0.2	2658241726	5493102674
primary	l1	n-hap 4	4241695685	6832817032
primary	l1	s 0.1	2814782205	5254045602
primary	l1	D 10	4483659495	6838537694

primary	l2	s 0.1	6422772254	2152964394

primary	l3	s 0.1, hg-size 2.8g	5216611048	2415842962
primary	l3	s 0.2, hg-size 2.8g	5382089003	2243043410
primary	l3	s 0.3, hg-size 2.8g	2117668233	5543717930

chhylp123 commented 1 year ago

The two haplotype assemblies together are much larger than 2.8g*2. How did you estimate the genome size? I am wondering if there are contaminants, as the k-mer plot of your data is also a little bit weird. A good plot should look like this: https://github.com/chhylp123/hifiasm/issues/10#issuecomment-616213684.
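
The size argument can be checked directly from the numbers posted in the thread: for a diploid, hap1 plus hap2 should come to roughly twice the haploid estimate. The sketch below is a minimal version of that check, using the hap1/hap2 sizes from the previous comment and the 2.8 Gb figure from the original post. The `excess_ratio` helper is a hypothetical name used only for illustration.

```python
HAPLOID_SIZE = 2_800_000_000      # estimate quoted in the thread
DIPLOID_SIZE = 2 * HAPLOID_SIZE   # expected hap1 + hap2 total

# (run label, hap1 size, hap2 size) as posted in the thread.
HAP_SIZES = [
    ("l1, s 0.2", 2658241726, 5493102674),
    ("l1, n-hap 4", 4241695685, 6832817032),
    ("l1, s 0.1", 2814782205, 5254045602),
    ("l1, D 10", 4483659495, 6838537694),
    ("l2, s 0.1", 6422772254, 2152964394),
    ("l3, s 0.1, hg-size 2.8g", 5216611048, 2415842962),
    ("l3, s 0.2, hg-size 2.8g", 5382089003, 2243043410),
    ("l3, s 0.3, hg-size 2.8g", 2117668233, 5543717930),
]


def excess_ratio(hap1, hap2, expected=DIPLOID_SIZE):
    """Return (hap1 + hap2) divided by the expected diploid size."""
    return (hap1 + hap2) / expected


if __name__ == "__main__":
    for label, hap1, hap2 in HAP_SIZES:
        total = hap1 + hap2
        print(f"{label:<26} total={total:>14,}  x{excess_ratio(hap1, hap2):.2f} of expected")
```

Every posted pair sums to between about 1.36 and 2.02 times the expected 5.6 Gb.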

badplantgeek commented 1 year ago

I agree that the k-mer plot is not ideal. We estimated the genome size with flow cytometry, so I am pretty certain it is around 2.8g.

If there is contamination, I would imagine it is from the same species, perhaps from two different individuals. Since the heterozygosity is high in this species, hifiasm is assembling contigs separately that are in reality the same region of the genome. Is there a parameter that I can use to "relax this merging"? I looked into purge_dups and thought about relaxing the -a parameter. Any thoughts? Any suggestion would be much appreciated!

chhylp123 commented 1 year ago

Well, we haven't seen this case before. Perhaps run purge_dups to get a clean reference, and then map hap1/hap2 to it for debugging? It is not easy for hifiasm itself to handle this issue automatically.
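
Once hap1/hap2 contigs are aligned to the purged assembly, the result of the mapping step could be summarized per reference contig. The sketch below is a rough version of that summary. It assumes the alignments are in PAF format (as written by minimap2) and that aligned haplotype bases per reference base is a useful first signal. The records and contig names are made up for illustration, and overlapping or secondary alignments are not deduplicated.

```python
from collections import defaultdict


def mean_depth_per_target(paf_lines):
    """Return {target_name: aligned_bases / target_length} from PAF records.

    PAF columns used (0-based): 5 = target name, 6 = target length,
    7 = target start, 8 = target end.
    """
    aligned = defaultdict(int)
    lengths = {}
    for line in paf_lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        tname, tlen = fields[5], int(fields[6])
        tstart, tend = int(fields[7]), int(fields[8])
        aligned[tname] += tend - tstart
        lengths[tname] = tlen
    return {name: aligned[name] / lengths[name] for name in lengths}


# Hypothetical records: three haplotype contigs each span all of ref_ctg1,
# and one haplotype contig spans half of ref_ctg2.
EXAMPLE_PAF = [
    "h1tg1\t1000\t0\t1000\t+\tref_ctg1\t1000\t0\t1000\t990\t1000\t60",
    "h1tg2\t1000\t0\t1000\t+\tref_ctg1\t1000\t0\t1000\t985\t1000\t60",
    "h2tg1\t1000\t0\t1000\t-\tref_ctg1\t1000\t0\t1000\t980\t1000\t60",
    "h2tg2\t500\t0\t500\t+\tref_ctg2\t1000\t0\t500\t495\t500\t60",
]

if __name__ == "__main__":
    for name, depth in sorted(mean_depth_per_target(EXAMPLE_PAF).items()):
        print(f"{name}\t{depth:.2f}")
```

If hap1 and hap2 are mapped together, a depth near 2 is what one would expect for a diploid. Values well above 2 would point to regions that were assembled more than twice.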

chhylp123 / hifiasm

assembling a highly heterozygous plant #330