Hifiasm doesn't find heterozygous kmer peak correctly - consequently HiC mode fails

conchoecia commented 3 years ago

Hi Haoyu,

This is related to #55, where I noticed that hifiasm was not correctly identifying the heterozygous and homozygous peaks. The animal that I'm working with is very heterozygous (>4%), so the kmer spectrum has a very large het peak, and the hom peak is much smaller. I know that the animal is not polyploid, as I already have generated a diploid, chromosome-scale assembly for this individual.

I ran the new version of hifiasm with these parameters:

hifiasm --min-hist-cnt 6 --h1 ${HICF} --h2 ${HICR} \
        --high-het -o ${PREFIX}.asm -t 25 ${READS}

And I was seeing that hifiasm was misidentifying the heterozygous peak as homozygous:

[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 19; peak_het: -1
[M::ha_pt_gen::2545.662*19.20] ==> indexed 208812228 positions
[M::ha_assemble::3334.422*20.56@20.844GB] ==> corrected reads for round 3
[M::ha_assemble] # bases: 8052764813; # corrected bases: 64087; # recorrected bases: 12886
[M::ha_assemble] size of buffer: 2.369GB
[M::ha_pt_gen::3385.164*20.54] ==> counted 13555964 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[6] = 146792
[M::ha_analyze_count] highest: count[19] = 448136
[M::ha_hist_line]     1: ****************************************************************************************************> 3492054
[M::ha_hist_line]     2: ****************************************************************************************************> 616557
[M::ha_hist_line]     3: ************************************************************ 268396
[M::ha_hist_line]     4: **************************************** 179832
[M::ha_hist_line]     5: ********************************* 146141
[M::ha_hist_line]     6: ********************************* 146792
[M::ha_hist_line]     7: ********************************** 150284
[M::ha_hist_line]     8: ************************************ 163066
[M::ha_hist_line]     9: ***************************************** 182498
[M::ha_hist_line]    10: ********************************************* 202978
[M::ha_hist_line]    11: **************************************************** 234714
[M::ha_hist_line]    12: ************************************************************ 270546
[M::ha_hist_line]    13: ******************************************************************** 305315
[M::ha_hist_line]    14: **************************************************************************** 339355
[M::ha_hist_line]    15: ************************************************************************************ 375182
[M::ha_hist_line]    16: ****************************************************************************************** 403463
[M::ha_hist_line]    17: ************************************************************************************************* 432605
[M::ha_hist_line]    18: *************************************************************************************************** 445145
[M::ha_hist_line]    19: **************************************************************************************************** 448136
[M::ha_hist_line]    20: *************************************************************************************************** 444365
[M::ha_hist_line]    21: ************************************************************************************************ 430671
[M::ha_hist_line]    22: ******************************************************************************************* 405954
[M::ha_hist_line]    23: *********************************************************************************** 371682
[M::ha_hist_line]    24: ************************************************************************* 329323
[M::ha_hist_line]    25: **************************************************************** 287605
[M::ha_hist_line]    26: ******************************************************** 249192
[M::ha_hist_line]    27: *********************************************** 209243
[M::ha_hist_line]    28: *************************************** 175425

At the end of the run, the .gfa files for hap1 and hap2 both were twice the size (~400Mbp) of the haploid genome size (~200Mbp). I think that for this to work correctly, the assembler will need to know which kmers that are from the 1x (het) peak to get the correct haplotype binning.

lh3 commented 3 years ago

Misidentifying het/hom k-mer peak usually won't have a big effect.

I already have generated a diploid, chromosome-scale assembly for this individual.

How did you do the assembly?

If you generate primary/alternate assembly without Hi-C, what is the size of the primary assembly and alternate assembly, respectively?

lh3 commented 3 years ago

Never mind. I see numbers in #55: 420Mb for primary and 28Mb for alternate if this is the same sample. Hifiasm currently doesn't handle this case well. We are aware of that and trying to improve.

chhylp123 commented 3 years ago

I will expose an option to let users manfully set hom peak as soon as possible. In Hi-C mode, hifiasm needs to use homozygous peak to make sure which unitig is homozygous. In this example, all untigs are identified as homozygous unitigs, so that all of them are assigned to both haplotypes. #55 is caused by stringent threshold in purge_dups, and it also affects the samples with high heterozygous here. I will also fix this problem in the next version.

conchoecia commented 3 years ago

Hi Dr. Li and Dr. Cheng - thank you for your message. Yes, those are the right numbers from #55, 420Mb for primary and 28Mb for alternate.

The way that I assembled the chromosome-scale diploid genome was to:

Assemble the HiFi reads with hifiasm, as described in #55
Concatenate the output of the primary and alternate assemblies into a single fasta file
Scaffold with ~300x coverage of Hi-C data with Dovetail Hirise. One library used MluCI (AATT) and one library used DpnII (GATC).
Verify that the chromosome-scale contigs were haplotypes using whole-genome dotplots and several phasing analyses with Hi-C data.

Let me know if you are interested in taking a look at this dataset or my assembly results. Thank you!

Darrin

chhylp123 commented 3 years ago

Sorry for the late reply. I will push a new version to repo soon and hopefully it can fix the purge_dups and hic issues for your sample.

conchoecia commented 3 years ago

Hey there, just thought I would give an update with the latest hifiasm release -- 0.15-r327.

The software still identifies the het/homs peaks as -1/19 instead of the correct 19/38. The command was hifiasm --min-hist-cnt 6 --h1 ${HICF} --h2 ${HICR} -o ${PREFIX}.asm -t 48 ${READS}.

[M::ha_analyze_count] lowest: count[6] = 3564264
[M::ha_analyze_count] highest: count[19] = 9332182
[M::ha_hist_line]     2: ****************************************************************************************************> 17409506
[M::ha_hist_line]     3: ******************************************************************************* 7372953
[M::ha_hist_line]     4: ************************************************* 4607077
[M::ha_hist_line]     5: *************************************** 3640198
[M::ha_hist_line]     6: ************************************** 3564264
[M::ha_hist_line]     7: ************************************** 3565047
[M::ha_hist_line]     8: **************************************** 3775893
[M::ha_hist_line]     9: ******************************************** 4136235
[M::ha_hist_line]    10: ************************************************* 4565214
[M::ha_hist_line]    11: ******************************************************* 5159593
[M::ha_hist_line]    12: *************************************************************** 5882220
[M::ha_hist_line]    13: *********************************************************************** 6582537
[M::ha_hist_line]    14: ****************************************************************************** 7277044
[M::ha_hist_line]    15: ************************************************************************************* 7949421
[M::ha_hist_line]    16: ******************************************************************************************** 8564319
[M::ha_hist_line]    17: ************************************************************************************************* 9078081
[M::ha_hist_line]    18: *************************************************************************************************** 9272016
[M::ha_hist_line]    19: **************************************************************************************************** 9332182
[M::ha_hist_line]    20: ************************************************************************************************** 9182521
[M::ha_hist_line]    21: *********************************************************************************************** 8879872
[M::ha_hist_line]    22: ***************************************************************************************** 8306297
[M::ha_hist_line]    23: ********************************************************************************* 7571667
[M::ha_hist_line]    24: ************************************************************************ 6685662
[M::ha_hist_line]    25: *************************************************************** 5840484
[M::ha_hist_line]    26: ****************************************************** 5031622
[M::ha_hist_line]    27: ********************************************* 4221481
[M::ha_hist_line]    28: ************************************** 3534484
[M::ha_hist_line]    29: ******************************* 2896339
[M::ha_hist_line]    30: ************************** 2442176
[M::ha_hist_line]    31: ********************** 2074668
[M::ha_hist_line]    32: ******************** 1820350
[M::ha_hist_line]    33: ****************** 1644839
[M::ha_hist_line]    34: ***************** 1558284
[M::ha_hist_line]    35: **************** 1466842
[M::ha_hist_line]    36: **************** 1448174
[M::ha_hist_line]    37: *************** 1430428
[M::ha_hist_line]    38: **************** 1465001
[M::ha_hist_line]    39: **************** 1451824
[M::ha_hist_line]    40: *************** 1434795
[M::ha_hist_line]    41: *************** 1382950
[M::ha_hist_line]    42: ************** 1336731
[M::ha_hist_line]    43: ************** 1266842
[M::ha_hist_line]    44: ************* 1198460
[M::ha_hist_line]    45: ************ 1134650
[M::ha_hist_line]    46: *********** 1058577
[M::ha_hist_line]    47: ********** 954925
[M::ha_hist_line]    48: ********* 869224
[M::ha_hist_line]    49: ********* 796278
[M::ha_hist_line]    50: ******** 706943
[M::ha_hist_line]    51: ******* 621503
[M::ha_hist_line]    52: ****** 549222
[M::ha_hist_line]    53: ***** 476795
[M::ha_hist_line]    54: ***** 428266
.
.
.
[M::ha_hist_line]  rest: ******************************************* 3985128
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_ft_gen] peak_hom: 19; peak_het: -1
[M::ha_ft_gen::429.341*7.80@20.853GB] ==> filtered out 3811358 k-mers occurring 95 or more times
[M::ha_opt_update_cov] updated max_n_chain to 100
[M::ha_pt_gen::624.807*7.41] ==> counted 18097002 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[6] = 152684

chhylp123 commented 3 years ago

You can manually set hom peak by '--purge-cov'. It is hard for hifiasm to get right peak in this case. Please note it just affects phasing and graph cleaning, so that no need to run the whole assembly workflow.

chhylp123 commented 3 years ago

By the way, could you please show the following number in the log file?


[M::purge_dups] purge duplication coverage threshold:

conchoecia commented 3 years ago

Hi @chhylp123 - Where in the k-mer spectrum should --purge-cov be set? To the left of the het peak, In the trough between the het peak and the hom peak? To the right of the hom peak?

There were two instances of [M::purge_dups] purge duplication coverage threshold: 47 in the log file for the v0.15-r327 run.

Good news on the genome assembly side of things. The assembly stats for v0.15-r327 were closer to the estimated genome size of 174 Mb (348 Mb for both haplotypes) (from k-mer spectrum):

Total size (Hap1 + Hap2 = 553.6 Mb)
- Hap1:
- 759 contigs
- 291.2 Mb
- N50 of 1.5 Mb
- Hap2:
- 491 contigs
- 262.4 Mb
- N50 of 1.7 Mb

For comparison, the assembly stats for the previous version of hifiasm Hi-C mode that I posted above were:

Total size (Hap1 + Hap2 = 834 Mb)
- Hap1:
- 890 contigs
- 426.7 Mb
- N50 of 1.2 Mb
- Hap2:
- 707 contigs
- 407.3 Mb
- N50 of 1.2 Mb

Running the assembler v0.15-r327 without Hi-C hifiasm --min-hist-cnt 6 -o ${PREFIX}.asm had these stats:

Total size (Hap1 + Hap2 = 838 Mb)
- Hap1:
- 838 contigs
- 429.8 Mb
- N50 of 1.8 Mb
- Hap2:
- 595 contigs
- 411.4 Mb
- N50 of ?? Mb
So, it looks like Hi-C mode may be working better than the other modes at the moment.

chhylp123 commented 3 years ago

Actually hifiasm recalculates peaks during graph cleaning step, which is different with the k-mer peaks showing in the log file. [M::purge_dups] purge duplication coverage threshold: 47 indicates the real hom peak for phasing. I think it already identifies the right peak for phasing.

conchoecia commented 3 years ago

OK - I figured it was based on mapping coverage rather thank k-mer coverage. I'll scaffold these up to chromosome scale and let you know what I find.

chhylp123 commented 3 years ago

But the assembly size looks a little bit large. I worry hifiasm still misidentified some het regions as hom. Could you please set higher hom peak by '--purge-cov', e.g. 50? Could you please also show the 'p_utg.gfa'?

conchoecia commented 3 years ago

Sure, Here are the results below. Trial 4. Pretty much the same as trial 3. Column Hap1 + Hap2 Size (Mb) should be around 356 Mb (178 Mb * 2).	Trial Num	Hifiasm ver	HiC?	--min-hist-cnt	--purge-cov	Hap1 + Hap2 Size (Mb)	Hap1 size (Mb)	Hap1 # tigs	Hap 1 N50 (Mb)	Hap 2 size (Mb)	Hap 2 # tigs
1	v0.13?	yes	?	NA	834	426.7	890	1.12	407.3	707	1.2
2	0.15-r327	no	6	NA	841.2	429.8	838	1.8	411.4	595	?
3	0.15-r327	yes	6	NA	553.6	291.2	759	1.5	262.4	491	1.7
4	0.15-r327	yes	6	50	558.2	292.8	771	1.6	265.4	462	1.7

Here is p_utg.gfa:

I also cat'd the hap1 and hap2 fasta files, and scaffolded the output to chromosome-scale with Hi-C data to make a diploid assembly. I then made a Hi-C heatmap, pictured below. As you can see the chromsome-sized scaffolds look well-assembled. Closer inspection shows that there are some chunks missing from one haplotype a few chromosomes here and there, but overall it looks good.

A lot of the smaller scaffolds, however, have no Hi-C reads that map to them over hundreds of kb/ a Mb. This means that the assembler is outputting something that isn't real, or the scaffold is just hundreds of kb of repeats that are so information-poor that no reads could map.

Because of the chunks of haplotypes that seem like they are dropped with Hi-C mode, I still think I will be better off by running hifiasm without Hi-C data, then scaffolding.

chhylp123 commented 3 years ago

@conchoecia Sorry I missed your reply. For scaffolding, do you mean the balanced two haplotypes without Hi-C are better than phased haplotypes with Hi-C, or primary + alternate are better than than phased haplotypes with Hi-C? For Trial 2, did hifiasm find right peaks during graph cleaning?

conchoecia commented 3 years ago

@chhylp123 I mean that primary + alternate from hifiasm v0.13, concatenated together, then scaffolded was more complete than trial 4 above, hap1 + hap2 concatenated from hifiasm v0.15 ran with Hi-C data, then scaffolded. I am scaffolding the results of trial 2 right now (v0.15, no Hi-C), so am not sure if v0.15 w/ Hi-C or v0.15 without Hi-C is more complete.

For trial 2, and all the other trials, hifiasm did not identify the hom peak. For trial 2, and all other trials, hifiasm called the true het peak (19x cov) as the hom peak, and said that the het peak's value was -1.

Let me know if there's any results that I could share that would be helpful to you! Cheers, Darrin

chhylp123 commented 3 years ago

@conchoecia Thanks a lot. We are thinking the main challenge for phasing is that hifiasm cannot distinguish het regions and hom regions exactly. It seems it is more serious in your sample. Could you please share the bin files of hifiasm with us for debugging? It is very helpful for us.

conchoecia commented 3 years ago

I sent the files via ftp to your email, @chhylp123. Thank you.

chhylp123 commented 3 years ago

@conchoecia For small bubbles and tips in this local subgraph, is it possible to check they are assembly errors or somatic mutations? I checked some of bubbles and found one side has ordinary coverage while another side has very low coverage. I guess so many such small bubbles and tips affect the final results.

conchoecia commented 3 years ago

Hi @chhylp123 - can you possibly send me some sequences associated with this subgraph? Could be that it was bacterial, or some other bug's, comtamination that I was able to remove after Hi-C scaffolding. Otherwise, sure, I can look in my assemblies to see if there are possible errors. Thank you- Darrin

chhylp123 commented 3 years ago

There are too many such bubbles. I wonder all small bubbles and tigs at the subgraph in red circle are not real (it is the p_utg you showed with us). Besides, could you please run hifiasm with smaller '-s'? With -s0.3 the size of each haplotype should be reduced to 240Mb. Since the het rate is quite high, we need to use smaller ‘-s’ to find overlaps with low similarity.

chhylp123 commented 3 years ago

You can run minimap2 to do self-alignment on top of hap1 or hap2. With '-s0.3', it seems only the unitigs in that complex subgraph have overlaps. I'm not sure if they are somatic mutations. Probably I was wrong..

chhylp123 commented 3 years ago

@conchoecia I found the main problem is that there are no overlaps for a part of unitigs. With -s0.3 (similarity of 30%), hifiasm outputs haplotype 1 of 251Mb and haplotype 2 of 240Mb. There are 53Mb unitigs which do not have any overlaps so that they are partitioned into both haplotypes. In those 53Mb contigs, the total size of contigs which are longer than 1Mb is 15Mb. And the longest one is a circle of size 3.9Mb. I'm not sure what are those. Probably they are contaminations, or short contigs with very high het rate so that hifiasm cannot identify overlaps correctly.

conchoecia commented 2 years ago

Hi @chhylp123 - Thank you for your time looking at this graph! I've been going through old, unclosed issues and found this one. I'll close it for now since a lot of time has lapsed, but will open it again if I can get back to this issue in this assembly.

chhylp123 / hifiasm

Hifiasm doesn't find heterozygous kmer peak correctly - consequently HiC mode fails #78