Closed conchoecia closed 2 years ago
Misidentifying het/hom k-mer peak usually won't have a big effect.
I already have generated a diploid, chromosome-scale assembly for this individual.
How did you do the assembly?
If you generate primary/alternate assembly without Hi-C, what is the size of the primary assembly and alternate assembly, respectively?
Never mind. I see numbers in #55: 420Mb for primary and 28Mb for alternate if this is the same sample. Hifiasm currently doesn't handle this case well. We are aware of that and trying to improve.
I will expose an option to let users manfully set hom peak as soon as possible. In Hi-C mode, hifiasm needs to use homozygous peak to make sure which unitig is homozygous. In this example, all untigs are identified as homozygous unitigs, so that all of them are assigned to both haplotypes. #55 is caused by stringent threshold in purge_dups, and it also affects the samples with high heterozygous here. I will also fix this problem in the next version.
Hi Dr. Li and Dr. Cheng - thank you for your message. Yes, those are the right numbers from #55, 420Mb for primary and 28Mb for alternate.
The way that I assembled the chromosome-scale diploid genome was to:
Let me know if you are interested in taking a look at this dataset or my assembly results. Thank you!
Sorry for the late reply. I will push a new version to repo soon and hopefully it can fix the purge_dups and hic issues for your sample.
Hey there, just thought I would give an update with the latest hifiasm release -- 0.15-r327.
The software still identifies the het/homs peaks as -1/19 instead of the correct 19/38. The command was hifiasm --min-hist-cnt 6 --h1 ${HICF} --h2 ${HICR} -o ${PREFIX}.asm -t 48 ${READS}
.
[M::ha_analyze_count] lowest: count[6] = 3564264
[M::ha_analyze_count] highest: count[19] = 9332182
[M::ha_hist_line] 2: ****************************************************************************************************> 17409506
[M::ha_hist_line] 3: ******************************************************************************* 7372953
[M::ha_hist_line] 4: ************************************************* 4607077
[M::ha_hist_line] 5: *************************************** 3640198
[M::ha_hist_line] 6: ************************************** 3564264
[M::ha_hist_line] 7: ************************************** 3565047
[M::ha_hist_line] 8: **************************************** 3775893
[M::ha_hist_line] 9: ******************************************** 4136235
[M::ha_hist_line] 10: ************************************************* 4565214
[M::ha_hist_line] 11: ******************************************************* 5159593
[M::ha_hist_line] 12: *************************************************************** 5882220
[M::ha_hist_line] 13: *********************************************************************** 6582537
[M::ha_hist_line] 14: ****************************************************************************** 7277044
[M::ha_hist_line] 15: ************************************************************************************* 7949421
[M::ha_hist_line] 16: ******************************************************************************************** 8564319
[M::ha_hist_line] 17: ************************************************************************************************* 9078081
[M::ha_hist_line] 18: *************************************************************************************************** 9272016
[M::ha_hist_line] 19: **************************************************************************************************** 9332182
[M::ha_hist_line] 20: ************************************************************************************************** 9182521
[M::ha_hist_line] 21: *********************************************************************************************** 8879872
[M::ha_hist_line] 22: ***************************************************************************************** 8306297
[M::ha_hist_line] 23: ********************************************************************************* 7571667
[M::ha_hist_line] 24: ************************************************************************ 6685662
[M::ha_hist_line] 25: *************************************************************** 5840484
[M::ha_hist_line] 26: ****************************************************** 5031622
[M::ha_hist_line] 27: ********************************************* 4221481
[M::ha_hist_line] 28: ************************************** 3534484
[M::ha_hist_line] 29: ******************************* 2896339
[M::ha_hist_line] 30: ************************** 2442176
[M::ha_hist_line] 31: ********************** 2074668
[M::ha_hist_line] 32: ******************** 1820350
[M::ha_hist_line] 33: ****************** 1644839
[M::ha_hist_line] 34: ***************** 1558284
[M::ha_hist_line] 35: **************** 1466842
[M::ha_hist_line] 36: **************** 1448174
[M::ha_hist_line] 37: *************** 1430428
[M::ha_hist_line] 38: **************** 1465001
[M::ha_hist_line] 39: **************** 1451824
[M::ha_hist_line] 40: *************** 1434795
[M::ha_hist_line] 41: *************** 1382950
[M::ha_hist_line] 42: ************** 1336731
[M::ha_hist_line] 43: ************** 1266842
[M::ha_hist_line] 44: ************* 1198460
[M::ha_hist_line] 45: ************ 1134650
[M::ha_hist_line] 46: *********** 1058577
[M::ha_hist_line] 47: ********** 954925
[M::ha_hist_line] 48: ********* 869224
[M::ha_hist_line] 49: ********* 796278
[M::ha_hist_line] 50: ******** 706943
[M::ha_hist_line] 51: ******* 621503
[M::ha_hist_line] 52: ****** 549222
[M::ha_hist_line] 53: ***** 476795
[M::ha_hist_line] 54: ***** 428266
.
.
.
[M::ha_hist_line] rest: ******************************************* 3985128
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_ft_gen] peak_hom: 19; peak_het: -1
[M::ha_ft_gen::429.341*7.80@20.853GB] ==> filtered out 3811358 k-mers occurring 95 or more times
[M::ha_opt_update_cov] updated max_n_chain to 100
[M::ha_pt_gen::624.807*7.41] ==> counted 18097002 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[6] = 152684
You can manually set hom peak by '--purge-cov'. It is hard for hifiasm to get right peak in this case. Please note it just affects phasing and graph cleaning, so that no need to run the whole assembly workflow.
By the way, could you please show the following number in the log file?
[M::purge_dups] purge duplication coverage threshold:
Hi @chhylp123 - Where in the k-mer spectrum should --purge-cov
be set? To the left of the het peak, In the trough between the het peak and the hom peak? To the right of the hom peak?
There were two instances of [M::purge_dups] purge duplication coverage threshold: 47
in the log file for the v0.15-r327 run.
Good news on the genome assembly side of things. The assembly stats for v0.15-r327 were closer to the estimated genome size of 174 Mb (348 Mb for both haplotypes) (from k-mer spectrum):
For comparison, the assembly stats for the previous version of hifiasm
Hi-C mode that I posted above were:
Running the assembler v0.15-r327 without Hi-C hifiasm --min-hist-cnt 6 -o ${PREFIX}.asm
had these stats:
Total size (Hap1 + Hap2 = 838 Mb)
So, it looks like Hi-C mode may be working better than the other modes at the moment.
Actually hifiasm recalculates peaks during graph cleaning step, which is different with the k-mer peaks showing in the log file. [M::purge_dups] purge duplication coverage threshold: 47
indicates the real hom peak for phasing. I think it already identifies the right peak for phasing.
OK - I figured it was based on mapping coverage rather thank k-mer coverage. I'll scaffold these up to chromosome scale and let you know what I find.
But the assembly size looks a little bit large. I worry hifiasm still misidentified some het regions as hom. Could you please set higher hom peak by '--purge-cov', e.g. 50? Could you please also show the 'p_utg.gfa'?
Sure, Here are the results below. Trial 4. Pretty much the same as trial 3. Column Hap1 + Hap2 Size (Mb) should be around 356 Mb (178 Mb * 2). | Trial Num | Hifiasm ver | HiC? | --min-hist-cnt | --purge-cov | Hap1 + Hap2 Size (Mb) | Hap1 size (Mb) | Hap1 # tigs | Hap 1 N50 (Mb) | Hap 2 size (Mb) | Hap 2 # tigs | Hap2 N50 (Mb) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | v0.13? | yes | ? | NA | 834 | 426.7 | 890 | 1.12 | 407.3 | 707 | 1.2 | |
2 | 0.15-r327 | no | 6 | NA | 841.2 | 429.8 | 838 | 1.8 | 411.4 | 595 | ? | |
3 | 0.15-r327 | yes | 6 | NA | 553.6 | 291.2 | 759 | 1.5 | 262.4 | 491 | 1.7 | |
4 | 0.15-r327 | yes | 6 | 50 | 558.2 | 292.8 | 771 | 1.6 | 265.4 | 462 | 1.7 |
Here is p_utg.gfa
:
I also cat
'd the hap1 and hap2 fasta files, and scaffolded the output to chromosome-scale with Hi-C data to make a diploid assembly. I then made a Hi-C heatmap, pictured below. As you can see the chromsome-sized scaffolds look well-assembled. Closer inspection shows that there are some chunks missing from one haplotype a few chromosomes here and there, but overall it looks good.
A lot of the smaller scaffolds, however, have no Hi-C reads that map to them over hundreds of kb/ a Mb. This means that the assembler is outputting something that isn't real, or the scaffold is just hundreds of kb of repeats that are so information-poor that no reads could map.
Because of the chunks of haplotypes that seem like they are dropped with Hi-C mode, I still think I will be better off by running hifiasm
without Hi-C data, then scaffolding.
@conchoecia Sorry I missed your reply. For scaffolding, do you mean the balanced two haplotypes without Hi-C are better than phased haplotypes with Hi-C, or primary + alternate are better than than phased haplotypes with Hi-C? For Trial 2, did hifiasm find right peaks during graph cleaning?
@chhylp123 I mean that primary + alternate from hifiasm v0.13, concatenated together, then scaffolded
was more complete than trial 4 above, hap1 + hap2 concatenated from hifiasm v0.15 ran with Hi-C data, then scaffolded
. I am scaffolding the results of trial 2 right now (v0.15, no Hi-C), so am not sure if v0.15 w/ Hi-C
or v0.15 without Hi-C
is more complete.
For trial 2, and all the other trials, hifiasm did not identify the hom peak. For trial 2, and all other trials, hifiasm called the true het peak (19x cov) as the hom peak, and said that the het peak's value was -1.
Let me know if there's any results that I could share that would be helpful to you! Cheers, Darrin
@conchoecia Thanks a lot. We are thinking the main challenge for phasing is that hifiasm cannot distinguish het regions and hom regions exactly. It seems it is more serious in your sample. Could you please share the bin files of hifiasm with us for debugging? It is very helpful for us.
I sent the files via ftp to your email, @chhylp123. Thank you.
@conchoecia For small bubbles and tips in this local subgraph, is it possible to check they are assembly errors or somatic mutations? I checked some of bubbles and found one side has ordinary coverage while another side has very low coverage. I guess so many such small bubbles and tips affect the final results.
Hi @chhylp123 - can you possibly send me some sequences associated with this subgraph? Could be that it was bacterial, or some other bug's, comtamination that I was able to remove after Hi-C scaffolding. Otherwise, sure, I can look in my assemblies to see if there are possible errors. Thank you- Darrin
There are too many such bubbles. I wonder all small bubbles and tigs at the subgraph in red circle are not real (it is the p_utg you showed with us). Besides, could you please run hifiasm with smaller '-s'? With -s0.3 the size of each haplotype should be reduced to 240Mb. Since the het rate is quite high, we need to use smaller ‘-s’ to find overlaps with low similarity.
You can run minimap2 to do self-alignment on top of hap1 or hap2. With '-s0.3', it seems only the unitigs in that complex subgraph have overlaps. I'm not sure if they are somatic mutations. Probably I was wrong..
@conchoecia I found the main problem is that there are no overlaps for a part of unitigs. With -s0.3 (similarity of 30%), hifiasm outputs haplotype 1 of 251Mb and haplotype 2 of 240Mb. There are 53Mb unitigs which do not have any overlaps so that they are partitioned into both haplotypes. In those 53Mb contigs, the total size of contigs which are longer than 1Mb is 15Mb. And the longest one is a circle of size 3.9Mb. I'm not sure what are those. Probably they are contaminations, or short contigs with very high het rate so that hifiasm cannot identify overlaps correctly.
Hi @chhylp123 - Thank you for your time looking at this graph! I've been going through old, unclosed issues and found this one. I'll close it for now since a lot of time has lapsed, but will open it again if I can get back to this issue in this assembly.
Hi Haoyu,
This is related to #55, where I noticed that hifiasm was not correctly identifying the heterozygous and homozygous peaks. The animal that I'm working with is very heterozygous (>4%), so the kmer spectrum has a very large het peak, and the hom peak is much smaller. I know that the animal is not polyploid, as I already have generated a diploid, chromosome-scale assembly for this individual.
I ran the new version of hifiasm with these parameters:
And I was seeing that hifiasm was misidentifying the heterozygous peak as homozygous:
At the end of the run, the
.gfa
files for hap1 and hap2 both were twice the size (~400Mbp) of the haploid genome size (~200Mbp). I think that for this to work correctly, the assembler will need to know which kmers that are from the 1x (het) peak to get the correct haplotype binning.