optimizing unitig size and amount of allelic sequences

dcopetti commented 2 years ago

Hello, I am assembling a highly heterozygous plant genome (haploid genome of ~2.6 Gb) from about 42x HiFi coverage (21x per allele). My priority is to have phased sequences to scaffold later on, so I am just focusing on obtaining correct (containing no phase switches) unitigs, reaching as close as possible to the diploid size of the genome (see https://github.com/lh3/minimap2/issues/801) I attach one of the stdout files so that you can see the k-mer distribution.

sikem_hifiasm_l3s50k61_stdout.txt I did some assemblies with the command hifiasm -t 200 -l3 -o l3s50 -s50 -k61 --hg-size 2600m, always keeping -l3, playing with -s (from 55 to 30), and -k (51, 61, 63).

The results I get are good, compared to sequencing similar genomes with other technologies:

| l3s50k51.bp.p_utg | l3s50k61.bp.p_utg | l3s50k63.bp.p_utg -- | -- | -- | -- total_length | 4,594,997,065 | 4,609,483,986 | 4,611,185,453 number | 17,103 | 17,867 | 17,952 mean_length | 268,666 | 257,989 | 256,862 longest | 24,074,508 | 14,360,724 | 14,254,557 shortest | 7,073 | 5,978 | 3,768 N50 | 2,410,076 | 2,154,964 | 2,130,981 N50n | 508 | 602 | 602 N70 | 1,150,861 | 1,046,279 | 1,052,370 N70n | 1,055 | 1,209 | 1,211 N90 | 88,462 | 80,864 | 79,253 N90n | 3,652 | 4,077 | 4,102

And the KAT plots look also very good, with almost no k-mers missing and very little collapsed homozygous regions

Capture

I wonder if there is any parameter I should focus on to maximize (in this order of priority) phasing accuracy, total assembly size (i.e. representation of non-error read's k-mers in the assembly), and Nx values.

What I have seen so far: At the error correction stage, higher k values increase total size and # of unitigs, decreasing Nx (OK with the latter). At the assembly stage, total size increases and Nx decrease only between 55 and 50, then at 40, 35, and 30 they all remain the same.

Where else can I act to get highly accurate unitigs? I would not mind increasing rounds of correction or graph cleaning (one assembly takes ~8 hours), provided the assembly that results is the good one. Thanks, Dario

dcopetti commented 2 years ago

another way to look at the (impressive!) quality of these assemblies and where I would like to act:

Capture

This is the distribution of the coverage upon realignment of the error corrected reads against the unitigs (mean coverage in 10 kb windows). Basically, I would like to minimize the small hump above x = 38. Those are (regions of) unitigs that have twice as many reads aligned against them, so are potential homozygous regions that are (still) collapsed. I am aware that such fraction can not be zero, but I would like to make sure I do all that is possible to minimize it. I think that with the high accuracy of the input HiFi reads (QV>31) at the current coverage, a single SNP can distinguish the two alleles of a read. Thoughts?

chhylp123 commented 2 years ago

Cool results and thanks for sharing.

It seems you are focusing on the p_utg so that -l and -s won't affect the results. -l and '-s' are designed for purging. p_utg might be also affected by the option --hom-cov. The p_utg is generated from r_utg by collapsing somatic mutations with coverage less than the homozygous coverage threshold determined by --hom-cov. So larger value for --hom-cov will make unitigs longer at the expense of more collapsed regions. But I guess the p_utg won't change a lot unless the homozygous coverage threshold is totally wrong.

If you want to get phased sequences, I'd recommend you to have a try with the Hi-C module of hifiasm. Here is an example on a complex plant sample: https://github.com/baozg/phased-assembly-check#3-quality-check. As we can see the HiC (0.16.1) assembly is even better than trio-binning assembly in terms of phasing.

chhylp123 commented 2 years ago

And here is the preprint including the results of human and animals: https://arxiv.org/abs/2109.04785

dcopetti commented 2 years ago

Thanks for the feedback! Unfortunately we don't have Hi-C data (probably we will get it though), and this attempt is for a different way of scaffolding in a phased manner. Regarding --hom-cov, is it better to let the assembler choose it, or should I direct it towards a value? If bubbles are collapsed when the coverage is below this value, is it the case to move it to the heterozygous peak instead? (if I got it right from your explanation of how it works) Or would tweaking --hg-size help?

With the goal of having error corrected reads with as many k-mers from true SNPs and as less k-mers from sequencing errors as possible, is it worth to edit any of the parameters in the error correction step? thanks!

chhylp123 commented 2 years ago

Actually I do think your assembly is already pretty good.

Regarding --hom-cov, is it better to let the assembler choose it, or should I direct it towards a value? ------ From the log file, probably you can mannually set --hom-cov 42 during assembly, since hifiasm automatically identified the homozygous coverage threshold as 39. But I guess it won't affect results too much.
If bubbles are collapsed when the coverage is below this value, is it the case to move it to the heterozygous peak instead? ------ Yes, and there shouldn't be too many such bubbles. So I guess it won't affect results too much.
With the goal of having error corrected reads with as many k-mers from true SNPs and as less k-mers from sequencing errors as possible, is it worth to edit any of the parameters in the error correction step? ------ For correction, probably you can try -D10. This option might be helpful to resolve more segdups instead of phasing.

chhylp123 / hifiasm

optimizing unitig size and amount of allelic sequences #190