chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Issues with Hexaploid Assembly #659

Open BioOmics opened 5 months ago

BioOmics commented 5 months ago

Hi, @chhylp123

We are currently attempting to assemble a hexaploid using hifiasm with the following command:

hifiasm -t 100 -l3 --n-hap 6 --dual-scaf --ul ${ont} --h1 ${hic1} --h2 ${hic2} ${hifi}

The results are as follows: [image: hap]

Our questions

Why do we have 4 haplotypes (hapN.p_ctg) of ~670MB each (boxed in red) and 2 haplotypes of ~520MB each (boxed in green), while the p_ctg is ~320MB (boxed in blue)? Does this indicate that our hexaploid species has a ploidy composition of AAAABB? Furthermore, if we take A to be roughly 670MB and B to be roughly 520MB, how can we account for the p_ctg size of approximately 320MB?
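For anyone who wants to reproduce this kind of size comparison, the total length of an assembly can be tallied straight from a GFA file. This is only a minimal sketch: it assumes the usual GFA layout in which segment (S) lines carry the contig sequence in the third tab-separated column, and `toy.p_ctg.gfa` is a made-up toy file, not one of our outputs.

```shell
# Toy GFA standing in for an output such as *.hap1.p_ctg.gfa (hypothetical content).
printf 'S\tctg1\tACGTACGT\nS\tctg2\tACGT\nL\tctg1\t+\tctg2\t+\t0M\n' > toy.p_ctg.gfa

# Sum the length of the sequence field (column 3) over all segment (S) lines.
total=$(awk -F'\t' '$1 == "S" { sum += length($3) } END { print sum + 0 }' toy.p_ctg.gfa)
echo "total bases: ${total}"
```

Running the same awk line on each hapN.p_ctg.gfa and on the p_ctg.gfa gives the per-file totals to compare.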

For context, here is an assessment of the species' genome size (~221MB) and heterozygosity (5.71%), which might help in understanding our query: [image: size]
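A rough sanity check on these numbers, written out as shell arithmetic. It is a back-of-the-envelope sketch only, and it rests on one assumption that may not hold: that the ~221MB estimate is the size of a single haplotype copy. The haplotype sizes are the approximate values quoted in the question above.

```shell
# Approximate sizes in MB, as quoted in this post.
hap_total=$(( 4 * 670 + 2 * 520 ))   # four ~670MB and two ~520MB haplotypes

# Assumption: the ~221MB estimate is the size of ONE haplotype copy.
expected_total=$(( 6 * 221 ))

echo "sum of hapN.p_ctg: ${hap_total} MB"
echo "6 x estimated size: ${expected_total} MB"
```

Under that assumption the six haplotype files together are much larger than six copies of the estimated genome, which is part of what puzzles us.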

Here is another result, obtained with -l0:

hifiasm -t 100 -l0 --n-hap 6 --dual-scaf --ul ${ont} --h1 ${hic1} --h2 ${hic2} ${hifi}

[image: l0]

If possible, could you advise which hifiasm command would give the best genome assembly, or at least a basic primary assembly?

We are eagerly looking forward to your reply.

xiekunwhy commented 4 months ago

I also hope someone can answer this question and explain the expected hap sizes and the size of the mixed assembly (.hic.p_ctg.fa).

baozg commented 4 months ago

If you have read the hifiasm-UL paper, imbalanced haplotype sizes with HiC phasing are a known issue. Currently, hifiasm still cannot phase autopolyploid genomes with HiC. But if you have additional information (a genetic map), or phasing from HapHiC or AllHiC, you could use -5 to reassign the haplotypes and get a more contiguous assembly. Even with HiFi and UL data, HiC phasing is still difficult for diplotigs and triplotigs in the tetraploid potato.

For the polyploid genome assembly, the main limitation of our current algorithm is that it requires genetic map information from progeny. To address this issue, we implemented an experimental single-sample approach using Hi-C phasing, and applied it to the autotetraploid potato dataset. This resulted in four haplotype assemblies, which have slightly worse phasing accuracy and contiguity in comparison to the genetic map-based assemblies. However, the four Hi-C phased haplotype assemblies are imbalanced, with one assembly being 20% larger than the others.

awesomedeer commented 3 months ago

Hi, @baozg, can I ask a question about tetraploids related to this issue? For a tetraploid, should --n-hap be set to 4 or 2? Or does it depend on whether it is an auto- or allopolyploid?

Best regards, Song

baozg commented 3 months ago

--n-hap should be set to the actual ploidy level; the default assumes a diploid. For a tetraploid, you should set --n-hap 4, which is how I ran it for potato.
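As an illustration, a HiFi-only tetraploid run could look roughly like the sketch below. The thread count, output prefix and read file name are placeholders, not the values from the potato run; only the --n-hap setting reflects the advice above. The snippet just builds and prints the command so it can be checked before running.

```shell
# Hypothetical inputs: thread count, output prefix and read file are placeholders.
ploidy=4                      # tetraploid, as for potato; use 6 for a hexaploid
reads="reads.hifi.fq.gz"      # assumed HiFi read file name
cmd="hifiasm -t 32 --n-hap ${ploidy} -o asm ${reads}"

# Print the command for inspection; run it with: eval "$cmd"
echo "$cmd"
```

Keeping the ploidy in a variable makes it harder to forget the flag and silently fall back to the diploid default.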