chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Homotetraploid, super-large genome, with different parameters, the size of p_utg varies greatly? #632

Open GLking123 opened 6 months ago

GLking123 commented 6 months ago

Dear author, thank you for developing such a milestone piece of software, which greatly improves the efficiency of assembly.

I am currently assembling the large genome of a plant species, a homotetraploid with a genome size of approximately 55 gigabases (G). So far I only have HiFi data. I have tried three assembly strategies:

  1. hifiasm -t 120 -l 0: the generated .p_ctg.gfa file is 55G and the .p_utg.gfa file is 75G.
  2. hifiasm --n-hap 2 -t 120 -l 0: the generated .p_ctg.gfa file is 55G and the .p_utg.gfa file is 56G.
  3. hifiasm --n-hap 4 -t 120 -l 0: the generated .p_ctg.gfa file is 56G and the .p_utg.gfa file is 76G.

Using flow cytometry, the estimated genome size is approximately 50 G.

I used HapHiC to scaffold chromosomes, but encountered numerous errors. Perhaps using p_utg would yield better results?

So far, the p_utg size produced with --n-hap 2 is the one that meets expectations. Can this p_utg be used?

What is the difference between using --n-hap without specifying a number and using --n-hap 4? Why is the size of p_utg significantly larger when using --n-hap 4 compared to --n-hap 2?

The following is the k-mer graph generated by Hifiasm:

[Two screenshots of the k-mer plot: Snipaste_2024-04-01_12-14-23, Snipaste_2024-04-01_12-14-45]

For the above question, could you provide some debugging suggestions? Thank you for your valuable time and assistance. I sincerely look forward to your response!

chhylp123 commented 6 months ago

--n-hap is used to determine the coverage of heterozygous nodes or contigs. For your sample, hifiasm thinks the homozygous coverage is 26, so the heterozygous coverages are 26/2 = 13 and 26/4 ≈ 6 with --n-hap 2 and --n-hap 4, respectively. Hifiasm treats any node in the assembly graph whose coverage is above the heterozygous coverage threshold as a real node rather than a sequencing error. This is why --n-hap 4 leads to a larger graph.

Could you please try --hom-cov 55 together with --n-hap 2? Looking at the k-mer plot, there are only two peaks, so the homozygous coverage should be 55.
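The arithmetic in the reply above can be sketched in a few lines of Python. This is only an illustration of the division described in the thread, not hifiasm's actual code: the function name `het_coverage_threshold` is hypothetical, and the use of floor division is an assumption made so that the output matches the figure quoted in the reply (26/4 given as 6).

```python
def het_coverage_threshold(hom_cov: int, n_hap: int) -> int:
    """Heterozygous coverage as described in the reply: hom_cov / n_hap.

    Floor division is an assumption, chosen only so that the results
    match the numbers quoted in the thread. It is not taken from the
    hifiasm source code.
    """
    if hom_cov <= 0 or n_hap <= 0:
        raise ValueError("hom_cov and n_hap must be positive")
    return hom_cov // n_hap


# (hom_cov, n_hap) combinations discussed in the thread.
settings = [(26, 2), (26, 4), (55, 2)]
thresholds = {s: het_coverage_threshold(*s) for s in settings}

for (hom_cov, n_hap), thr in thresholds.items():
    print(f"hom_cov={hom_cov} n_hap={n_hap} -> heterozygous coverage ~{thr}")
```

A lower threshold lets more low-coverage nodes count as real, which is consistent with the larger p_utg reported for --n-hap 4.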

GLking123 commented 4 months ago

Dear author, I tried your suggestions, and here are the results:

hifiasm --n-hap 2 -t 120 -l 0 --hom-cov 55: the generated .p_ctg.gfa file is 56G and the .p_utg.gfa file is 66G.

Since my sample is a homotetraploid, which output should I use for the assembly, p_ctg or p_utg?

p_ctg N50: 100 Mb
p_utg N50: 1 Mb
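When comparing runs, it can help to compute the total sequence length and N50 directly from the GFA segments, since the size of a .gfa file on disk also includes non-sequence lines. The following is a minimal sketch, assuming standard GFA `S` lines with either an inline sequence or a `*` placeholder plus an `LN:i:` tag. The function names and the tiny demo input are made up for illustration and are not data from this thread.

```python
import io


def segment_lengths(gfa_lines):
    """Yield the length of each segment (S line) in a GFA stream."""
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # skip header, link, and other record types
        fields = line.rstrip("\n").split("\t")
        seq = fields[2]
        if seq != "*":
            yield len(seq)
            continue
        # Sequence omitted: fall back to the optional LN:i: tag.
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                yield int(tag[5:])
                break


def n50(lengths):
    """Length of the segment at which the running total, taken from
    longest to shortest, first reaches half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Tiny made-up GFA for illustration only.
demo = io.StringIO(
    "H\tVN:Z:1.0\n"
    "S\tutg1\tACGTACGTAC\n"
    "S\tutg2\t*\tLN:i:6\n"
    "S\tutg3\tACGT\n"
    "L\tutg1\t+\tutg2\t+\t0M\n"
)
lengths = list(segment_lengths(demo))
total = sum(lengths)
print(f"total={total} bp, N50={n50(lengths)} bp")
```

For a real assembly, the `io.StringIO` demo would be replaced by an open file handle on the .p_ctg.gfa or .p_utg.gfa output, so the file is streamed line by line rather than loaded into memory at once.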

I believe that increasing the depth of the HiFi data would not increase the N50.

For the above question, could you provide some debugging suggestions? Thank you for your valuable time and assistance. I sincerely look forward to your response!