chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads

Tetraploid with high heterozygosity #708

Open Liyong-Zhang opened 5 days ago

Liyong-Zhang commented 5 days ago

Hi there,

I am using Hifiasm (version 0.19.7-r598) to assemble a plant genome (2n=28) with HiFi and Hi-C data.
I first assumed the plant was diploid and ran: hifiasm -o OS010681.asm -t 64 --h1 hic_r1.fastq.gz --h2 hic_r2.fastq.gz OS010681.hifi.fq.gz

I then used the resulting file, OS010681.asm.hic.p_ctg.gfa, to run mummerplot against the A. thaliana genome as the reference (attached image: mummerplot_v6).

Based on the mummerplot, this plant looks like a tetraploid. To check the heterozygosity rate, I ran GenomeScope2 on the HiFi reads (attachments: p4_transformed_linear_plot, p4_summary.txt); the estimated heterozygosity rate is quite high (~8%).

I then re-read the FAQ before re-running the assembly. One section (https://hifiasm.readthedocs.io/en/latest/faq.html#which-types-of-assemblies-should-i-use) says “if Hi-C data is available, hic.hap.p_ctg.gfa produced in Hi-C mode is the best choice”, and another (https://hifiasm.readthedocs.io/en/latest/faq.html#are-polyploid-genomes-supported) says “The r_utg.gfa and p_utg.gfa are lossless so that they also work for polyploid genomes. However, currently the contig-generation modules of hifiasm are designed for diploid samples, which means both the partially phased assembly and the fully-phased assembly does not directly support polyploid genomes”.

I also referred to issue #571 and then re-ran hifiasm with the command: hifiasm -o OS010681.asm.v2 -t 64 -s 0.25 --n-hap 4 --h1 hic_r1.fastq.gz --h2 hic_r2.fastq.gz OS010681.hifi.fq.gz. This produced OS010681.asm.v2.hic.p_ctg.gfa (276M) and four haplotype files: OS010681.asm.v2.hic.hap1.p_ctg.gfa (296M), OS010681.asm.v2.hic.hap2.p_ctg.gfa (267M), OS010681.asm.v2.hic.hap3.p_ctg.gfa (276M), and OS010681.asm.v2.hic.hap4.p_ctg.gfa (353M).

In #431, you mentioned that “If you have HiC reads, the latest release Hifiasm-0.19.3-r572 will give you 4 haplotypes. But the results might be not perfect right now”. I am now quite confused: which assembly files should I use for scaffolding with yahs?
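For reference, yahs takes contigs in FASTA format, so whichever GFA turns out to be the right choice would first need converting. A minimal sketch, adapted from the awk one-liner in the hifiasm README; the function name gfa_to_fasta is only illustrative and not part of hifiasm or yahs:

```shell
# Hypothetical helper: write every GFA segment (S line) as a FASTA record.
# Assumes tab-separated GFA with the contig name in column 2 and the
# sequence in column 3, as in hifiasm's *.p_ctg.gfa output.
gfa_to_fasta() {
  awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}
```

It would be used as, for example, gfa_to_fasta OS010681.asm.v2.hic.hap1.p_ctg.gfa > OS010681.asm.v2.hic.hap1.p_ctg.fa, once per file to be scaffolded.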

I would also appreciate your help with the following questions:

Q1. I noticed that OS010681.asm.v2.hic.p_ctg.gfa (276M) is much smaller than OS010681.asm.hic.p_ctg.gfa (387M) from the previous run. What causes this difference?
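The sizes above are file sizes, which are only a rough proxy because a GFA also carries non-sequence lines; the number of contigs and their total length may be a fairer comparison. A minimal sketch, assuming each segment (S) line holds its sequence in the third tab-separated column; the function name gfa_stats is illustrative:

```shell
# Hypothetical helper: print "<contig count><TAB><total bases>" for a GFA file.
# Only S (segment) lines are counted; link and other line types are ignored.
gfa_stats() {
  awk -F'\t' '$1 == "S" { n++; total += length($3) }
              END { printf "%d\t%.0f\n", n, total }' "$1"
}
```

Running gfa_stats on OS010681.asm.hic.p_ctg.gfa and on OS010681.asm.v2.hic.p_ctg.gfa would show whether the two runs differ in assembled bases or only in file size.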

Q2. The FAQ (https://hifiasm.readthedocs.io/en/latest/faq.html#are-polyploid-genomes-supported) says “The r_utg.gfa and p_utg.gfa are lossless so that they also work for polyploid genomes”. What is the difference between p_utg.gfa and p_ctg.gfa, and how could I use the information in p_utg.gfa for my polyploid assembly?

Q3. In #431, you mentioned “manually set --hom-cov to the homozygous coverage”. How large is the impact of setting --hom-cov manually? If possible, could you also give more detail on how to calculate the homozygous coverage?

Q4. In https://github.com/chhylp123/hifiasm/issues/537, you mentioned that “-l0 is designed for the homozygous sample, which will disable diploid phasing. Please do not use -l0 for the Hi-C phasing”. What is the default value of -l when hifiasm runs a Hi-C assembly?

Sorry for the long list of questions, and thank you so much for your help!