chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Duplicated sequence "h1tg000001l" when `--n-hap 4` #489

Open zhangrengang opened 12 months ago

zhangrengang commented 12 months ago

When using --n-hap 4, all four hifiasm.hic.hap*.p_ctg.gfa files contain the same sequence IDs, such as "h1tg000001l".
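A possible workaround sketch (not a fix in hifiasm itself): when converting each haplotype GFA to FASTA, prefix every segment name with a haplotype label so the IDs stay unique after the files are merged. The function name `gfa_to_fasta` and the `hap1`/`hap2` prefixes are assumptions made for illustration; the only format detail relied on is that GFA segment records are tab-separated lines starting with `S`, followed by the name and the sequence.

```python
def gfa_to_fasta(gfa_lines, prefix):
    """Yield FASTA lines for each GFA segment (S-line), prefixing the ID.

    Hypothetical helper for illustration; not part of hifiasm.
    """
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Segment records: S <name> <sequence> [optional tags]
        if len(fields) < 3 or fields[0] != "S":
            continue
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # sequence not stored in the GFA; nothing to write
        yield f">{prefix}_{name}"
        yield seq


# Toy records standing in for two haplotype GFA files that reuse one ID.
hap1 = ["S\th1tg000001l\tACGT\tLN:i:4\n",
        "L\th1tg000001l\t+\th1tg000002l\t+\t0M\n"]
hap2 = ["S\th1tg000001l\tTTGA\tLN:i:4\n"]

combined = list(gfa_to_fasta(hap1, "hap1")) + list(gfa_to_fasta(hap2, "hap2"))
print("\n".join(combined))
```

For real data, an open file handle can be passed in place of the toy lists, with one distinct prefix per haplotype file.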

zhangrengang commented 12 months ago

I have another issue with --n-hap 4. The output actually amounts to about 8 haplotypes in size (2.8 Gb in total). With the default --n-hap 2, the output is 1.5 Gb, which is expected for our autotetraploid genome. However, the 1.5 Gb assembly is missing some large regions (homoeologous collapse), as confirmed by aligning it to the reference and analyzing the coverage depth.
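Totals like the ones above can be checked directly from a GFA file by summing segment lengths. The following is a minimal sketch, assuming standard GFA segment lines (`S`, name, sequence, optional tags) and falling back to the `LN:i:` tag when the sequence is stored as `*`. The function name `gfa_total_length` is hypothetical.

```python
def gfa_total_length(gfa_lines):
    """Return the summed length of all segments (S-lines) in a GFA.

    Hypothetical helper for illustration; not part of hifiasm.
    """
    total = 0
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue
        seq = fields[2]
        if seq != "*":
            total += len(seq)
            continue
        # Sequence omitted: use the LN:i:<length> tag if one is present.
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                total += int(tag[len("LN:i:"):])
                break
    return total


# Toy records: one segment with sequence, one with only a length tag.
toy_gfa = ["S\tctg1\tACGT\n",
           "S\tctg2\t*\tLN:i:10\n",
           "L\tctg1\t+\tctg2\t+\t0M\n"]
total_bp = gfa_total_length(toy_gfa)
print(total_bp)
```

Running this on each hap*.p_ctg.gfa file and adding the results gives the combined size to compare against the expected genome size.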

baozg commented 12 months ago

Identical h1tg IDs across all haplotypes is a known bug that we also see when using hifiasm on tetraploid potato. But I have never seen hifiasm output 8 haplotypes with --n-hap 4. Do you have all the logs for this run? As far as I know, HiC-based phasing for polyploids is still very unstable; it depends on the distribution of heterozygous variants in the autotetraploid.

chhylp123 commented 11 months ago

Yes, I agree with @baozg. Do you have the log file for hifiasm?

monian1113 commented 11 months ago

Hi, I ran into the same problem. My genome is a triploid; k-mer analysis predicts a genome size of around 700M for a single haplotype, so the whole genome should be 2~2.1G. When I use version 0.19.5-r587 with the parameters "--n-hap 3 --h1 hic_R1.fastq --h2 hic_R2.fastq", the results are: hifi.hic.hap1.p_ctg.gfa.fa, 1.5G; hifi.hic.hap2.p_ctg.gfa.fa, 1008M; hifi.hic.hap3.p_ctg.gfa.fa, 825M; hifi.hic.p_ctg.gfa.fa, 1.5G; hifi.hic.p_utg.gfa.fa, 2.3G; homozygous read coverage threshold: 33. When I then add "--hom-cov 17", the results are: hifi.hic.hap1.p_ctg.gfa.fa, 2.0G; hifi.hic.hap2.p_ctg.gfa.fa, 2.0G; hifi.hic.hap3.p_ctg.gfa.fa, 2.0G; hifi.hic.p_ctg.gfa.fa, 2.1G; hifi.hic.p_utg.gfa.fa, 2.3G. Judging by the size of each hap, it looks like each hap contains all 3 sets of sequences. Am I using the parameters incorrectly?
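The size comparison in this comment can be written out as simple arithmetic: divide each haplotype file size by the expected single-haplotype size (about 700M from the k-mer estimate). This is only a rough sanity check under the assumption that file size in MB approximates sequence length in Mb, with 1.5G and 2.0G read as 1500 and 2000. The function name `size_ratios` is hypothetical.

```python
def size_ratios(hap_sizes_mb, haploid_size_mb):
    """Return each haplotype size as a multiple of the expected haploid size."""
    return {name: round(size / haploid_size_mb, 2)
            for name, size in hap_sizes_mb.items()}


EXPECTED_HAPLOID_MB = 700  # k-mer estimate quoted in the comment

# Sizes reported for 0.19.5-r587 with --n-hap 3 (auto hom-cov of 33).
run_default = {"hap1": 1500, "hap2": 1008, "hap3": 825}
# Sizes reported after adding --hom-cov 17.
run_hom_cov_17 = {"hap1": 2000, "hap2": 2000, "hap3": 2000}

ratios_default = size_ratios(run_default, EXPECTED_HAPLOID_MB)
ratios_hom_cov_17 = size_ratios(run_hom_cov_17, EXPECTED_HAPLOID_MB)
print(ratios_default)
print(ratios_hom_cov_17)
```

A ratio near 1 would mean a file holds roughly one haplotype; ratios near 3 are consistent with the suspicion that each hap file contains all three sets of sequences.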

Also, when I use version 0.16.1-r375 with the parameters "--n-hap 3 --h1 hic_R1.fastq --h2 hic_R2.fastq", the results are: hifi_hic.hic.hap1.p_ctg.fa, 657M; hifi_hic.hic.hap2.p_ctg.fa, 1.5G; hifi_hic.hic.p_ctg.gfa.fa, 1.5G; hifi_hic.hic.p_utg.fa, 2.2G; hifi_hic.hic.r_utg.gfa.fa, 2.2G. The hap1 and hap2 sizes are consistent with my AAB triploid genome. When I use p_utg for 3d-dna, the sequence is too fragmented and there are collapsed regions, so I combined hap1 and hap2 and then ran 3d-dna. The results look good. Is combining hap1 and hap2 before scaffolding an appropriate approach?

chhylp123 commented 11 months ago

HiC phased triploid assembly is still tricky. If --n-hap 3 doesn't work well, could you please try the normal diploid assembly, and then use 3d-dna to manually fix the duplications?

monian1113 commented 11 months ago

Many thanks. I think I may also have misunderstood "hom-cov": when I change the parameters to "--n-hap 3 --hom-cov 51", the total size is as expected but there are indeed duplications. This also occasionally occurs when I use the diploid mode of 0.16.1-r375 and scaffold with "hap1+hap2". What could be the reasons for this? [image: 微信图片_20230808093039]

Overall, I think there are four options now. Which one do you recommend?

  1. "p_utg of 0.16.1-r375", which is very fragmented, with a large number of collapsed regions;
  2. "hap1+hap2 of 0.16.1-r375", with localized duplications;
  3. "p_utg of 0.19.5", which is also very fragmented, and much larger in size than "p_utg of 0.16.1-r375";
  4. "hap1+hap2+hap3 of 0.19.5", with localized duplications.