chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Highly heterozygous genome phased by hifiasm, wondered added value of trio-binning? #46

Closed zhaotao1987 closed 3 years ago

zhaotao1987 commented 4 years ago

Attachment: hifiasm.log.txt

Hi,

Thanks for the amazing tool. I'm fairly new to genome assembly and have just tried hifiasm on public SRA data. With default settings I got what look like very nice assemblies. The species is apple, which is known to be highly heterozygous. The total size of p_ctg is very reasonable (693 Mb) with an N50 over 30 Mb, and my alternate assembly (a_ctg) is around 589 Mb, which I think is also quite reasonable. The k-mer distribution shows two obvious peaks. (Could you please take a quick look at the attached log to check that everything is okay? Thanks!)

My first question: does p_ctg represent one full haplotype? As I understand it, p_ctg + a_ctg + collapsed homozygous regions = 2c? How can I generate the other haplotype, consisting of a_ctg plus the collapsed homozygous regions? My second question: how much would the assembly improve if short Illumina reads from the parents were also available? (I think it's already quite good, so perhaps trio-binning adds value in some other way?) That said, it is sometimes hard to know the exact parents of the species being sequenced, especially wild ones.

Thank you so much.

chhylp123 commented 4 years ago

P_ctg represents one complete set of haplotype sequence, but its contigs may switch between the two haplotypes. I guess it is usually used as the reference. If parental short reads are available, hifiasm can output a fully phased assembly, i.e., two fully phased haplotypes. We think a fully phased assembly could also be generated with Hi-C or Strand-seq when parental data is not available, but for now hifiasm itself cannot do that automatically. If you can obtain phased, partitioned lists of HiFi reads using Hi-C, then feeding them to hifiasm with the options '-3' and '-4' can also produce a fully phased assembly.
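
For illustration only, here is a minimal Python sketch of how the two read-name lists for '-3' and '-4' might be prepared. It assumes you already have a tab-separated table of `read_name<TAB>label` produced by some external phasing step. That table format, the labels "1" and "2", and the function name are hypothetical and not something hifiasm defines. It also assumes each list holds one read name per line.

```python
import csv
import io

def split_reads_by_haplotype(assignments, hap1_out, hap2_out):
    """Split read names into two lists according to a haplotype label.

    assignments: iterable of text lines "read_name<TAB>label"
                 (hypothetical format; labels "1" and "2" are assumed).
    hap1_out, hap2_out: writable text streams, one read name per line.
    Returns a dict with the number of reads written to each list and the
    number left unassigned.
    """
    counts = {"hap1": 0, "hap2": 0, "unassigned": 0}
    for row in csv.reader(assignments, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        name, label = row[0], row[1]
        if label == "1":
            hap1_out.write(name + "\n")
            counts["hap1"] += 1
        elif label == "2":
            hap2_out.write(name + "\n")
            counts["hap2"] += 1
        else:
            # reads without a confident assignment go to neither list
            counts["unassigned"] += 1
    return counts

if __name__ == "__main__":
    demo = ["readA\t1\n", "readB\t2\n", "readC\tNA\n"]
    hap1, hap2 = io.StringIO(), io.StringIO()
    print(split_reads_by_haplotype(demo, hap1, hap2))
    print(hap1.getvalue(), hap2.getvalue(), sep="")
```

The two resulting files would then be passed to '-3' and '-4' respectively. Reads without a confident assignment are left out of both lists in this sketch.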

zhaotao1987 commented 4 years ago

@chhylp123 Thanks very much for the reply! I see, some contigs might be switched. By the way, p_ctg is one full haplotype set but a_ctg is not; I think it contains only the alternate regions. How can I add the collapsed consensus regions to a_ctg as well, so that I obtain a full alternate haplotype?

chhylp123 commented 4 years ago

It is hard. If you'd like to do it manually, probably: 1) run hifiasm with -l0, 2) find the unitigs in r_utg whose coverage matches the homozygous coverage, 3) add these unitigs to a_ctg.
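
Step 2 above could be sketched roughly as follows in Python. This is not an official hifiasm procedure. It assumes the r_utg GFA stores each unitig on an `S` line with the read coverage in an `rd:i:` tag, and that you supply the homozygous coverage yourself (for example, from the homozygous peak in your own data). The function names and the 25% tolerance window are hypothetical choices for illustration.

```python
import io

def homozygous_unitigs(gfa_lines, hom_cov, tolerance=0.25):
    """Yield (name, sequence, coverage) for unitigs near the homozygous coverage.

    gfa_lines: iterable of GFA text lines (e.g. an open r_utg file).
    hom_cov:   expected homozygous read coverage, supplied by the user.
    tolerance: accepted relative deviation from hom_cov (assumed value).
    """
    low = hom_cov * (1 - tolerance)
    high = hom_cov * (1 + tolerance)
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # only segment lines carry unitig sequences
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        cov = None
        for tag in fields[3:]:
            if tag.startswith("rd:i:"):  # assumed read-coverage tag
                cov = int(tag[5:])
        if cov is not None and low <= cov <= high:
            yield name, seq, cov

def write_fasta(records, out):
    """Write selected unitigs as FASTA and return how many were written."""
    n = 0
    for name, seq, cov in records:
        out.write(f">{name} rd={cov}\n{seq}\n")
        n += 1
    return n

if __name__ == "__main__":
    demo = [
        "S\tutg1\tACGT\tLN:i:4\trd:i:30\n",
        "S\tutg2\tGGCC\tLN:i:4\trd:i:15\n",
        "L\tutg1\t+\tutg2\t+\t0M\n",
    ]
    out = io.StringIO()
    write_fasta(homozygous_unitigs(demo, hom_cov=30), out)
    print(out.getvalue(), end="")
```

The resulting FASTA could then be concatenated with the a_ctg sequences for step 3. Coverage alone is a crude filter, so the selected unitigs are worth checking by hand before use.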