c-zhou / yahs

Yet another Hi-C scaffolding tool
MIT License

phased haplotype scaffolding #10

Closed m-jahani closed 2 years ago

m-jahani commented 2 years ago

Hi

I am trying to scaffold two separate phased haplotype assemblies of the same diploid plant genome (assembled with hifiasm). I am using the same Hi-C data to scaffold each haplotype assembly. However, the scaffolding results for the two haplotypes differ substantially.

There are likely rearrangements between the two homologous chromosomes in a heterozygous genome, so mapping all Hi-C data to each assembly could confuse that signal. Therefore, I wonder whether there is a way to extract haplotype-specific Hi-C reads and run the scaffolder on each haplotype with only its own reads (rather than all reads).

Any feedback you can give me on this would be greatly appreciated.

Thanks, Mojtaba

c-zhou commented 2 years ago

Hi Mojtaba,

Thanks for using yahs. I do not really have any experience with this Hi-C binning problem. The first thing that comes to mind is to map the Hi-C reads to both haplotypes simultaneously. You should see some read pairs that map uniquely to one haplotype. I guess you can also pull out those read pairs in which only one read is uniquely mapped. How well this works really depends on how divergent the two haplotypes are.
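A minimal sketch of that mapping-based idea, not a tested pipeline. It assumes the Hi-C reads were aligned to a single reference made by concatenating both haplotype assemblies, that contig names carry the hypothetical prefixes `h1_` and `h2_`, and that a MAPQ cutoff (10 here, chosen arbitrarily) is an acceptable proxy for "uniquely mapped". It reads plain SAM text lines and labels each read pair:

```python
from collections import defaultdict

# Hypothetical contig-name prefixes; adjust to however the two
# assemblies were renamed before concatenation.
DEFAULT_PREFIXES = {"hap1": "h1_", "hap2": "h2_"}

def haplotype_of(contig, prefixes):
    """Return the haplotype label whose prefix matches the contig name."""
    for hap, prefix in prefixes.items():
        if contig.startswith(prefix):
            return hap
    return None

def bin_pairs(sam_lines, prefixes=None, min_mapq=10):
    """Label each read pair 'hap1', 'hap2' or 'ambiguous'.

    A pair is assigned to a haplotype when all of its confidently
    mapped primary alignments fall on that haplotype. This includes
    pairs in which only one read is confidently mapped.
    """
    prefixes = prefixes or DEFAULT_PREFIXES
    seen = set()
    votes = defaultdict(set)
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a valid alignment record
            continue
        name, flag = fields[0], int(fields[1])
        contig, mapq = fields[2], int(fields[4])
        seen.add(name)
        # Skip unmapped (0x4), secondary (0x100), supplementary (0x800).
        if flag & (0x4 | 0x100 | 0x800):
            continue
        if mapq < min_mapq:               # multi-mapped or low confidence
            continue
        hap = haplotype_of(contig, prefixes)
        if hap is not None:
            votes[name].add(hap)
    labels = {}
    for name in seen:
        haps = votes.get(name, set())
        labels[name] = next(iter(haps)) if len(haps) == 1 else "ambiguous"
    return labels
```

The read names labelled `hap1` or `hap2` could then be used to subset the original read files before mapping to each haplotype separately. Pairs whose two reads land confidently on different haplotypes are left as `ambiguous`, since they may be trans contacts or mapping artefacts.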

Another idea is k-mer binning, borrowing the trio-binning idea from hifiasm or HiCanu. You can build two haplotype-specific k-mer databases and use them to split the Hi-C reads. I know HiCanu has a splitHaplotype program that takes meryl databases as input. For how to build haplotype-specific meryl databases, see https://github.com/marbl/merqury for some ideas.
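To illustrate the k-mer binning idea, here is a toy, in-memory sketch. It is not how meryl or splitHaplotype work internally, and plain Python sets will not scale to a real genome. The function names and the simple majority rule are assumptions made for illustration. Haplotype-specific k-mers are those present in one assembly but not the other, and a read pair goes to whichever haplotype it shares more specific k-mers with:

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical_kmers(seq, k):
    """Set of canonical k-mers (the smaller of a k-mer and its reverse complement)."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):      # skip k-mers containing N etc.
            kmers.add(min(kmer, revcomp(kmer)))
    return kmers

def specific_kmers(hap1_seqs, hap2_seqs, k):
    """Return (k-mers only in haplotype 1, k-mers only in haplotype 2)."""
    k1 = set().union(*(canonical_kmers(s, k) for s in hap1_seqs))
    k2 = set().union(*(canonical_kmers(s, k) for s in hap2_seqs))
    return k1 - k2, k2 - k1

def classify_pair(read1, read2, only1, only2, k):
    """Assign a read pair by counting haplotype-specific k-mer hits."""
    kmers = canonical_kmers(read1, k) | canonical_kmers(read2, k)
    hits1 = len(kmers & only1)
    hits2 = len(kmers & only2)
    if hits1 > hits2:
        return "hap1"
    if hits2 > hits1:
        return "hap2"
    return "ambiguous"                    # no specific k-mers, or a tie

# Toy example: two haplotypes that differ by a single SNP.
hap1 = "ACGTTGCATGCCAGTAGGCT"
hap2 = "ACGTTGCATGACAGTAGGCT"
only1, only2 = specific_kmers([hap1], [hap2], k=5)
label = classify_pair("CATGCCAGTA", "ACGTTGCA", only1, only2, k=5)
```

Here `label` is `"hap1"` because the first read spans the SNP with the haplotype 1 allele. Pairs that fall entirely in sequence shared by both haplotypes carry no specific k-mers and stay `ambiguous`, which is why the yield of this approach depends on how divergent the haplotypes are.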

Best, Chenxi