Homozygous diploid trio assembly assigning largely ambiguous nodes to only one haplotype

ASLeonard commented 11 months ago

Hi,

I'm assembling a 3 Gb diploid mammal, but we know it is likely to be quite homozygous. This is confirmed by the k-mer peak, and hifiasm "correctly" identifies the peaks as [M::ha_pt_gen] peak_hom: 46; peak_het: 24.

However, I noticed the haplotype-resolved assemblies (using parental k-mers from yak) are quite unbalanced at times. Below is the dip.p_utg.gfa graph. For example, there are 2 large nodes (yellow and green) that are given to both hap1 and hap2. These nodes have similar numbers of reads assigned as maternal (m) and paternal (p), or are entirely ambiguous (a). However, the blue and red nodes are only present in hap2/maternal and almost completely missing from hap1/paternal, despite being the single entry/exit nodes. The yak k-mers suggest these nodes are slightly more maternal than paternal, but they are overwhelmingly ambiguous and yet are not assigned to both haplotypes. Is this the expected behaviour, or should the blue and red nodes (with ambiguous reads >> maternal reads) be assigned to both haplotypes?

I'm not sure if changing --hom-cov would help since the peak is at 46, and since it is a trio -s/-l aren't on by default.
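The per-node read labels described above can be tallied directly from the unitig graph. The following is a minimal sketch, assuming the hifiasm GFA layout in which each A-line lists a unitig name in the second column and carries an HG:A tag of a (ambiguous), p (paternal) or m (maternal). The function names are hypothetical and not part of hifiasm.

```python
from collections import Counter, defaultdict


def tally_read_labels(gfa_lines):
    """Count HG:A read labels (a/p/m) per unitig from GFA A-lines.

    Assumes A-lines look like:
    A <utg> <utg_start> <strand> <read> <read_start> <read_end> id:i:N HG:A:x
    """
    counts = defaultdict(Counter)
    for line in gfa_lines:
        if not line.startswith("A\t"):
            continue  # skip S-lines, L-lines and anything else
        fields = line.rstrip("\n").split("\t")
        label = "?"  # used when no HG tag is present on the line
        for tag in fields[2:]:
            if tag.startswith("HG:A:"):
                label = tag[len("HG:A:"):]
        counts[fields[1]][label] += 1
    return counts


def ambiguous_fraction(label_counts):
    """Fraction of a unitig's reads labelled 'a' (not binnable)."""
    total = sum(label_counts.values())
    return label_counts["a"] / total if total else 0.0


# Example usage (the file name is a placeholder):
# with open("sample.dip.p_utg.gfa") as fh:
#     for utg, c in tally_read_labels(fh).items():
#         print(utg, dict(c), round(ambiguous_fraction(c), 3))
```

A node whose reads are almost all labelled a, with only a slight excess of m over p, is the situation described for the blue and red nodes.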

chhylp123 commented 11 months ago

What's the coverage of these nodes? Are they het or hom?

ASLeonard commented 11 months ago

They are fairly hom. The larger nodes have depth between 37 and 41, so I definitely would not expect only one haplotype (hap2 in this case) to get those sequences. For reference, the red, blue, green, and orange nodes are roughly 8.3, 5.9, 3.0, and 3.3 Mb, so hap1 missing the red and blue nodes means the chromosome is missing about 14 out of 65 Mb.
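The depth and size figures above can be read off the S-lines of the graph. This is a minimal sketch, assuming hifiasm writes segment length and read depth as LN:i and rd:i tags. The "closer to the hom peak than the het peak" rule is an illustrative heuristic using the peak values from the log (46 and 24), not hifiasm's own classification, and all function names are hypothetical.

```python
def segment_stats(gfa_lines):
    """Return {segment_name: (length_bp, read_depth)} from GFA S-lines."""
    stats = {}
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        # Fall back to the sequence length when no LN tag is given.
        length = None if seq == "*" else len(seq)
        depth = None
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                length = int(tag[5:])
            elif tag.startswith("rd:i:"):
                depth = int(tag[5:])
        stats[name] = (length, depth)
    return stats


def closer_to_hom(depth, peak_hom, peak_het):
    """Heuristic: is the depth nearer the homozygous than the het peak?"""
    return abs(depth - peak_hom) < abs(depth - peak_het)


def missing_fraction(missing_mb, chromosome_mb):
    """Share of a chromosome accounted for by the missing nodes."""
    return sum(missing_mb) / chromosome_mb


# Example usage with the numbers quoted in the thread:
# closer_to_hom(37, peak_hom=46, peak_het=24)
# missing_fraction([8.3, 5.9], 65)
```

With the quoted sizes, the red and blue nodes together come to about 14.2 Mb, or roughly 22% of the 65 Mb chromosome, and depths of 37 to 41 sit nearer the hom peak than the het peak.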

ASLeonard commented 11 months ago

I tried setting --hom-cov 34, since the depth of those nodes is below the originally estimated peak of 46, but the results were still roughly the same, with some "homozygous" nodes clearly not assigned to both haplotypes.

chhylp123 commented 11 months ago

Sorry, I missed your last reply. Could you please try --trio-dual? In this mode, hifiasm will jointly consider both homology and trio information. It seems the trio information is not very reliable for your sample.

ASLeonard commented 11 months ago

I tried with --trio-dual and the results are quite similar, with large regions still missing from one haplotype. I don't think it is straightforward to follow how the dip.p_utg.gfa gets split into the two haplotype graphs, but is something like this line in rcut responsible for assigning tigs? I still just don't see how a single-entry single-exit node can eventually end up in only one haplotype.
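One way to trace where a unitig ends up, without following the internals of hifiasm, is to compare read names between the unitig graph and each haplotype contig graph. The sketch below is a post hoc diagnostic only and says nothing about how hifiasm itself assigns tigs. It assumes the A-lines carry the unitig or contig name in the second column and the read name in the fifth, and the function names are hypothetical.

```python
from collections import defaultdict


def reads_by_sequence(gfa_lines):
    """Map each unitig/contig name to the set of read names on its A-lines."""
    reads = defaultdict(set)
    for line in gfa_lines:
        if line.startswith("A\t"):
            fields = line.rstrip("\n").split("\t")
            reads[fields[1]].add(fields[4])
    return reads


def all_reads(gfa_lines):
    """Set of every read name that appears on any A-line of a graph."""
    reads = set()
    for names in reads_by_sequence(gfa_lines).values():
        reads |= names
    return reads


def haplotype_presence(utg_reads, hap_reads):
    """For each unitig, the fraction of its reads found in each haplotype.

    utg_reads: output of reads_by_sequence() for the unitig graph.
    hap_reads: {"hap1": set_of_reads, "hap2": set_of_reads}.
    """
    return {
        utg: {hap: len(reads & present) / len(reads)
              for hap, present in hap_reads.items()}
        for utg, reads in utg_reads.items()
    }


# Example usage (file names are placeholders):
# with open("sample.dip.p_utg.gfa") as fh:
#     utg = reads_by_sequence(fh)
# haps = {}
# for hap in ("hap1", "hap2"):
#     with open(f"sample.dip.{hap}.p_ctg.gfa") as fh:
#         haps[hap] = all_reads(fh)
# for name, frac in haplotype_presence(utg, haps).items():
#     print(name, frac)
```

A unitig with mostly ambiguous reads that scores near 1.0 for one haplotype and near 0.0 for the other would match the behaviour reported here.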

chhylp123 commented 11 months ago

I think the main issue here is that the trio-binning seems to have a problem. Could you please try without trio-binning? If the dual assembly has two balanced haplotypes, I will add a new mode to fix this issue. Thanks so much!

ASLeonard commented 11 months ago

This is a summary of reference-scaffolded chromosomes for some of the different modes discussed. The dual assemblies were taken from <sample>.bp.hap1.p_ctg.gfa etc. There is still some extreme variation even in the dual mode. Chr 17 was the initial example I gave, and is 80 vs 50 Mb for dual1 and dual2.

Here is the entire chromosome 17 from the <sample>.dip.p_utg.gfa coloured by depth. Except for the two ends in the middle (4 and 9 Mb respectively), there don't appear to be any bubbles that could lead to large size discrepancies between the haplotypes.

chhylp123 commented 11 months ago

I see. Then would you mind sharing the bin files with me? With the data I could figure out what really happened with your sample.

ASLeonard commented 11 months ago

The total files are ~60 Gb so above the hosting limit I have. I've uploaded the asm.ec.bin for now, and once you confirm you have downloaded that I'll remove that and add the other bin files.

chklopp commented 10 months ago

Hi,

Thank you for hifiasm, great tool! I'm having a similar problem with a bovine genome assembly in which 200 Mb are missing from haplotype 1, and the dot plot against the reference shows missing portions in many chromosomes. Did you add a new mode to fix this issue?

chhylp123/hifiasm issue #564