Open ASLeonard opened 11 months ago
What's the coverage of these nodes? Are they het or hom?
They are fairly hom. The larger nodes have depth between 37 and 41, so I definitely would not expect only one haplotype (hap2 in this case) to get those sequences. For reference, the red, blue, green, and orange nodes are roughly 8.3, 5.9, 3.0, and 3.3 Mb, so hap1 missing the red and blue nodes means the chromosome is missing 14 out of 65 Mb.
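As a quick check of that arithmetic, here is a minimal sketch using only the approximate sizes quoted in this comment. The variable names are purely illustrative.

```python
# Approximate node sizes in Mb, as quoted in the comment above.
node_mb = {"red": 8.3, "blue": 5.9, "green": 3.0, "orange": 3.3}
chrom_mb = 65.0  # approximate chromosome size quoted above

# Sequence absent from the haplotype lacking the red and blue nodes.
missing_mb = node_mb["red"] + node_mb["blue"]
missing_frac = missing_mb / chrom_mb

print(f"missing {missing_mb:.1f} Mb ({missing_frac:.0%} of the chromosome)")
# prints: missing 14.2 Mb (22% of the chromosome)
```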
I tried setting --hom-cov 34, since those nodes were covered under the originally estimated peak of 46, but the results were still roughly the same, with some "homozygous" nodes clearly not assigned.
Sorry I missed your last reply. Could you please try --trio-dual? In this mode, hifiasm will jointly consider both homologous and trio information. It seems the trio information is not very reliable for your sample.
I tried --trio-dual and the results are quite similar, with large regions still missing from one haplotype. I don't find it straightforward to follow how the dip.p_utg.gfa gets split into the two haplotype graphs, but is something like this line in rcut responsible for assigning tigs? I still just don't see how a single-entry single-exit node can eventually end up in only one haplotype.
I think the main issue here is that the trio-binning seems to have a problem. Could you please try without trio-binning? If the dual assembly has two balanced haplotypes, I will add a new mode to fix this issue. Thanks so much!
This is a summary of reference-scaffolded chromosomes for some of the different modes discussed. The dual assemblies were taken from <sample>.bp.hap1.p_ctg.gfa etc. There is still some extreme variation even in the dual mode. Chr 17 was the initial example I gave, and is 80 vs 50 Mb for dual1 and dual2.
Here is the entire chromosome 17 from the <sample>.dip.p_utg.gfa coloured by depth. Except for the two ends in the middle (4 and 9 Mb respectively), there don't appear to be any bubbles that could lead to large size discrepancies between the haplotypes.
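For anyone wanting to check node lengths and depths without a graph viewer, here is a rough sketch that reads the segment (S) lines of a GFA file. It assumes the length is stored in an LN:i: tag and the read depth in an rd:i: tag; that tag layout is an assumption about the file, so check it against your own GFA. The function name and the toy records are made up for illustration and are not real data from this assembly.

```python
import io

def segment_stats(gfa_handle):
    """Yield (name, length, depth) for each S-line in a GFA stream.

    Assumes optional tags of the form TAG:TYPE:VALUE, with LN holding
    the segment length and rd holding the read depth (an assumption).
    """
    for line in gfa_handle:
        if not line.startswith("S\t"):
            continue  # skip links, headers, and any other record types
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        tags = {}
        for field in fields[3:]:
            parts = field.split(":", 2)
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        if "LN" in tags:
            length = int(tags["LN"])
        else:
            # Fall back to the sequence itself; "*" means no sequence stored.
            length = 0 if seq == "*" else len(seq)
        depth = int(tags["rd"]) if "rd" in tags else None
        yield name, length, depth

# Toy input (hypothetical records, not taken from the real graph).
toy_gfa = io.StringIO(
    "S\tutg000001l\t*\tLN:i:8300000\trd:i:39\n"
    "L\tutg000001l\t+\tutg000002l\t+\t0M\n"
    "S\tutg000002l\tACGT\n"
)
stats = list(segment_stats(toy_gfa))
for name, length, depth in stats:
    print(name, length, depth)
```

On a real file, replace the io.StringIO object with open("<sample>.dip.p_utg.gfa") and filter for the nodes of interest.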
I see. Then would you mind sharing the bin files with me? With the data I could figure out what really happened for your sample.
The total files are ~60 Gb so above the hosting limit I have. I've uploaded the asm.ec.bin for now, and once you confirm you have downloaded that I'll remove that and add the other bin files.
Hi,
Thank you for hifiasm, great tool! I'm having a similar problem with a bovine genome assembly in which 200 Mb are missing from haplotype 1, and the dot plot against the reference shows missing portions in many chromosomes. Did you add a new mode to fix this issue?
Hi,
I'm assembling a 3 Gb diploid mammal, but we know it is likely to be quite homozygous. This is confirmed by the k-mer peak, and hifiasm "correctly" identifies the peaks as [M::ha_pt_gen] peak_hom: 46; peak_het: 24. However, I noticed the haplotype-resolved assemblies (using parental k-mers from yak) are quite unbalanced at times. Below is the dip.p_utg.gfa graph. There are for example 2 large nodes (yellow and green) that are given to both hap1 and hap2. These nodes have similar numbers of maternal (m) and paternal (p) assigned reads, or are all ambiguous (a). However, the blue and red nodes are only present in hap2/maternal and almost completely missing from hap1/paternal, despite the fact that they are the single entry/exit nodes. The yak k-mers suggest these nodes are slightly more maternal than paternal, but they are clearly overwhelmingly ambiguous, yet they are not assigned to both haplotypes. Is this the expected behaviour, or should the blue and red nodes (with ambiguous reads >> maternal reads) be assigned to both haplotypes?
I'm not sure if changing --hom-cov would help since the peak is at 46, and since it is a trio, -s/-l aren't on by default.