phase block number compared to Whatshap

MichelMoser commented 2 years ago

Hello, thank you for the great tool!

I was just testing HapDup v0.7 on our fish genome. Comparing the output with phasing done with WhatsHap (WH), I wondered why there is such a big difference in phased block size and block number between HapDup and the WH pipeline?

For the fish chromosomes, WH was generating 679 blocks using 2'689'114 phased SNPs. Margin (HapDup pipeline) was generating 5352 blocks using 3'862'108 phased SNPs.

The main difference seems to be the prior read filtering and usage of MarginPhase for the phasing in HapDup, but does this explain such a big difference?

I was wondering if phase blocks of HapDup could be concatenated using whatshap SNP and block information to increase continuity? I imagine it would be a straightforward approach overlapping SNP positions between Margin and WH with phase block ids and lift-over phase ids from WH. I will do some visual inspections and scripting to test if there is overlap of called SNPs and agreement on block boarders.

Cheers, Michel

mikolmogorov commented 2 years ago

Hi @MichelMoser

It's also important to compare total block length. + length distribution (e.g. N50). Simply a number of blocks may be misleading, as there may be a lot of short blocks.

What is your input for WhastHap? If you see examples of long WH phased blocks that are fragmented in Hapdup (Margin), I'll be happy to take a look.

Mikhail

MichelMoser commented 2 years ago

Hi @fenderglass ,

Thank you for the reply. Since this is not a real software issue, i am happy to move it somewhere else if needed. Here a few more informative numbers. Margin looks clearly more fragmented. Of course i would be tempted to replace Margin with WH, but not sure if this would break the whole pipeline :)

Input for WhatsHap was the same alignments as for HapDup. SNPs were called using whatshap find_snv_candidates with default params.

I will make a graphical overview of fragment distribution over the chromosomes and inspect some fragments which got merged in WH with IGV.

chroms   blocks  variant_per_block_median        bp_per_block_median     bp_per_block_avg        bp_per_block_max    bp_per_block_sum        block_n50

WhatsHap
ssa01     43      635                             494638             4038662.6511627906             36805625       173662494                   21114821
ssa02     43      1242                            774167             2203630.906976744              27378881        94756129                    9853599
ssa03     28      663.0                          1009984             3757631.714285714              48793029       105213688                   10416786
ssa04     32      1605.5                          748402             2816167.65625                  14465786        90117365                    5158522

Margin
ssa01   276        995.8152173913044             206565               611797.6340579711              6798435       168856147                 1710363
ssa02   281        510.117437722419               95564               316866.23131672596             5680168        89039411                 1197005
ssa03   226        730.0442477876106             147522               449200.9646017699              9219172       101519418                 1218426
ssa04   179        803.608938547486              120437               485092.16759776534             4853353        86831498                 1796523

mikolmogorov commented 2 years ago

Sorry for the late response and thanks for the info! Do you have heterozygosity rate estimates for these genomes? Based on your table, it seems to be around 0.001 (I'm dividing variant_per_block_median over bp_per_block_median).

0.001 is very similar to the human genome, and for human datasets with read coverage ~30x and N50 ~30kb, we typically get phased block at about 1Mb, so similar to your Margin numbers. WhatsHap's numbers are kinda too good to be true if you have similar ONT protocol. But if you have ultra-long reads, you can definitely achieve phased block N50 of around 20Mb.

So overall hard to tell exactly what is going on without some kind of ground truth or looking at the raw data..

KolmogorovLab / hapdup

phase block number compared to Whatshap #17