Open MichelMoser opened 2 years ago
Hi @MichelMoser
It's also important to compare total block length. + length distribution (e.g. N50). Simply a number of blocks may be misleading, as there may be a lot of short blocks.
What is your input for WhastHap? If you see examples of long WH phased blocks that are fragmented in Hapdup (Margin), I'll be happy to take a look.
Mikhail
Hi @fenderglass ,
Thank you for the reply. Since this is not a real software issue, i am happy to move it somewhere else if needed. Here a few more informative numbers. Margin looks clearly more fragmented. Of course i would be tempted to replace Margin with WH, but not sure if this would break the whole pipeline :)
Input for WhatsHap was the same alignments as for HapDup. SNPs were called using whatshap find_snv_candidates with default params.
I will make a graphical overview of fragment distribution over the chromosomes and inspect some fragments which got merged in WH with IGV.
chroms blocks variant_per_block_median bp_per_block_median bp_per_block_avg bp_per_block_max bp_per_block_sum block_n50
WhatsHap
ssa01 43 635 494638 4038662.6511627906 36805625 173662494 21114821
ssa02 43 1242 774167 2203630.906976744 27378881 94756129 9853599
ssa03 28 663.0 1009984 3757631.714285714 48793029 105213688 10416786
ssa04 32 1605.5 748402 2816167.65625 14465786 90117365 5158522
Margin
ssa01 276 995.8152173913044 206565 611797.6340579711 6798435 168856147 1710363
ssa02 281 510.117437722419 95564 316866.23131672596 5680168 89039411 1197005
ssa03 226 730.0442477876106 147522 449200.9646017699 9219172 101519418 1218426
ssa04 179 803.608938547486 120437 485092.16759776534 4853353 86831498 1796523
Sorry for the late response and thanks for the info! Do you have heterozygosity rate estimates for these genomes? Based on your table, it seems to be around 0.001 (I'm dividing variant_per_block_median
over bp_per_block_median
).
0.001 is very similar to the human genome, and for human datasets with read coverage ~30x and N50 ~30kb, we typically get phased block at about 1Mb, so similar to your Margin numbers. WhatsHap's numbers are kinda too good to be true if you have similar ONT protocol. But if you have ultra-long reads, you can definitely achieve phased block N50 of around 20Mb.
So overall hard to tell exactly what is going on without some kind of ground truth or looking at the raw data..
Hello, thank you for the great tool!
I was just testing HapDup v0.7 on our fish genome. Comparing the output with phasing done with WhatsHap (WH), I wondered why there is such a big difference in phased block size and block number between HapDup and the WH pipeline?
For the fish chromosomes, WH was generating 679 blocks using 2'689'114 phased SNPs. Margin (HapDup pipeline) was generating 5352 blocks using 3'862'108 phased SNPs.
The main difference seems to be the prior read filtering and usage of MarginPhase for the phasing in HapDup, but does this explain such a big difference?
I was wondering if phase blocks of HapDup could be concatenated using whatshap SNP and block information to increase continuity? I imagine it would be a straightforward approach overlapping SNP positions between Margin and WH with phase block ids and lift-over phase ids from WH. I will do some visual inspections and scripting to test if there is overlap of called SNPs and agreement on block boarders.
Cheers, Michel