evotools / hapbin

Efficient program for calculating Extended Haplotype Homozygosity (EHH) and Integrated Haplotype Score (iHS)
GNU General Public License v3.0
41 stars 17 forks source link

Comparing hapbin to selscan, hapbin does not seem to print rows when the derived allele frequency is < 0.5 #22

Open stephenrong opened 8 years ago

stephenrong commented 8 years ago

Hapbin: ihsbin --hap test_model_9_T.hap1 --map test_model_9.pos --out hapbin_test --bin 1.0 --scale 20000

300 -0.431015 361 -0.119395 418 -0.194913 428 -0.157514 513 0.313023 532 -0.367884 567 -0.0234077 587 0.484128 642 -0.414727 644 -0.414727

Selscan: selscan --ihs --hap test_model_9.hap1 --map test_model_9.pos --out selscan_test --max-extend 0

300 47160 0.57 0.0135513 0.020853 -0.431015 361 58110 0.52 0.0224713 0.025321 -0.119395 413 67580 0.47 0.0231002 0.0144292 0.470588 418 68210 0.52 0.0160354 0.0194863 -0.194913 423 69135 0.48 0.0196922 0.015973 0.20932 426 69455 0.47 0.018845 0.0160986 0.157514 428 69635 0.53 0.0160986 0.018845 -0.157514 453 73110 0.39 0.0286909 0.0170902 0.518075 471 78110 0.35 0.0311701 0.0209462 0.397501 503 85455 0.47 0.0278055 0.0180986 0.429395 506 85655 0.47 0.0278055 0.0182261 0.42238 512 86340 0.42 0.0272794 0.0183248 0.397875 513 86470 0.5 0.0249246 0.0182257 0.313023 519 87430 0.39 0.024509 0.0152977 0.471343 532 88865 0.62 0.0199917 0.0288815 -0.367884 541 90400 0.35 0.0386135 0.0183309 0.745014 542 90480 0.35 0.0386135 0.0183309 0.745014 567 95705 0.65 0.0186017 0.0190422 -0.0234077 587 99155 0.6 0.023186 0.0152797 0.417022 617 106895 0.4 0.0247563 0.0208394 0.172231 642 109905 0.6 0.0206806 0.0319878 -0.436158 644 110460 0.6 0.0206806 0.0319878 -0.436158

Hapbin fails to print lines for which the third column (the DAF) is less than 0.5.

camaclean commented 8 years ago

This is caused by selscan and hapbin treating cutoffs slightly differently. This results in a handful of points at the edges of chromosomes which are included by selscan and not hapbin and vice versa. There are also a handful of points throughout the chromosome which will differ in value. There are things which can cause this:

Selscan reaches the end of its scan window and hapbin does not limit the scan range. Hapbin reaches the end of the chromosome, and those are discarded.

The other difference is cause by floating point imprecision. Because selscan and hapbin calculate the EHH differently and floating point numbers are inherently imprecise, they can be imprecise in different ways. With a cutoff at EHH <= 0.5, there can be inconsistencies if the EHH calculation actually results in ~0.5000000000000001 or ~0.4999999999999999 due to the nature of floats. Hapbin fixes this problem by checking for EHH <= cutoff + 1e-15. However, selscan doesn't account for floating point imprecision. As a result, it may reach the end of the chromosome if the EHH has decayed to 0.5 exactly but stores ~0.5000000000000001 and doesn't cut off when it should. This is what causes the handful of inconsistencies throughout the chromosome as well, since selscan doesn't always stop at 0.5.

stephenrong commented 8 years ago

Thanks for the response!

I'm still not getting why hapbin would be missing when the derived allele frequency of the core SNP at which iHS is being computed is less than 0.5 (different from the EHH cutoff of 0.05).

camaclean commented 8 years ago

It applies to any EHH cutoff and it results in selscan missing some values. When hapbin is missing values, it is because of the window reason.

prenderj commented 8 years ago

Hi Both

This sounds like it may be a different issue to what Colin is describing. Can I check you mean to set the frequency bin size to 1.0 with hapbin? The iHS test involves normalising scores according to allele frequency bins. By setting this to 1 you are in effect removing this normalistion, as setting bin size to 100%. This isnt something you would normally do so may be causing some odd behaviour. If are able to send some of your test data can check.

Cheers James

stephenrong commented 8 years ago

So I set that to 1.0 because I don't want normalized scores (I want to do my own normalization downstream). But now I see that by doing that, it still affects the unstandardized outputs in the output file without the .std extension. With the default values, I DO get the same results as selscan.

camaclean commented 8 years ago

Oh, sorry for my confusion. I see the problem, now. Something is getting messed up with the binning. James, does your Perl script version bin to the nearest frequency? That's what hapbin is doing and erroneously rejecting the "0.0" bin. However, sescan's binning algorithm seems to round up to the next highest bin. What I don't understand is that we tested the standardized output between selscan and hapbin and they gave the same results even though it appears the binning algorithm is different. Maybe I'm not seeing something.

As far as not erroneously rejecting results goes, I've fixed that, but I still need to do more investigation on the binning algorithm itself.

camaclean commented 8 years ago

For now, you can get the unnormalized data by running hapbin without setting --bin and using the non-.std output file.