evotools / hapbin

Efficient program for calculating Extended Haplotype Homozygosity (EHH) and Integrated Haplotype Score (iHS)
GNU General Public License v3.0
41 stars 18 forks source link

Potential bug in hapbin ihsbin #33

Open hardingnj opened 7 years ago

hardingnj commented 7 years ago

Hi there,

I wanted to verify I was getting correct answers from my hapbin iHS analysis, so I designed a minimal test case for iHS data.

Briefly, the dataset consists of 19 SNPs, where the haplotype structure is symmetrical around the 10th central SNP. This SNP has 5 ancestral allele haplotypes, which decay quickly by one haplotype per SNP. And 5 derived SNP haplotypes, which maintain their haplotype for 4 SNPs, before decaying at exactly the same rate as the ancestral.

The genetic distance is uniform, with 1 unit per SNP.

I manually calculated the iHS here to be log((4 + 4 + 1.5 + 1.5)/(1.5 + 1.5)) = 1.29928.

However, when I run this dataset through ihsbin, it fails to give me a value for this SNP.

The datasets are available from here: haps: https://www.dropbox.com/s/47vm2aq7ny7yj7v/haplotypes.hap?dl=0 snps: https://www.dropbox.com/s/czobf28d76ls0d0/snp_desc.map?dl=0

Selscan provided the expected answer of 1.29928 (following transposing the hap matrix 90 degrees). I have ran this using another tool (scikit-allel: http://scikit-allel.readthedocs.io/en/latest/stats/selection.html) and also get the expected result.

(transposed hap matrix for selscan) https://www.dropbox.com/s/4avbacu1qbz3lpj/haplotypes_selscan.hap?dl=0

The command line I used was:

./ihsbin \
  --hap haplotypes.hap \
  --map snp_desc.map \
  --out ihs_result \
  > ihs_result.log

and for selscan:

./selscan --ihs
  --hap haplotypes_selscan.hap \
  --map snp_desc.map \
  --out selscan_ihs

This may be related to #22, but something is amiss here I think? Regardless, this should provide a good test case.

Thanks

camaclean commented 7 years ago

Hi,

Remember to use --alt with selscan. Hapbin uses frequency squared, not binomial coefficients for EHH.

I did some digging and it looks like selscan has an off by one error for ehh. Using selscan, if you generate the ehh for rs8, you get the tree for the central locus, rs9.

ihsbin, isn't outputting the result because the standardized score is nan. However, it works out correctly with ehhbin:

rs1  2  1055 0.2  0.2   5*(1/5)^2=0.2                      5*(1/5)^2=0.2
rs2  3  1111 0.28 0.2   (2/5)^2+3*(1/5)^2=0.16+3*0.04=0.28 5*(1/5)^2=0.2
rs3  4  1166 0.44 0.2   (3/5)^2+2*(1/5)^2=0.36+2*0.04=0.44 5*(1/5)^2=0.2
rs4  5  1222 0.68 0.2   (4/5)^2+(1/5)^2=0.64+0.04=0.68     5*(1/5)^2=0.2
rs5  6  1277 1    0.2   (5/5)^2=1                          5*(1/5)^2=0.2
rs6  7  1333 1    0.28  (5/5)^2=1                          (2/5)^2+3*(1/5)^2=0.16+3*0.04=0.28
rs7  8  1388 1    0.44  (5/5)^2=1                          (3/5)^2+2*(1/5)^2=0.36+2*0.04=0.44
rs8  9  1444 1    0.68  (5/5)^2=1                          (4/5)^2+(1/5)^2=0.64+0.04=0.68
rs9  10 1500 1    1     (5/5)^2=1                          (5/5)^2=1
rs10 11 1555 1    0.68  (5/5)^2=1                          (4/5)^2+(1/5)^2=0.64+0.04=0.68
rs11 12 1611 1    0.44  (5/5)^2=1                          (3/5)^2+2*(1/5)^2=0.36+2*0.04=0.44
rs12 13 1666 1    0.28  (5/5)^2=1                          (2/5)^2+3*(1/5)^2=0.16+3*0.04=0.28
rs13 14 1722 1    0.2   (5/5)^2=1                          5*(1/5)^2=0.2
rs14 15 1777 0.68 0.2   (4/5)^2+(1/5)^2=0.64+0.04=0.68     5*(1/5)^2=0.2
rs15 16 1833 0.44 0.2   (3/5)^2+2*(1/5)^2=0.36+2*0.04=0.44 5*(1/5)^2=0.2
rs16 17 1888 0.28 0.2   (2/5)^2+3*(1/5)^2=0.16+3*0.04=0.28 5*(1/5)^2=0.2
rs17 18 1944 0.2  0.2   5*(1/5)^2=0.2                      5*(1/5)^2=0.2
iHS: 0.76214
MAF: 0.5

4*(1+1)/2 + (1+0.68)/2+(0.68+0.44)/2+(0.44+0.28)/2+(0.28+0.2)/2
4         + 0.84      +0.56         +0.36         +0.24
4+2
6

(1+0.68)/2+(0.68+0.44)/2+(0.44+0.28)/2+(0.28+0.2)/2 + 4*0.2
0.84      +0.56         +0.36         +0.24         + 0.8
2+0.8
2.8

ln((6+6)/(2.8+2.8))=0.76214
hardingnj commented 7 years ago

Thanks for the super quick reply- much appreciated. I wasn't aware of 2 different methods for calculating EHH. That would explain a lot!

EHH is the probability that two haplotypes drawn from the distribution are identical (at least from Sabeti). So according to the frequency squared method, you can never decay below EHH=0.2 with 5 core haplotypes? That seems a bit strange to me. I guess the difference is that the freq sq method has the idea of sampling with replacement, so then 0.2 would make sense. I would expect the 2 methods to be at their most different when the frequency of the core SNP is either low or high?

Additionally should ihsbin not report values even if no standardization possible (due to n=1)? I guess it's a moot point for analyses, but handy for test cases.

Thanks again

hardingnj commented 7 years ago

Hi,

Sorry to bug you again. I ran the same data using xpehhbin. Splitting the first 5 haplotypes (ancestral) and the second 5 haplotypes (derived) into popA and popB.

The xpehh score should match the ehh score for this SNP, ie 0.76214 as established above.

Running with: xpehhbin --hapApopA.hap --hapB popB.hap --map snp_desc.map --out result --minmaf 0.05 -c 0.05

gives:

Location        iHH_A1  iHH_B1  iHH_P1  XPEHH
rs0     6.2     3       3.75    0.725937
rs9     3.3     6.5     2.45    -0.67788
rs10    3.66    7.5     2.89    -0.71744
rs11    3.46    8.5     3.29    -0.898798
rs12    3.18    9.5     3.77    -1.09441
rs14    4.3     8.3     3.45    -0.65764
rs15    5.3     6.18    3.37    -0.153611
rs18    6.2     3       3.75    0.725937

This result just looks a bit odd to me, surely the data should be symmetrical on the SNPs reported? Additionally, the rs9 SNP doesn't agree with ihsbin. Should we even be getting any data here, as the ehh will never decay to 0.05? Unless you calculate EHH across all samples to determine the cutoff position?

I increased the cutoff, as I thought that might explain the difference observed with rs9.

xpehhbin --hapApopA.hap --hapB popB.hap --map snp_desc.map --out result --minmaf 0.05 -c 0.2

gives:

Location        iHH_A1  iHH_B1  iHH_P1  XPEHH
rs0     5.76    2.6     3.5     0.795426
rs1     4.76    1.54    2.535   1.12847
rs7     4.4     11.16   4.69    -0.930731
rs8     4.74    11.06   4.38    -0.847298
rs9     4.8     10.8    3.9     -0.81093
rs10    4.48    10.8    4.25    -0.879923
rs11    4.4     11.16   4.69    -0.930731
rs17    4.76    1.54    2.535   1.12847
rs18    5.76    2.6     3.5     0.795426

This looks more sensible, but rs8 and rs10 have slightly different values, when they should be the same?

Also, the rs9 value is a bit off when compared to the ehh value; any idea of the reason for this?

camaclean commented 7 years ago

Sorry it took me so long to get back to you. I was on vacation, then forgot for a couple days. This should be fixed in Pull #34.

I also added binomial coefficient support. Just use --binom. Keep in mind that for this test the combined-population EHH falls to 0.0222222 before it goes to 0.0, so -c 0 is necessary to go all the way.

hardingnj commented 7 years ago

Much appreciated- thanks!