arq5x / bedtools2

bedtools - the swiss army knife for genome arithmetic
MIT License

Fisher reports wrong results on some edge cases (overflow?) - #2 #855

Open michael-imbeault opened 4 years ago

michael-imbeault commented 4 years ago

A while ago I reported an issue where in some edge cases the fisher functionality of bedtools was reporting erroneous results - https://github.com/arq5x/bedtools2/issues/343

It was fixed, but we have now detected another case of the same behavior. Basically, extremely significant results get assigned 1.0 as a p-value, instead of a very small one - confirmed by using various other ways of calculating the fisher exact test p-value (such as scipy.stats.fisher_exact) - and also using common sense.

As before, it only occurs when some of the terms in the contingency table are very large. For example:

 Contingency Table Of Counts
_________________________________________
           |  in -b       | not in -b    |
     in -a | 333          | 381          |
 not in -a | 801722       | 7664285      |
_________________________________________
 p-values for fisher's exact test
left    right   two-tail    ratio
1   2.802e-145  2.802e-145  8.355

Scipy results:
pvalue, less: 1.0
pvalue, greater: 2.8020434544673533e-145
pvalue, two-sided: 2.802043454467353e-145

This is correct. However:

 Contingency Table Of Counts
_________________________________________
           |  in -b       | not in -b    |
     in -a | 4155         | 4903         |
 not in -a | 805463       | 8507517      |
_________________________________________
 p-values for fisher's exact test
left    right   two-tail    ratio
0   1   0   8.951

Scipy results:
pvalue, less: 1.0
pvalue, greater: 2.8020434544673533e-145
pvalue, two-sided: 2.802043454467353e-145

This is obviously wrong. The enrichment ratios calculated by both tools agree in all cases, so that part is fine - the problem only affects the p-value calculations.

This is a very significant problem, as real hits get reported as negatives. At this point we no longer trust bedtools fisher and always re-run scipy fisher_exact to make sure the results are correct. I think bedtools should have unit tests certifying that this function behaves correctly, and the current implementation logic seems too fragile / susceptible to under- or overflows.
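For what it's worth, the reference numbers for the first table can also be reproduced without scipy. The following is a minimal pure-Python sketch (illustrative only; the helper names fisher_tails and lbinom here are made up for this example and are not bedtools code) that sums the hypergeometric terms directly:

```python
# Minimal pure-Python cross-check of the first contingency table.
# Illustrative sketch only, not bedtools code.
from math import lgamma, exp

def lbinom(n, k):
    # log of the binomial coefficient C(n, k); lgamma(n + 1) == log(n!)
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_tails(n11, n12, n21, n22):
    # left/right tail p-values by direct summation of the
    # hypergeometric pmf over the support of n11
    n1_, n_1 = n11 + n12, n11 + n21
    n = n11 + n12 + n21 + n22
    denom = lbinom(n, n_1)

    def pmf(k):
        return exp(lbinom(n1_, k) + lbinom(n - n1_, n_1 - k) - denom)

    lo, hi = max(0, n_1 - (n - n1_)), min(n1_, n_1)
    left = sum(pmf(k) for k in range(lo, n11 + 1))
    right = sum(pmf(k) for k in range(n11, hi + 1))
    return left, right

left, right = fisher_tails(333, 381, 801722, 7664285)
# left comes out ~1.0 and right ~2.8e-145, matching scipy and the
# correct bedtools output quoted above.
```

Note that this naive version still calls exp() on each term, so it is not itself underflow-proof for even more extreme tables; it is only a cross-check.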

joshwilding4444 commented 4 years ago

@michael-imbeault Do you have files with which I can replicate this issue? In the meantime, I will write some unit tests based on these tables and let you know the results.

joshwilding4444 commented 4 years ago

Looking at the source code, it seems the offending function is: kt_fisher_exact(long long n11, long long n12, long long n21, long long n22, double *_left, double *_right, double *two) declared in src/fisher/kfunc.h and implemented in src/fisher/kfunc.cpp.

The following is a quick unit test I wrote. Any suggestions or ideas on where the kt_fisher_exact function is failing are welcome. I will post again if I can see any problems or figure out an appropriate correction.

#include "kfunc.h"
#include <cmath>    // std::isnan
#include <cstdio>   // printf
#include <iostream>

/*
 * Unit test for kfunc.h
 */

bool kt_fisher_exact_unit_test() {
  // very significant test values. Should give a result
  // of left p-value = 1.0, and right and two-tailed
  // p-values close to 0.
  long long inAinB = 4155;
  long long inAnotB = 4903;
  long long notAinB = 805463;
  long long notAnotB = 8507517;
  double left = 0.0;
  double right = 0.0;
  double two = 0.0;
  kt_fisher_exact(inAinB, inAnotB, notAinB,
                  notAnotB, &left, &right, &two);
  std::cout << "# Contingency Table Of Counts" << std::endl;
  printf("#_________________________________________\n");
  printf("#           | %-12s | %-12s |\n", " in -b", "not in -b");
  printf("#     in -a | %-12lld | %-12lld |\n", inAinB, inAnotB);
  printf("# not in -a | %-12lld | %-12lld |\n", notAinB, notAnotB);
  printf("#_________________________________________\n");
  double ratio = ((double)inAinB / (double)inAnotB) / ((double)notAinB / (double)notAnotB);
  printf("# p-values for fisher's exact test\n");
  printf("left\tright\ttwo-tail\tratio\n");
  if(std::isnan(ratio)) {
      printf("%.5g\t%.5g\t%.5g\t-nan\n", left, right, two);
  } else {
      printf("%.5g\t%.5g\t%.5g\t%.3f\n", left, right, two, ratio);
  }

  bool leftResult = false;
  bool rightResult = false;
  bool twoTailResult = false;
  //return true if and only if all 3 values are in the correct range.
  if(left == 1) {
    leftResult = true;
  }

  if(right < 0.5) {
    rightResult = true;
  }

  if(two < 0.5) {
    twoTailResult = true;
  }

  return ( leftResult && rightResult ) && twoTailResult;
}

int main() {
  if(kt_fisher_exact_unit_test()) {
    std::cout<<"Test passed!"<<std::endl;
  } else {
    std::cout<<"Test failed!"<<std::endl;
  }
}

michael-imbeault commented 4 years ago

I could send files by email if needed, but I'm sure it can be tested internally (just the function).

If unit tests are implemented for this they should cover the most extreme values possible to ensure no overflow.

joshwilding4444 commented 4 years ago

You're correct. Using the numbers from your initial post, I've determined that the problem is likely in the functions hypergeo() and hypergeo_acc(), implemented in kfunc.cpp, and I have written tests for those functions. I am currently figuring out why large values give incorrect results in the hypergeometric distribution. hypergeo_acc() performs its own iterative steps before calling hypergeo(). The hypergeometric distribution in kfunc.cpp is built from a log-binomial function:

// hypergeometric distribution
double hypergeo(long long n11, long long n1_, long long n_1, long long n)
{
    return exp(lbinom(n1_, n11) + lbinom(n-n1_, n_1-n11) - lbinom(n, n_1));
}

The function lbinom() is defined in the same file:

double lbinom(long long n, long long k)
{
    if (k == 0 || n == k) return 0;
    return lgamma(n+1) - lgamma(k+1) - lgamma(n-k+1);
}

Here lgamma is the log-gamma function from <cmath>, so lgamma(n+1) is log(n!). I'll be honest: it has been a while since my last statistics class, so I'm not sure whether anything here is mathematically incorrect. These functions give correct results for small values as far as I can tell. You're probably right that there is some kind of numerical over- or underflow problem, so the data types may need adjusting. I'm open to any suggestions on how best to resolve this.
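A quick stdlib-only Python check (an illustrative sketch mirroring the two functions above, not the C++ itself) suggests the log-space arithmetic is not the problem: lbinom() is exact for small inputs and nowhere near overflow even at genome scale. The weak point is exp(), which underflows to 0.0 once the log drops below roughly -745 (the log of the smallest subnormal double):

```python
# Illustrative Python mirror of kfunc.cpp's lbinom(), stdlib only.
from math import lgamma, exp, comb, isclose

def lbinom(n, k):
    # log of the binomial coefficient C(n, k)
    if k == 0 or n == k:
        return 0.0
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# agrees with exact integer binomial coefficients for small n
for n in range(1, 30):
    for k in range(n + 1):
        assert isclose(exp(lbinom(n, k)), comb(n, k), rel_tol=1e-9)

# The second table from this issue (n11=4155, n1_=9058, n_1=809618,
# n=9322038): the log-space value is a perfectly ordinary double,
# but exp() of it is below the smallest representable double (~5e-324).
logp = (lbinom(9058, 4155)                        # lbinom(n1_, n11)
        + lbinom(9322038 - 9058, 809618 - 4155)   # lbinom(n - n1_, n_1 - n11)
        - lbinom(9322038, 809618))                # lbinom(n, n_1)
assert logp < -745 and exp(logp) == 0.0
```

So the data types themselves look fine; the issue is what happens downstream once hypergeo() starts returning exact zeros.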

arq5x commented 4 years ago

@brentp and I had discussed some of the instability in fisher and are favoring deprecation. @brentp do you have any ideas/comments about this discussion above?

brentp commented 4 years ago

I didn't find a way to overcome the overflows. I think that unless someone can step up and fix and maintain, then we should deprecate. There is limited utility in this approach despite my initial enthusiasm.

jmarshall commented 4 years ago

The kfunc.cpp code is copied from HTSlib (per PR #94) and the same issues are exhibited in current HTSlib. So if anyone has any insights I look forward to fixing kt_fisher_exact() in HTSlib too.

I've made a draft adding your test cases and various others to HTSlib's test suite: see PR samtools/htslib#1126. This also adds some debugging printfs to kt_fisher_exact() which suggest that the problem is not numeric overflow per se, so is not in hypergeo_acc()/etc or below.

Instead it appears that usually either i or j is exactly n11 and either left or right is adjusted accordingly. In the problematic test cases, neither i nor j is close to n11 and the wrong one of left/right is adjusted. So fixing the adjustment-choice test in these problematic cases would likely solve the problem. Unfortunately there is little hint as to how this code is intended to work — @lh3 can you shed any light on the background to this kt_fisher_exact() implementation?

brentp commented 4 years ago

When I copied it over, I changed the parameters to long long, so it differs at least from the original kfunc.c, but it still suffers from the same issues.

jmarshall commented 4 years ago

The situation in which neither i nor j is close to n11 arises when q is 0.0. For these large input values, hypergeo()'s lbinom + lbinom - lbinom is large and negative, i.e. ≤ -750 or so, hence exp() underflows and hypergeo() returns 0.0.

See PR #856 which handles the q == 0.0 case separately and produces the right results for @michael-imbeault's examples and various other large value test cases. However this code is opaque enough that I am not qualified to say whether it is entirely “right” or not…
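For anyone exploring alternatives, the underflow can also be dodged entirely by never leaving log space. Here is a pure-Python sketch of that idea (log-sum-exp accumulation of the right tail); this is an illustration of the technique, not the approach PR #856 actually takes:

```python
# Illustrative sketch: right-tail Fisher p-value kept in log space,
# so no individual term is ever pushed through exp() prematurely.
from math import lgamma, log1p, exp, inf

def lbinom(n, k):
    # log of the binomial coefficient C(n, k)
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_right_tail(n11, n12, n21, n22):
    """Return log of the right-tail Fisher p-value, accumulated with
    log-sum-exp so tiny terms never underflow to zero."""
    n1_, n_1 = n11 + n12, n11 + n21
    n = n11 + n12 + n21 + n22
    denom = lbinom(n, n_1)
    acc = -inf  # log of the running sum
    for k in range(n11, min(n1_, n_1) + 1):
        t = lbinom(n1_, k) + lbinom(n - n1_, n_1 - k) - denom
        if t > acc:          # keep the larger term in acc
            acc, t = t, acc
        if t > -inf:
            # log(exp(acc) + exp(t)), with exp(t - acc) <= 1
            acc += log1p(exp(t - acc))
    return acc
```

With this, the second table from this issue yields a finite log p-value far below exp()'s underflow cutoff, instead of collapsing to 0.0, while exp() of the first table's result recovers the reference 2.802e-145.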

joshwilding4444 commented 4 years ago

Running the unit test from earlier, I now get correct results with the new code. Thanks, @jmarshall. Have you had an opportunity to look at this, @michael-imbeault?

michael-imbeault commented 3 years ago

If the corrected version produces the expected results, that's all I can ask for - the logic makes sense (as far as I understand it). It wouldn't hurt to test the most extreme scenarios that could theoretically happen (values as large as the number of bases in the human genome, for example), just to make sure I don't have to open #3 in a few months :)