estebanz01 / ruby-statistics

Ruby gem for some statistical operations without any statistical language dependency
MIT License
109 stars 16 forks source link

added chi square independence test for contingency tables #115

Closed oliver-czulo closed 8 months ago

oliver-czulo commented 12 months ago

Added functions to test contingency tables for independence of observations. Tested for plausibility for 2x2 and 3x3 tables

estebanz01 commented 10 months ago

Happy new year! 🎉 I'm back so I'm going to give another pass and let you know.

estebanz01 commented 10 months ago

@oliver-czulo can you push and update your branch with latest master, please? for some reason I'm not seeing your latest changes 😀

oliver-czulo commented 10 months ago

Hi, just pushed the changes. Added tests, but failed to run them locally :(

There seems to be a problem with line 57 in lib/statistics/statistical_test/chi_squared_test.rb which reads probability = 1.0 - Statistics::Distribution::ChiSquared.new(df).cumulative_function(chi_score): The expected values and the chi_score I calculate match those of the Python class I compare with, but this line yields the value 1.0000333200206522 insted of 1.469045936431957e-25 (as from the Python class).

estebanz01 commented 10 months ago

Hi, just pushed the changes. Added tests, but failed to run them locally :(

There seems to be a problem with line 57 in lib/statistics/statistical_test/chi_squared_test.rb which reads probability = 1.0 - Statistics::Distribution::ChiSquared.new(df).cumulative_function(chi_score): The expected values and the chi_score I calculate match those of the Python class I compare with, but this line yields the value 1.0000333200206522 insted of 1.469045936431957e-25 (as from the Python class).

thanks! I'll take a look. Regarding tests, I use R as a source of truth, so I'll do a check there instead of python for this.

estebanz01 commented 10 months ago

I read the matrix completely backwards 😁 here's the results from R:

> y <- as.table(matrix(c(388,51692, 119,45633, 271,40040), ncol=2, byrow=TRUE))
> y
      A     B
A   388 51692
B   119 45633
C   271 40040
> summary(y)
Number of cases in table: 138143 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 114.3600283152, df = 2, p-value = 1.46904593643e-25
> chisq.test(y, correct = F, rescale.p = F)

    Pearson's Chi-squared test

data:  y
X-squared = 114.3600283152, df = 2, p-value < 2.220446049e-16

I'll check the calculations again on the ruby side.

oliver-czulo commented 10 months ago

I read the matrix completely backwards 😁

Might be due to a difference in how the R functions and the Ruby & Python functions read the tables 😊 It's also not consistent in the "literature"; for comparison, the following example seems to build the contingency tables in R style, unlike other examples: https://statisticsbyjim.com/hypothesis-testing/chi-square-test-independence-example/

I'll check the calculations again on the ruby side.

The one difference between the R calculation and the Ruby code seems to originate from the call in what is now line 57: probability = 1.0 - Statistics::Distribution::ChiSquared.new(df).cumulative_function(chi_score), i.e. it seems the cumulative_function is returning something wrong.

///

Changes for the latest push:

Currently, I don't see further tasks on my side for this class, do let me know in case I have overlooked anything. I'll look into the cumulative_function to see whether I can identify anything.

estebanz01 commented 10 months ago

I'll look into the cumulative_function to see whether I can identify anything.

I found the culprit, python and R (when calling summary function on a table), are returning the calculation of the special case for the CDF of the chi squared function (as described in the wiki).

but now, this raises me a question on how it's really calculated the test of independence for contingency tables in the R implementation, because it's giving me a different p-value:

> chisq.test(matrix(c(388,51692, 119,45633, 271,40040), ncol = 2, nrow = 3), correct = F, rescale.p = F)

    Pearson's Chi-squared test

data:  matrix(c(388, 51692, 119, 45633, 271, 40040), ncol = 2, nrow = 3)
X-squared = 134854.9483015367296, df = 2, p-value < 2.220446049250313e-16

I'll keep my research.

estebanz01 commented 9 months ago

hola!!! I merged the changes on #122 so when you have time, you can rebase and we can finally merge the PR once the two minor addtional comments are addressed! 🥳

oliver-czulo commented 8 months ago

You may have noticed that I was working on the PR last weekend. When trying to upload, I get two failed tests still, as follows:

  1. I introduced a mistake in the test class myself where I expected a p-value of 1.469 where Ruby returns 0.0 (it should actually be 1.469045936431957e-25, but we're not calculating with BigDecimal here). I basically fixed that in my code.

  2. Weirder is the other error. In the test, I say expect(result).to eq(Matrix[[(40518240/138143), (7153969200/138143)], [(35595056/138143), (6284723480/138143)], [(31361958/138143), (5537320515/138143)]]) The test environment, however, tells me:

    expected: Matrix[[293, 51786], [257, 45494], [227, 40083]] 
    got: Matrix[[(40518240/138143), (7153969200/138143)], [(35595056/138143), (6284723480/138143)], [(31361958/138143), (5537320515/138143)]]

    My impression is that it evaluates my expected expression to Matrix[[293, 51786], ...], and then tests against the very Matrix that I am expecting. Not sure how to get around this one... any idea? Unfortunately, my local test environment is still not functional, but I don't wanna upload further failed tests to Github either.

estebanz01 commented 8 months ago

You may have noticed that I was working on the PR last weekend. When trying to upload, I get two failed tests still, as follows:

1. I introduced a mistake in the test class myself where I expected a p-value of `1.469` where Ruby returns `0.0` (it should actually be `1.469045936431957e-25`, but we're not calculating with BigDecimal here). I basically fixed that in my code.

2. Weirder is the other error. In the test, I say
   `expect(result).to eq(Matrix[[(40518240/138143), (7153969200/138143)], [(35595056/138143), (6284723480/138143)], [(31361958/138143), (5537320515/138143)]])`
   The test environment, however, tells me:
expected: Matrix[[293, 51786], [257, 45494], [227, 40083]] 
 got: Matrix[[(40518240/138143), (7153969200/138143)], [(35595056/138143), (6284723480/138143)], [(31361958/138143), (5537320515/138143)]]

My impression is that it evaluates my expected expression to Matrix[[293, 51786], ...], and then tests against the very Matrix that I am expecting. Not sure how to get around this one... any idea? Unfortunately, my local test environment is still not functional, but I don't wanna upload further failed tests to Github either.

yep! I believe the issue with that Matrix test is that you added to_r here: https://github.com/estebanz01/ruby-statistics/pull/115/files#diff-17425d81e942932af286ff65eac7cffd863b318a3836b5e9bae1e07c4c024c41R75-R76

This rational change is propagated to this calculation: https://github.com/estebanz01/ruby-statistics/pull/115/files#diff-17425d81e942932af286ff65eac7cffd863b318a3836b5e9bae1e07c4c024c41R84

I think what you can do is to convert the resulting matrix to integer entries via to_i or to float, if you want to keep potential existing decimals with to_f or the expected matrix convert it to rational entries with to_r. Something like this:

expected_matrix = Matrix[[293, 51786], [257, 45494], [227, 40083]]

# Option # 1
expect(result.map(&:to_i)).to eq expected_matrix
# Option # 2
expect(result.map(&:to_f)).to eq expected_matrix.map(&:to_f)
# Option 3
expect(result).to eq expected_matrix.map(&:to_r)

up to you which option you like the most 😀 and once again, thank you so much for your work on this @oliver-czulo !

estebanz01 commented 8 months ago

Thank you so much for your effort @oliver-czulo ! I'm merging this.

estebanz01 commented 8 months ago

damn, sorry, but I need to revert this. The results are not consistent with R results and I cannot find proper literature to do calculations of the chi square test that cannot be done with simple arrays as it's already implemented.

estebanz01 commented 8 months ago

feel free to open the PR again, I'm going look at the code for this scypi code to understand more what's going on: https://github.com/scipy/scipy/blob/v1.12.0/scipy/stats/contingency.py#L144-L365