Closed oliver-czulo closed 8 months ago
Happy new year! 🎉 I'm back so I'm going to give another pass and let you know.
@oliver-czulo can you push and update your branch with latest master, please? for some reason I'm not seeing your latest changes 😀
Hi, just pushed the changes. Added tests, but failed to run them locally :(
There seems to be a problem with line 57 in lib/statistics/statistical_test/chi_squared_test.rb
which reads probability = 1.0 - Statistics::Distribution::ChiSquared.new(df).cumulative_function(chi_score)
: The expected values and the chi_score
I calculate match those of the Python class I compare with, but this line yields the value 1.0000333200206522
insted of 1.469045936431957e-25
(as from the Python class).
Hi, just pushed the changes. Added tests, but failed to run them locally :(
There seems to be a problem with line 57 in
lib/statistics/statistical_test/chi_squared_test.rb
which readsprobability = 1.0 - Statistics::Distribution::ChiSquared.new(df).cumulative_function(chi_score)
: The expected values and thechi_score
I calculate match those of the Python class I compare with, but this line yields the value1.0000333200206522
insted of1.469045936431957e-25
(as from the Python class).
thanks! I'll take a look. Regarding tests, I use R as a source of truth, so I'll do a check there instead of python for this.
I read the matrix completely backwards 😁 here's the results from R:
> y <- as.table(matrix(c(388,51692, 119,45633, 271,40040), ncol=2, byrow=TRUE))
> y
A B
A 388 51692
B 119 45633
C 271 40040
> summary(y)
Number of cases in table: 138143
Number of factors: 2
Test for independence of all factors:
Chisq = 114.3600283152, df = 2, p-value = 1.46904593643e-25
> chisq.test(y, correct = F, rescale.p = F)
Pearson's Chi-squared test
data: y
X-squared = 114.3600283152, df = 2, p-value < 2.220446049e-16
I'll check the calculations again on the ruby side.
I read the matrix completely backwards 😁
Might be due to a difference in how the R functions and the Ruby & Python functions read the tables 😊 It's also not consistent in the "literature"; for comparison, the following example seems to build the contingency tables in R style, unlike other examples: https://statisticsbyjim.com/hypothesis-testing/chi-square-test-independence-example/
I'll check the calculations again on the ruby side.
The one difference between the R calculation and the Ruby code seems to originate from the call in what is now line 57: probability = 1.0 - Statistics::Distribution::ChiSquared.new(df).cumulative_function(chi_score)
, i.e. it seems the cumulative_function
is returning something wrong.
///
Changes for the latest push:
private_class_method
(thanks for the hint). test_of_independence
does not contain a matrix
component, as this kind of test, to my knowledge, only refers to contingeny tables anyway (i.e. there should be no such test for numerals or lists/onedimensional arrays). The variables and the methods used by the test_of_independence
-method all contain matrix
, however, to indicate their use of the type Matrix for contingency table representation.Currently, I don't see further tasks on my side for this class, do let me know in case I have overlooked anything. I'll look into the cumulative_function
to see whether I can identify anything.
I'll look into the
cumulative_function
to see whether I can identify anything.
I found the culprit, python and R (when calling summary function on a table), are returning the calculation of the special case for the CDF of the chi squared function (as described in the wiki).
but now, this raises me a question on how it's really calculated the test of independence for contingency tables in the R implementation, because it's giving me a different p-value:
> chisq.test(matrix(c(388,51692, 119,45633, 271,40040), ncol = 2, nrow = 3), correct = F, rescale.p = F)
Pearson's Chi-squared test
data: matrix(c(388, 51692, 119, 45633, 271, 40040), ncol = 2, nrow = 3)
X-squared = 134854.9483015367296, df = 2, p-value < 2.220446049250313e-16
I'll keep my research.
hola!!! I merged the changes on #122 so when you have time, you can rebase and we can finally merge the PR once the two minor addtional comments are addressed! 🥳
You may have noticed that I was working on the PR last weekend. When trying to upload, I get two failed tests still, as follows:
I introduced a mistake in the test class myself where I expected a p-value of 1.469
where Ruby returns 0.0
(it should actually be 1.469045936431957e-25
, but we're not calculating with BigDecimal here). I basically fixed that in my code.
Weirder is the other error. In the test, I say
expect(result).to eq(Matrix[[(40518240/138143), (7153969200/138143)], [(35595056/138143), (6284723480/138143)], [(31361958/138143), (5537320515/138143)]])
The test environment, however, tells me:
expected: Matrix[[293, 51786], [257, 45494], [227, 40083]]
got: Matrix[[(40518240/138143), (7153969200/138143)], [(35595056/138143), (6284723480/138143)], [(31361958/138143), (5537320515/138143)]]
My impression is that it evaluates my expected expression to Matrix[[293, 51786], ...]
, and then tests against the very Matrix that I am expecting. Not sure how to get around this one... any idea? Unfortunately, my local test environment is still not functional, but I don't wanna upload further failed tests to Github either.
You may have noticed that I was working on the PR last weekend. When trying to upload, I get two failed tests still, as follows:
1. I introduced a mistake in the test class myself where I expected a p-value of `1.469` where Ruby returns `0.0` (it should actually be `1.469045936431957e-25`, but we're not calculating with BigDecimal here). I basically fixed that in my code. 2. Weirder is the other error. In the test, I say `expect(result).to eq(Matrix[[(40518240/138143), (7153969200/138143)], [(35595056/138143), (6284723480/138143)], [(31361958/138143), (5537320515/138143)]])` The test environment, however, tells me:
expected: Matrix[[293, 51786], [257, 45494], [227, 40083]] got: Matrix[[(40518240/138143), (7153969200/138143)], [(35595056/138143), (6284723480/138143)], [(31361958/138143), (5537320515/138143)]]
My impression is that it evaluates my expected expression to
Matrix[[293, 51786], ...]
, and then tests against the very Matrix that I am expecting. Not sure how to get around this one... any idea? Unfortunately, my local test environment is still not functional, but I don't wanna upload further failed tests to Github either.
yep! I believe the issue with that Matrix test is that you added to_r
here: https://github.com/estebanz01/ruby-statistics/pull/115/files#diff-17425d81e942932af286ff65eac7cffd863b318a3836b5e9bae1e07c4c024c41R75-R76
This rational change is propagated to this calculation: https://github.com/estebanz01/ruby-statistics/pull/115/files#diff-17425d81e942932af286ff65eac7cffd863b318a3836b5e9bae1e07c4c024c41R84
I think what you can do is to convert the resulting matrix to integer entries via to_i
or to float, if you want to keep potential existing decimals with to_f
or the expected matrix convert it to rational entries with to_r
. Something like this:
expected_matrix = Matrix[[293, 51786], [257, 45494], [227, 40083]]
# Option # 1
expect(result.map(&:to_i)).to eq expected_matrix
# Option # 2
expect(result.map(&:to_f)).to eq expected_matrix.map(&:to_f)
# Option 3
expect(result).to eq expected_matrix.map(&:to_r)
up to you which option you like the most 😀 and once again, thank you so much for your work on this @oliver-czulo !
Thank you so much for your effort @oliver-czulo ! I'm merging this.
damn, sorry, but I need to revert this. The results are not consistent with R results and I cannot find proper literature to do calculations of the chi square test that cannot be done with simple arrays as it's already implemented.
feel free to open the PR again, I'm going look at the code for this scypi code to understand more what's going on: https://github.com/scipy/scipy/blob/v1.12.0/scipy/stats/contingency.py#L144-L365
Added functions to test contingency tables for independence of observations. Tested for plausibility for 2x2 and 3x3 tables