Open oliver-czulo opened 1 year ago
ok, so a couple of questions:
(observed elements - 1)
? for observed1
shouldn't be 3 and for observed2
be 5 ?[388,51692]
and [119,45633]
you want to perform a Chi squared test on both arrays in a separated manner or do you want to compare both so only one chi squared test is performed?chi_score
variable?I took the liberty to sightly modify your inputs and this is what I got:
# data setup
observations_1 = [388, 51692, 119, 45633]
expectations_1 = [269.8969662278191, 51810.10303377218, 237.10303377218088, 45514.89696622782]
chi_score_1, df_1 = *StatisticalTest::ChiSquaredTest.chi_statistic(expectations_1, observations_1)
# outputs
# irb(main):025:0> chi_score_1
# => 111.0839904758887
# irb(main):026:0> df_1
# => 3
# Goodness of fit with 95 % confidence
StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expectations_1, observations_1)
# => {:probability=>1.0000001945374604, :p_value=>-1.945374603629091e-07, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}
now, the problem still persist in that probability is bigger than 1.0 😛 so I'm going to look into that.
Here's the same modifications made for the second example:
irb(main):028:0> observations_2 = [388, 51692, 119, 45633, 271, 40040]
=> [388, 51692, 119, 45633, 271, 40040]
irb(main):029:0> expectations_2 = [293.3065012342283, 51786.69349876577, 257.66818441759625, 45494.3318155824, 227.02531434817544, 40083.974685651825]
=> [293.3065012342283, 51786.69349876577, 257.66818441759625, 45494.3318155824, 227.02531434817544, 40083.974685651825]
irb(main):030:0> chi_score_2, df_2 = *StatisticalTest::ChiSquaredTest.chi_statistic(expectations_2, observations_2)
=> [114.36002831520479, 5]
irb(main):031:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expectations_2, observations_2)
=> {:probability=>1.0000000000064857, :p_value=>-6.4857008652552395e-12, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}
I'll also going to do a read on the python implementation. I'm not familiar with that package, so probably it works the same but with a different API.
here's a variant that gives me infinity in the probability with the degrees of freedom you are using:
irb(main):032:0> observed_1_variant = [388,51692]
=> [388, 51692]
irb(main):033:0> expected_1_variant = [269.8969662278191, 51810.10303377218]
=> [269.8969662278191, 51810.10303377218]
irb(main):034:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expected_1_variant, observed_1_variant)
=> {:probability=>Infinity, :p_value=>-Infinity, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}
from what I recall (but correct me if I'm wrong!), we need a minimum of 3 or more elements to perform a Chi squared test, so this result makes sense to me: I need more data to perform a proper test 😛
here's an R version:
> expected
[1] 269.897 51810.103 237.103 45514.897
> observed
[1] 388 51692 119 45633
> chisq.test(as.table(observed, expected))
Chi-squared test for given probabilities
data: as.table(observed, expected)
X-squared = 96566, df = 3, p-value < 2.2e-16
> chisq.test(as.table(expected, observed))
Chi-squared test for given probabilities
data: as.table(expected, observed)
X-squared = 96625, df = 3, p-value < 2.2e-16
As for questions 1 and 2: The nested array observed1
, for instance, represents a 22 contingency table, each sub-array stands for a row. Degree of freedom is then calculated as `#rows-1 #columns-1` (see also this statology page).
As for question 3, I use the following function (in a local test class) to calculate the chi score iterating over all cells of the contingency table:
def self.calculate_chi_score(observed, expected)
sum = 0.0
observed.size.times { |i|
observed[i].size.times { |j|
sum += ((observed[i])[j] - (expected[i])[j])**2 / (expected[i])[j]
}
}
sum
end
I guess the implementation is not optimal, but I wasn't successful in reproducing this behaviour reliably with map
or similar functions.
got it. I'll do some reading and exploration then and be right back to you.
Consider a chi square distribution table. If you look at the second row (df=2), it predicts a value of 0.1 if the chi_squared value is 4.61. Using your gem
>> 1-Statistics::Distribution::ChiSquared.new(2).cumulative_function(4.61)
=> 0.09972206384119364
which rounds to 0.10. so it appears your function and the table agree. But if we try the first row (df = 1), for a chi_squared value of 2.71. it also predicts a value of 0.1, but your function does
>> 1-Statistics::Distribution::ChiSquared.new(1).cumulative_function(2.71)
=> -Infinity
i.e. it does not agree with the table. As far as I can see, if you use df of 1, then any chi_squared_value gives -Infinity. It appears to me that the function is not returning the correct value when the number of degrees of freedom is 1.
@obromios hi! interesting check. It's been a while since I try to stab a this problem, so it's going to be a bit slow for me to reply with a possible fix, if there's one at all.
Thanks for looking at it and I'll try my best to allocate time to fix this!
I work with the following two test contingency tables (
observed1
,observed2
) with these resulting variables:When calculated with Python, I get the following results when extracting the
pvalue
(replaceobserved
withobserved1
orobserved2
):When using
Distribution::ChiSquared.cumulative_function
from theruby-statistics
package, I get the following results (replacedf
withdf1
ordf2
,chi_score
likewise):While I cannot confirm the Python module yields the correct values, it seems more plausible, considering that the p-value should be in the range [0..1]?