estebanz01 / ruby-statistics

Ruby gem for some statistical operations without any statistical language dependency
MIT License
109 stars 16 forks source link

result of Distribution::ChiSquared.cumulative_function outside [0..1] range #113

Open oliver-czulo opened 1 year ago

oliver-czulo commented 1 year ago

I work with the following two test contingency tables (observed1, observed2) with these resulting variables:

observed1 = [[388,51692],[119,45633]]
expected1 = [[269.8969662278191, 51810.10303377218], [237.10303377218088, 45514.89696622782]]
chi_score1 = 111.0839904758887
df1 = 1

observed2 = [[388,51692],[119,45633],[271,40040]]
expected2 = [[293.3065012342283, 51786.69349876577], [257.66818441759625, 45494.3318155824], [227.02531434817544, 40083.974685651825]]
chi_score2 = 114.36002831520479
df2 = 2

When calculated with Python, I get the following results when extracting the pvalue (replace observed with observed1 or observed2):

import scipy.stats as stats
# include above variables here
X2 = stats.chi2_contingency(observed, correction=False)[0]

# observed1 => pvalue = 5.671618200219206e-26
# observed2 => pvalue = 1.469045936431957e-25

When using Distribution::ChiSquared.cumulative_function from the ruby-statistics package, I get the following results (replace dfwith df1or df2, chi_score likewise):

probability = 1.0 - Statistics::Distribution::ChiSquared.new(df).cumulative_function(chi_score)
p_value = 1.0 - probability

# observed1 => p_value = Infinity
# observed2 => p_value = 1.0000333200206515

While I cannot confirm the Python module yields the correct values, it seems more plausible, considering that the p-value should be in the range [0..1]?

estebanz01 commented 1 year ago

ok, so a couple of questions:

I took the liberty to sightly modify your inputs and this is what I got:

# data setup
observations_1 = [388, 51692, 119, 45633]
expectations_1 = [269.8969662278191, 51810.10303377218, 237.10303377218088, 45514.89696622782]
chi_score_1, df_1 = *StatisticalTest::ChiSquaredTest.chi_statistic(expectations_1, observations_1)

# outputs
# irb(main):025:0> chi_score_1
# => 111.0839904758887
# irb(main):026:0> df_1
# => 3

#  Goodness of fit with 95 % confidence
StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expectations_1, observations_1)
# => {:probability=>1.0000001945374604, :p_value=>-1.945374603629091e-07, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}

now, the problem still persist in that probability is bigger than 1.0 😛 so I'm going to look into that.

Here's the same modifications made for the second example:

irb(main):028:0> observations_2 = [388, 51692, 119, 45633, 271, 40040]
=> [388, 51692, 119, 45633, 271, 40040]
irb(main):029:0> expectations_2 = [293.3065012342283, 51786.69349876577, 257.66818441759625, 45494.3318155824, 227.02531434817544, 40083.974685651825]
=> [293.3065012342283, 51786.69349876577, 257.66818441759625, 45494.3318155824, 227.02531434817544, 40083.974685651825]
irb(main):030:0> chi_score_2, df_2 = *StatisticalTest::ChiSquaredTest.chi_statistic(expectations_2, observations_2)
=> [114.36002831520479, 5]
irb(main):031:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expectations_2, observations_2)
=> {:probability=>1.0000000000064857, :p_value=>-6.4857008652552395e-12, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}

I'll also going to do a read on the python implementation. I'm not familiar with that package, so probably it works the same but with a different API.

estebanz01 commented 1 year ago

here's a variant that gives me infinity in the probability with the degrees of freedom you are using:

irb(main):032:0> observed_1_variant = [388,51692]
=> [388, 51692]
irb(main):033:0> expected_1_variant = [269.8969662278191, 51810.10303377218]
=> [269.8969662278191, 51810.10303377218]
irb(main):034:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expected_1_variant, observed_1_variant)
=> {:probability=>Infinity, :p_value=>-Infinity, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}

from what I recall (but correct me if I'm wrong!), we need a minimum of 3 or more elements to perform a Chi squared test, so this result makes sense to me: I need more data to perform a proper test 😛

estebanz01 commented 1 year ago

here's an R version:

> expected
[1]   269.897 51810.103   237.103 45514.897
> observed
[1]   388 51692   119 45633
> chisq.test(as.table(observed, expected))

    Chi-squared test for given probabilities

data:  as.table(observed, expected)
X-squared = 96566, df = 3, p-value < 2.2e-16

> chisq.test(as.table(expected, observed))

    Chi-squared test for given probabilities

data:  as.table(expected, observed)
X-squared = 96625, df = 3, p-value < 2.2e-16
oliver-czulo commented 1 year ago

As for questions 1 and 2: The nested array observed1, for instance, represents a 22 contingency table, each sub-array stands for a row. Degree of freedom is then calculated as `#rows-1 #columns-1` (see also this statology page).

As for question 3, I use the following function (in a local test class) to calculate the chi score iterating over all cells of the contingency table:

def self.calculate_chi_score(observed, expected)
    sum = 0.0
    observed.size.times { |i|
        observed[i].size.times { |j|
            sum += ((observed[i])[j] - (expected[i])[j])**2 / (expected[i])[j]
         }
    }
    sum
end

I guess the implementation is not optimal, but I wasn't successful in reproducing this behaviour reliably with map or similar functions.

estebanz01 commented 1 year ago

got it. I'll do some reading and exploration then and be right back to you.

obromios commented 1 week ago

Consider a chi square distribution table. If you look at the second row (df=2), it predicts a value of 0.1 if the chi_squared value is 4.61. Using your gem

>> 1-Statistics::Distribution::ChiSquared.new(2).cumulative_function(4.61)
=> 0.09972206384119364

which rounds to 0.10. so it appears your function and the table agree. But if we try the first row (df = 1), for a chi_squared value of 2.71. it also predicts a value of 0.1, but your function does

>> 1-Statistics::Distribution::ChiSquared.new(1).cumulative_function(2.71)
=> -Infinity

i.e. it does not agree with the table. As far as I can see, if you use df of 1, then any chi_squared_value gives -Infinity. It appears to me that the function is not returning the correct value when the number of degrees of freedom is 1.

estebanz01 commented 1 week ago

@obromios hi! interesting check. It's been a while since I try to stab a this problem, so it's going to be a bit slow for me to reply with a possible fix, if there's one at all.

Thanks for looking at it and I'll try my best to allocate time to fix this!