estebanz01 / ruby-statistics

Ruby gem for some statistical operations without any statistical language dependency
MIT License

result of Distribution::ChiSquared.cumulative_function outside [0..1] range #113

Open oliver-czulo opened 10 months ago

oliver-czulo commented 10 months ago

I work with the following two test contingency tables (observed1, observed2) with these resulting variables:

observed1 = [[388,51692],[119,45633]]
expected1 = [[269.8969662278191, 51810.10303377218], [237.10303377218088, 45514.89696622782]]
chi_score1 = 111.0839904758887
df1 = 1

observed2 = [[388,51692],[119,45633],[271,40040]]
expected2 = [[293.3065012342283, 51786.69349876577], [257.66818441759625, 45494.3318155824], [227.02531434817544, 40083.974685651825]]
chi_score2 = 114.36002831520479
df2 = 2

When calculated with Python, I get the following results when extracting the pvalue (replace observed with observed1 or observed2):

import scipy.stats as stats
# include above variables here
res = stats.chi2_contingency(observed, correction=False)
X2, pvalue = res[0], res[1]  # chi-squared statistic and p-value

# observed1 => pvalue = 5.671618200219206e-26
# observed2 => pvalue = 1.469045936431957e-25

When using Distribution::ChiSquared.cumulative_function from the ruby-statistics package, I get the following results (replace df with df1 or df2, and chi_score likewise):

probability = 1.0 - Statistics::Distribution::ChiSquared.new(df).cumulative_function(chi_score)
p_value = 1.0 - probability

# observed1 => p_value = Infinity
# observed2 => p_value = 1.0000333200206515

While I cannot confirm that the Python module yields the correct values, they seem more plausible, given that a p-value should lie in the range [0..1].
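For these two degrees of freedom the chi-squared survival function happens to have closed forms, so the expected p-values can be cross-checked with plain Ruby and no gem at all (a sketch: for df = 1 the survival function is erfc(sqrt(x/2)), for df = 2 it is exp(-x/2)):

```ruby
# Cross-check via closed forms of the chi-squared survival function:
# df = 1: p = erfc(sqrt(x / 2)); df = 2: p = exp(-x / 2).
chi_score1 = 111.0839904758887
chi_score2 = 114.36002831520479

p_value1 = Math.erfc(Math.sqrt(chi_score1 / 2.0)) # df = 1
p_value2 = Math.exp(-chi_score2 / 2.0)            # df = 2

puts p_value1 # ~5.67e-26, in line with the SciPy result
puts p_value2 # ~1.47e-25, in line with the SciPy result
```

Both values fall inside [0..1], which is what the gem's output should look like as well.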

estebanz01 commented 10 months ago

ok, so a couple of questions:

I took the liberty of slightly modifying your inputs, and this is what I got:

# data setup
observations_1 = [388, 51692, 119, 45633]
expectations_1 = [269.8969662278191, 51810.10303377218, 237.10303377218088, 45514.89696622782]
chi_score_1, df_1 = *StatisticalTest::ChiSquaredTest.chi_statistic(expectations_1, observations_1)

# outputs
# irb(main):025:0> chi_score_1
# => 111.0839904758887
# irb(main):026:0> df_1
# => 3

#  Goodness of fit with 95 % confidence
StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expectations_1, observations_1)
# => {:probability=>1.0000001945374604, :p_value=>-1.945374603629091e-07, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}

now, the problem still persists in that the probability is bigger than 1.0 😛 so I'm going to look into that.

Here's the same modifications made for the second example:

irb(main):028:0> observations_2 = [388, 51692, 119, 45633, 271, 40040]
=> [388, 51692, 119, 45633, 271, 40040]
irb(main):029:0> expectations_2 = [293.3065012342283, 51786.69349876577, 257.66818441759625, 45494.3318155824, 227.02531434817544, 40083.974685651825]
=> [293.3065012342283, 51786.69349876577, 257.66818441759625, 45494.3318155824, 227.02531434817544, 40083.974685651825]
irb(main):030:0> chi_score_2, df_2 = *StatisticalTest::ChiSquaredTest.chi_statistic(expectations_2, observations_2)
=> [114.36002831520479, 5]
irb(main):031:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expectations_2, observations_2)
=> {:probability=>1.0000000000064857, :p_value=>-6.4857008652552395e-12, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}

I'm also going to read up on the Python implementation. I'm not familiar with that package, so it probably works the same but with a different API.

estebanz01 commented 10 months ago

here's a variant that gives me infinity in the probability with the degrees of freedom you are using:

irb(main):032:0> observed_1_variant = [388,51692]
=> [388, 51692]
irb(main):033:0> expected_1_variant = [269.8969662278191, 51810.10303377218]
=> [269.8969662278191, 51810.10303377218]
irb(main):034:0> StatisticalTest::ChiSquaredTest.goodness_of_fit(0.05, expected_1_variant, observed_1_variant)
=> {:probability=>Infinity, :p_value=>-Infinity, :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}

from what I recall (but correct me if I'm wrong!), we need at least 3 elements to perform a chi-squared test, so this result makes sense to me: I need more data to perform a proper test 😛
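As a side note, the chi statistic itself stays finite for the two-cell variant; a quick hand computation (a sketch in plain Ruby, not using the gem) gives roughly 51.95, which suggests the Infinity arises later, when the distribution function is evaluated:

```ruby
# Hand computation of the chi statistic for the two-cell variant:
# sum over cells of (observed - expected)^2 / expected.
observed_1_variant = [388, 51692]
expected_1_variant = [269.8969662278191, 51810.10303377218]

chi = observed_1_variant.zip(expected_1_variant)
                        .sum { |o, e| (o - e)**2 / e }

puts chi # ~51.95, a finite statistic
```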

estebanz01 commented 10 months ago

here's an R version:

> expected
[1]   269.897 51810.103   237.103 45514.897
> observed
[1]   388 51692   119 45633
> chisq.test(as.table(observed, expected))

    Chi-squared test for given probabilities

data:  as.table(observed, expected)
X-squared = 96566, df = 3, p-value < 2.2e-16

> chisq.test(as.table(expected, observed))

    Chi-squared test for given probabilities

data:  as.table(expected, observed)
X-squared = 96625, df = 3, p-value < 2.2e-16

oliver-czulo commented 10 months ago

As for questions 1 and 2: The nested array observed1, for instance, represents a 2x2 contingency table; each sub-array stands for a row. The degrees of freedom are then calculated as `(#rows - 1) * (#columns - 1)` (see also this statology page).
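That formula can be checked directly against the two tables from the report (plain Ruby; `df_for` is just an illustrative helper, not part of the gem):

```ruby
# Degrees of freedom for a contingency table: (#rows - 1) * (#columns - 1).
observed1 = [[388, 51692], [119, 45633]]
observed2 = [[388, 51692], [119, 45633], [271, 40040]]

def df_for(table)
  (table.size - 1) * (table.first.size - 1)
end

puts df_for(observed1) # => 1
puts df_for(observed2) # => 2
```

This matches df1 = 1 and df2 = 2 above, whereas chi_statistic on the flattened arrays reported 3 and 5.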

As for question 3, I use the following function (in a local test class) to calculate the chi score iterating over all cells of the contingency table:

def self.calculate_chi_score(observed, expected)
    sum = 0.0
    observed.size.times do |i|
        observed[i].size.times do |j|
            sum += (observed[i][j] - expected[i][j])**2 / expected[i][j]
        end
    end
    sum
end

I guess the implementation is not optimal, but I wasn't successful in reproducing this behaviour reliably with map or similar functions.
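For what it's worth, here is a zip/sum equivalent of the loop above (a sketch; it assumes all rows have equal length):

```ruby
# Same computation as calculate_chi_score, expressed with zip and sum:
# pair up rows, then cells, and accumulate (o - e)^2 / e.
def chi_score(observed, expected)
  observed.zip(expected).sum do |obs_row, exp_row|
    obs_row.zip(exp_row).sum { |o, e| (o - e)**2 / e }
  end
end

observed1 = [[388, 51692], [119, 45633]]
expected1 = [[269.8969662278191, 51810.10303377218],
             [237.10303377218088, 45514.89696622782]]

puts chi_score(observed1, expected1) # ~111.084, matching chi_score1
```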

estebanz01 commented 10 months ago

got it. I'll do some reading and exploration, then get right back to you.