MastodonC / kixi.stats

A library of statistical distribution sampling and transducing functions
https://cljdoc.xyz/d/kixi/stats

Calculating a p-value for a Pearson's correlation #40

Closed (kovasap closed this issue 1 month ago)

kovasap commented 2 years ago

I'm interested in calculating a p-value for a correlation that I'm computing with kixi.stats.core/correlation. Does a function for this already exist in the library, or are there functions that would help? For example, based on the description at https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/correlation/methods-and-formulas/, I think I need to be able to generate a t distribution in order to calculate this p-value.

Or maybe it's possible to calculate this p-value from the error functions that already exist in kixi.stats?

If this is not possible and/or if there is a better way to do this, I'm happy to hear about it!
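For reference, the Minitab page linked above reduces the problem to a t statistic, t = r * sqrt(n - 2) / sqrt(1 - r^2), referred to a t distribution with n - 2 degrees of freedom. A minimal sketch of that calculation, assuming kixi.stats.distribution/cdf behaves as a standard cumulative distribution function (worth verifying against the docs):

```clojure
(require '[kixi.stats.distribution :as dist]
         '[kixi.stats.math :refer [sq sqrt]])

(defn pearson-p-value
  "Two-tailed p-value for a Pearson correlation r over n observations,
  following the Minitab formula: t = r * sqrt(n - 2) / sqrt(1 - r^2)."
  [r n]
  (let [df (- n 2)
        t  (/ (* r (sqrt df)) (sqrt (- 1 (sq r))))]
    ;; Double the lower-tail probability of -|t| for a two-tailed test
    (* 2 (dist/cdf (dist/t {:v df}) (- (Math/abs (double t)))))))
```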

henrygarner commented 2 years ago

@kovasap this is something which would be an excellent addition to the library. There are a handful of significance tests already in the kixi.stats.test namespace, but not correlation yet.

I've sketched a solution (using kixi.stats.digest/sum-squares rather than kixi.stats.core/correlation because the latter doesn't return the count of rows we need for significance testing). Hopefully it illustrates some of the support you get from the kixi.stats.distribution and kixi.stats.test namespaces for this sort of use case.

(def data [{:x 1 :y 2} {:x 2 :y 2} {:x 3 :y 4} {:x 4 :y 4}])

(require '[kixi.stats.core :as kixi]
         '[kixi.stats.math :refer [sq sqrt]]
         '[kixi.stats.digest :as digest]
         '[kixi.stats.distribution :as dist]
         '[kixi.stats.test :as t])

(defn correlation-test
  [fx fy]
  (kixi/post-complete
   (digest/sum-squares fx fy)
   (fn [{:keys [n ss-x ss-y ss-xy]}]
     (let [denominator (sqrt (* ss-x ss-y))]
       (when-not (zero? denominator)
         (let [r (/ ss-xy denominator)
               degrees-of-freedom (- n 2)
               t-stat (/ (* r (sqrt degrees-of-freedom))
                         (sqrt (- 1 (sq r))))]
           {:correlation r
            :test (t/test-result t-stat (dist/t {:v degrees-of-freedom}))}))))))

(def result
  (transduce identity (correlation-test :x :y) data))

;; Return the calculated correlation
(-> result :correlation)

;; Return the p-value from the test result
(-> result :test t/p-value)

;; Perform a two-tailed test of significance given 5% alpha
(-> result :test (t/significant? 0.05 :<>))

kovasap commented 2 years ago

Thanks for the detailed response! By count of rows do you mean the number of data points used to compute the correlation ((count data) in your example above), which we need to feed into dist/t via the degrees-of-freedom var? Or is there additional logic required to do this kind of test for a kixi.stats.core/correlation?

If/when I end up writing a test for kixi.stats.core/correlation, would you be interested in a pull request?

henrygarner commented 2 years ago

> do you mean the number of data points used to compute the correlation ((count data) in your example above)?

That's right. The number of observations is required for calculating significance and this manifests as the t distribution degrees of freedom. For this reason I haven't used kixi.stats.core/correlation above.
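If you did want to reuse kixi.stats.core/correlation, one option is to run it alongside a count in a single pass. A hedged sketch, assuming kixi.stats.core/count exists and using redux.core/fuse from the redux library (a kixi.stats dependency); both are worth double-checking:

```clojure
(require '[kixi.stats.core :as kixi]
         '[redux.core :as redux])

;; Fuse two reducing functions so one transduce over the data
;; yields both the correlation and the number of observations.
(def correlation-with-n
  (redux/fuse {:r (kixi/correlation :x :y)
               :n kixi/count}))

;; Returns a map with :r and :n; :n then feeds the degrees of freedom.
(transduce identity correlation-with-n data)
```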

It would be good to consider a consistent strategy for optionally returning significance tests alongside statistics.

PRs welcome!

kovasap commented 2 years ago

Ok, got it. I'm now in the process of getting this working for kixi.stats.core/correlation in my own code, and I'm having some trouble understanding the code you are calling at https://github.com/MastodonC/kixi.stats/blob/7dbc5fd7873e9a4dc9ddb7a8b8da1d04935a8701/src/kixi/stats/test.cljc#L10-L28 Specifically, I'm wondering what the difference is between the p-value calculation and the significant? calculation. Based on my (mediocre) understanding of statistics, I would expect significant? to just check whether the p-value is below the given alpha (0.05 in your case). Instead it seems to be using a different calculation over the distribution than the one used for p-value?

henrygarner commented 2 years ago

That's a fair question: it would be possible to leverage the p-value calculation in order to determine significance at a given alpha. Given a test statistic, a test distribution, and a desired alpha, we can determine significance in either of at least two ways:

1. Calculate the p-value of the test statistic and check whether it falls below alpha.
2. Calculate the critical value of the distribution at alpha and check whether the test statistic exceeds it.

It happens that the latter approach seems a little more straightforward to me, and I think this is borne out by the relative complexity of the implementations of p-value and significant? respectively.
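The two approaches should always agree on the accept/reject decision. A sketch of the equivalence for a two-tailed test, assuming kixi.stats.distribution/cdf and kixi.stats.distribution/quantile behave as standard CDF and quantile functions (worth verifying; the t statistic and alpha here are illustrative):

```clojure
(require '[kixi.stats.distribution :as dist])

(let [d        (dist/t {:v 10})
      t-stat   2.5
      alpha    0.05
      ;; Approach 1: two-tailed p-value, compared against alpha
      p        (* 2 (dist/cdf d (- (Math/abs t-stat))))
      ;; Approach 2: critical value at alpha, compared against the statistic
      critical (dist/quantile d (- 1 (/ alpha 2)))]
  ;; Both booleans should be equal for any t-stat and alpha
  [(< p alpha) (> (Math/abs t-stat) critical)])
```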

kovasap commented 2 years ago

Got it, that makes sense. I'm asking because I want to display p-values in a table I'm making, and I want to be sure it's accurate to say that p-values below my desired alpha are significant (so I can color/sort them appropriately). I wanted to make sure that would give the same results as using significant?, so I wouldn't hit strange inconsistencies like a p-value greater than my alpha being classed as significant.