Closed kovasap closed 1 month ago
@kovasap this is something which would be an excellent addition to the library. There are a handful of significance tests already in the kixi.stats.test
namespace, but not correlation yet.
I've sketched a solution (using kixi.stats.digest/sum-squares
rather than kixi.stats.core/correlation
because the latter doesn't return the count of rows we need for significance testing). Hopefully it illustrates some of the support you get from the kixi.stats.distribution
and kixi.stats.test
namespaces for this sort of use case.
(def data [{:x 1 :y 2} {:x 2 :y 2} {:x 3 :y 4} {:x 4 :y 4}])
(require '[kixi.stats.core :as kixi]
'[kixi.stats.math :refer [sq sqrt]]
'[kixi.stats.digest :as digest]
'[kixi.stats.distribution :as dist]
'[kixi.stats.test :as t])
(defn correlation-test
[fx fy]
(kixi/post-complete
(digest/sum-squares fx fy)
(fn [{:keys [n ss-x ss-y ss-xy]}]
(let [denominator (sqrt (* ss-x ss-y))]
(when-not (zero? denominator)
(let [r (/ ss-xy denominator)
degrees-of-freedom (- n 2)
t-stat (/ (* r (sqrt degrees-of-freedom))
(sqrt (- 1 (sq r))))]
{:correlation r
:test (t/test-result t-stat (dist/t {:v degrees-of-freedom}))}))))))
(def result
(transduce identity (correlation-test :x :y) data))
;; Return the calculated correlation
(-> result :correlation)
;; Return the p-value from the test result
(-> result :test t/p-value)
;; Perform a two-tailed test of significance given 5% alpha
(-> result :test (t/significant? 0.05 :<>))
Thanks for the detailed response! By count of rows do you mean the number of data points used to compute the correlation ((count data)
in your example above), which we need to feed into dist/t
via the degrees-of-freedom
var? Or is there additional logic required to do this kind of test for a kixi.stats.core/correlation
?
If/when I end up writing a test for kixi.stats.core/correlation
, would you be interested in a pull request?
do you mean the number of data points used to compute the correlation ((count data) in your example above
That's right. The number of observations is required for calculating significance and this manifests as the t distribution degrees of freedom. For this reason I haven't used kixi.stats.core/correlation
above.
It would be good to consider a consistent strategy for optionally returning significance tests alongside statistics.
PRs welcome!
Ok got it. I'm in the process of getting this working in my hands for kixi.stats.core/correlation
and am having some trouble understanding the code you are calling at https://github.com/MastodonC/kixi.stats/blob/7dbc5fd7873e9a4dc9ddb7a8b8da1d04935a8701/src/kixi/stats/test.cljc#L10-L28
Specifically, I'm wondering what the difference is between the p-value
calculation and the significant?
calculation. Based on my (mediocre) understanding of statistics, I would expect the significant?
calculation to just check if the p-value is below the given value (0.05
in your case). Instead it seems to be using a different distribution to calculate it than the one used for p-value
?
That's a fair question: it would be possible to leverage the p-value calculation in order to determine significance at a given alpha. Given a test statistic, a test distribution, and a desired alpha, we can determine significance in either of at least two ways:
It happens that the latter approach seems a little more straightforward to me, and I think this is borne out by the complexity of the implementations of p-value
and significant?
respectively.
Got it that makes sense. I'm asking because I want to display p-values in a table I'm making and I want to make sure that it's accurate to say that p-values less than my desired alpha are significant (so I can color/sort them appropriately). I wanted to make sure that would give the same results as using significant?
so I wouldn't have strange dependencies like a p-value greater than my alpha being classed as significant.
I'm interested in calculating a p-value for a correlation that I'm calculating using
kixi.stats.core/correlation
. I'm curious if a function to do this already exists in this library, or if functions that would help do this exist. For example, based on the description at https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/correlation/methods-and-formulas/, I need to be able to generate a t distribution (I think?) in order to calculate this p-value.Or maybe it's possible to calculate this p-value from the error functions that already exist in kixi.stats?
If this is not possible and/or if there is a better way to do this, I'm happy to hear about it!