A Clojure/ClojureScript library of statistical sampling and transducing functions.
Available distributions:
Statistical tests:
Available transducing functions:
Variance, covariance, standard deviation, skewness and kurtosis each have sample and population variants.
View the documentation here.
Examples of kixi.stats
usage can be seen between 10:20-16:00 of this video on Clojure for Machine Learning.
Install kixi.stats
into your Clojure project using the appropriate form at its
Clojars page:
Or grab the most recent code using a Git dependency:
replace $GIT_SHA with the sha you'd like to target!
{kixi/stats
{:git/url "https://github.com/MastodonC/kixi.stats.git"
:git/sha "$GIT_SHA"}}
Transducing functions
kixi.stats.core contains statistical reducing functions that can be used with transduce
:
(require '[kixi.stats.core :refer [standard-deviation correlation]])
(->> [{:x 2} {:x 4} {:x 4} {:x 4} {:x 5} {:x 5} {:x 5} {:x 7} {:x 9}]
(transduce (map :x) standard-deviation))
;; => 2.0
(->> [{:x 1 :y 3} {:x 2 :y 2} {:x 3 :y 1}]
(transduce identity (correlation :x :y)))
;; => -1.0
(->> [{:x 1 :y 3 :z 2} {:x 2 :y 2 :z 4} {:x 3 :y 1 :z 6}]
(transduce identity (correlation-matrix {:x :x :y :y :z :z})))
;; => {[:x :y] -1.0, [:x :z] 1.0, [:y :z] -1.0,
;; [:y :x] -1.0, [:z :x] 1.0, [:z :y] -1.0}
One advantage of using transduce
for statistics calculation is that multiple statistics can be calculated simultaneously by composing together reducing functions. The generic combinators available in redux or xforms can be used together with the reducing functions in kixi.stats
. For example, redux' fuse
will return a higher-order reducing function that can be used to execute an arbitrary number of reducing functions simultaneously:
(require '[kixi.stats.core :refer [mean standard-deviation]]
'[redux.core :refer [fuse]])
;; Calculate mean and standard deviation at the same time:
(->> [2 4 4 4 5 5 5 7 9]
(transduce identity (fuse {:mean mean :sd standard-deviation})))
;; => {:mean 5.0, :sd 2.0}
Integration with transducers means that the wealth of core Clojure support can be applied to working with statistics. For example, filter
can be used to constrain the elements over which statistics are calculated:
(require '[kixi.stats.core :refer [median]])
(def gt5? (filter #(> % 5)))
;; Calculate the median only of numbers greater than 5:
(transduce gt5? median (range 10))
;; => 7.5
So long as xform
is a stateless transducer, we can use it to create a new reducing function locally which doesn't affect other reducing functions also being composed:
(require '[kixi.stats.core :refer [count]]
'[redux.core :refer [fuse]])
(def gt5? (filter #(> % 5)))
;; Count both all numbers and those greater than 5:
(transduce identity (fuse {:n count :gt5 (gt5? count)}) (range 10))
;; => {:n 10, :gt5 4}
The kixi.stats
API is focused primarily on statistical functions and doesn't need to be littered with exhaustive count-when
-style specialisms. Combinators from libraries such as redux and Clojure itself can be used to combine those functions in sophisticated ways.
Empirical distribution histograms
The Clojure version of kixi.stats.core
contains reducing functions for calculating the median, interquartile range and 5-number summary using the t-digest. They can be used like this:
(require '[kixi.stats.core :refer [median iqr summary]]
'[redux.core :refer [fuse]])
;; Calculate the median, iqr and 5-number summary:
(->> (range 100)
(transduce identity (fuse {:median median
:iqr iqr
:summary summary})))
;; => {:median 49.5, :iqr 50.0, :summary {:min 0.0, :q1 24.5, :median 49.5, :q3 74.5, :max 99.0, :iqr 50.0}}
Although this works fine, it should be noted that each function maintains its own digest. In cases where multiple quantiles must be calculated it's more efficient to calculate a single digest with the histogram
function and subsequently query it with the equivalent functions from the kixi.stats.distribution
namespace.
(require '[kixi.stats.core :refer [histogram]]
'[kixi.stats.distribution :refer [quantile]])
;; Calculate the 2.5 and 97.5 quantile from an empirical distribution
(def distribution
(->> (range 100)
(transduce identity histogram)))
{:lower (quantile distribution 0.025)
:upper (quantile distribution 0.975)}
;; => {:lower 2.0, :upper 97.0}
The post-complete
function defined in the kixi.stats.core
allows us to chain the histogram and quantile steps like so:
(require '[kixi.stats.core :refer [histogram post-complete]]
'[kixi.stats.distribution :refer [quantile]])
;; Calculate the 2.5 and 97.5 quantile from an empirical disribution
(->> (range 100)
(transduce identity (post-complete histogram
(fn [hist]
{:lower (quantile hist 0.025)
:upper (quantile hist 0.975)}))
;; => {:lower 2.0, :upper 97.0}
The kixi.stats.distribution
namespace contains many functions for operating on histograms which mirror the names from kixi.stats.core
: cdf
, iqr
, minimum
, maximum
, quantile
and summary
. In each case, the kixi.stats.core
function will return a reducing function for use with transduce
whereas the kixi.stats.distribution
function will accept a calculated digest and return a value directly.
Distribution sampling
kixi.stats.distribution contains functions for specifying and sampling from statistical distributions.
(require '[kixi.stats.distribution :refer [draw sample binomial]])
(draw (binomial {:n 100 :p 0.5}))
;;=> 54
(sample 10 (binomial {:n 100 :p 0.5}))
;;=> (49 53 53 44 55 47 45 51 49 51)
draw
and sample
are the primary means of extracting variates from a distribution. draw
returns a single variate whereas sample
returns n variates.
Each distribution implements the clojure.lang.ISeq
/ ISeqable
interface, so n variates can be sampled with (take n (binomial {:n 100 :p 0.5}))
. However, where possible sample
uses optimisations to return exactly n variates, and should be preferred.
Discrete summarisation
The Bernoulli, binomial and categorical distributions are discrete, so samples can be summarised by counting the number of times each variate appears. Discrete distributions can be directly sampled in this way with sample-summary
:
(require '[kixi.stats.distribution :refer [sample-summary bernoulli]])
(sample-summary 1000 (bernoulli {:p 0.3}))
;;=> {true 296, false 704}
This is equivalent to (frequencies (sample 1000 (bernoulli {:p 0.3})))
, but where possible sample-summary
uses optimisations to avoid reifying and aggregating a large intermediate sample, and should be preferred. When sample-summary
doesn't return a value for a particular variate, that value should be assumed zero.
Deterministic sampling
The sampling functions draw
, sample
and sample-summary
are all designed to perform deterministically when provided with a seed value. If repeatable samples are desired, pass {:seed SEED_LONG}
as the final argument:
(require '[kixi.stats.distribution :refer [uniform]])
(draw (uniform {:a 0 :b 1}) {:seed 42})
;;=> 0.7415648787718233
(draw (uniform {:a 0 :b 1}) {:seed 42})
;;=> 0.7415648787718233
Statistical tests
The kixi.stats.test namespace contains functions for performing statistical tests.
For example, we can perform a z-test between a known population mean & standard deviation and a sampled mean with a given sample size in the following way:
(require '[kixi.stats.test :refer [simple-z-test]])
(simple-z-test {:mu 100 :sd 12} {:mean 96 :n 55} {:tails :lower})
;;=> {:p-value 0.0067167326028858}
As with the kixi.stats.distribution
namespace - which contains many functions which mirror kixi.stats.core
- simple-z-test
is also available in kixi.stats.core
. The latter function returns a reducing function for use with transduce
.
(require '[kixi.stats.core :refer [simple-z-test]])
;; If the standard deviation is not provided, the sample standard deviation will be used instead (a 'plug-in test')
(transduce identity (simple-z-test {:mu 100}) (range 200))
;;=> {:p-value 0.9027648250246222}
Statistical reducing functions strong influenced by Tesser. Pseudorandom number generation is provided by test.check.
Copyright © 2024 Mastodon C Ltd
Distributed under the Eclipse Public License version 1.0.