MastodonC / kixi.stats

A library of statistical distribution sampling and transducing functions
https://cljdoc.xyz/d/kixi/stats

Median #4

Closed sbelak closed 6 years ago

sbelak commented 7 years ago

Median is the only function I still turn to Incanter for, and I would like to stop doing that (and contribute to a library I use daily). Calculating the median efficiently is slightly tricky though -- have you given any thought to which guarantees/compromises you want to make with kixi? From what I've gathered, to calculate the median we need to either accept an approximate result, have the entire seq in memory at some point during computation, and/or use a specialized version for ints (or just have all 3 versions: median, median-int, and median-approx).
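To make the "entire seq in memory" option concrete, here is a minimal sketch of an exact median written as a plain reducing function. The name exact-median is hypothetical and is not part of kixi.stats; the sketch assumes numeric input and simply collects every value before sorting on completion.

```clojure
;; Hypothetical helper, not part of kixi.stats: an exact median as a
;; reducing function. It holds every input in memory until completion.
(defn exact-median
  ([] [])                      ; init: empty accumulator
  ([xs] (let [sorted (vec (sort xs))
              n      (count sorted)
              mid    (quot n 2)]
          (cond
            (zero? n) nil                      ; no data, no median
            (odd? n)  (nth sorted mid)         ; single middle value
            ;; even count: average the two middle values
            :else     (/ (+ (nth sorted (dec mid)) (nth sorted mid)) 2.0))))
  ([xs x] (conj xs x)))        ; step: remember the value

(transduce (map :x) exact-median [{:x 3} {:x 1} {:x 2}])
;; => 2
```

The cost is O(n) memory and an O(n log n) sort at the end, which is exactly the compromise the approximate and int-specialized variants would try to avoid.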

henrygarner commented 7 years ago

Hey Simon, it's quite a coincidence you chose to get in touch because I've been wondering about this only today. Thanks for offering to help make kixi.stats better!

I'm thinking of the median as a convenience function on a more general histogram reducer which, as you say, could be backed by any of a number of implementations depending on the desired tradeoffs. This would make including other summary statistics such as the IQR trivial.

One option with configurable accuracy is the t-digest. There are existing Java and JavaScript implementations, but the JavaScript implementation requires a few transitive dependencies so I thought I'd see if a Clojure implementation would keep things simple. I've begun to sketch out a generic Clojure(Script) implementation here, but it's just a prototype at the moment.

I envisage the API being something like:

(transduce (map :x) (kixi/median kixi/t-digest {:delta 0.01}) data)

where (kixi/median histogram opts) is a convenience for (redux/post-complete (histogram opts) #(percentile % 50))

and with a sensible default histogram:

(transduce (map :x) (kixi/median) data)

Would this meet your needs? Would you do anything differently or would you like to see an alternative histogram implementation prioritised?

Thanks, Henry
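The shape proposed above (a median defined as a post-completion step over a pluggable histogram reducer) can be sketched without any dependencies. Everything below is an assumption made for illustration: post-complete is a local stand-in for redux/post-complete built on clojure.core/completing, exact-histogram is a hypothetical value-to-count sorted map rather than a t-digest, and percentile uses the nearest-rank definition, so for an even number of values it returns the lower middle value instead of an average.

```clojure
;; Local stand-in for redux/post-complete: apply f to rf's completed result.
(defn post-complete [rf f]
  (completing rf (fn [acc] (f (rf acc)))))

;; Hypothetical exact histogram: a sorted map of value -> count.
;; opts is accepted only to mirror the proposed (histogram opts) shape.
(defn exact-histogram [_opts]
  (fn
    ([] (sorted-map))
    ([hist] hist)
    ([hist x] (update hist x (fnil inc 0)))))

;; Nearest-rank percentile (p in 0..100) over the histogram; nil when empty.
(defn percentile [hist p]
  (let [n (reduce + (vals hist))]
    (when (pos? n)
      (let [rank (max 1 (long (Math/ceil (* (/ p 100.0) n))))]
        ;; walk values in ascending order until the cumulative count hits rank
        (reduce (fn [seen [v c]]
                  (let [seen (+ seen c)]
                    (if (>= seen rank) (reduced v) seen)))
                0
                hist)))))

;; Median as a convenience over any histogram reducer, with a default.
(defn median
  ([] (median exact-histogram {}))
  ([histogram opts]
   (post-complete (histogram opts) #(percentile % 50))))

;; The IQR falls out of the same histogram.
(defn iqr [histogram opts]
  (post-complete (histogram opts)
                 #(- (percentile % 75) (percentile % 25))))

(transduce (map :x) (median) [{:x 5} {:x 1} {:x 3}])
;; => 3
```

Swapping in an approximate structure would only require another function with the same (histogram opts) shape plus a matching percentile, which is the point of routing the median through a general histogram.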

sbelak commented 7 years ago

You are right, I was looking at the problem too narrowly. When it comes to histograms, I quite like https://github.com/bigmlcom/histogram, especially if wrapped in a transducer.

henrygarner commented 7 years ago

@sbelak thanks for the link! I'm checking it out now and weighing up its features against the cost of introducing a Java-only dependency.

henrygarner commented 6 years ago

Closed by #15