Closed sbelak closed 6 years ago
Hey Simon, it's quite a coincidence you chose to get in touch because I've been wondering about this only today. Thanks for offering to help make kixi.stats better!
I'm thinking of the median as a convenience function on a more general histogram reducer which, as you say, could be backed by any of a number of implementations depending on the desired tradeoffs. This would make including other summary statistics such as the IQR trivial.
One option with configurable accuracy is the t-digest. There are existing Java and JavaScript implementations, but the JavaScript implementation requires a few transitive dependencies so I thought I'd see if a Clojure implementation would keep things simple. I've begun to sketch out a generic Clojure(Script) implementation here, but it's just a prototype at the moment.
I envisage the API being something like:
(transduce (map :x) (kixi/median kixi/t-digest {:delta 0.01}) data)
where (kixi/median histogram opts) is a convenience for (redux/post-complete (histogram opts) #(percentile % 50)), and with a sensible default histogram:
(transduce (map :x) (kixi/median) data)
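To make the shape of that proposal concrete, here is a minimal, self-contained sketch of the "median as post-complete over a histogram" idea. Everything in it is an assumption for illustration: exact-histogram is a stand-in that simply keeps every value (it is not a t-digest), percentile uses the nearest-rank definition, and post-complete is a hand-rolled equivalent of the redux helper so the snippet has no dependencies. None of these names are actual kixi.stats API.

```clojure
;; Hypothetical sketch only; these names are not part of kixi.stats.

(defn exact-histogram
  "Stand-in histogram: returns a reducing function that keeps every value.
  Exact, but O(n) memory. `_opts` is accepted only to mirror (histogram opts)."
  [_opts]
  (fn
    ([] [])                      ; init: empty vector
    ([acc] (vec (sort acc)))     ; completion: sorted vector
    ([acc x] (conj acc x))))     ; step: collect the value

(defn percentile
  "Nearest-rank percentile p (0-100) of a sorted vector; nil when empty."
  [sorted p]
  (when (seq sorted)
    (let [n    (count sorted)
          rank (long (Math/ceil (* (/ p 100.0) n)))]
      (nth sorted (max 0 (dec rank))))))

(defn post-complete
  "Wrap reducing function rf so f is applied to its completed result."
  [rf f]
  (fn
    ([] (rf))
    ([acc] (f (rf acc)))
    ([acc x] (rf acc x))))

(defn median
  "Median as a convenience over any histogram constructor."
  ([] (median exact-histogram {}))
  ([histogram opts]
   (post-complete (histogram opts) #(percentile % 50))))

(transduce (map :x) (median) [{:x 5} {:x 1} {:x 3}])
;; => 3
```

With an even number of inputs the nearest-rank rule returns the lower of the two middle values; an implementation that interpolates would differ there. Swapping in an approximate histogram would only require another constructor with the same three arities.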
Would this meet your needs? Would you do anything differently or would you like to see an alternative histogram implementation prioritised?
Thanks, Henry
You are right, I was looking at the problem too narrowly. When it comes to histograms, I quite like https://github.com/bigmlcom/histogram, especially if wrapped in a transducer.
@sbelak thanks for the link! I'm checking it out now and weighing up its features against introducing a Java-only dependency.
Closed by #15
Median is the only function I still turn to Incanter for and I would like to stop doing that (and contribute to a library I use daily). Calculating median efficiently is slightly tricky though -- have you given any thought to which guarantees/compromises you want to make with kixi? From what I've gathered, to calculate median we need to either accept an approximate result, have the entire seq in memory at some point during computation, and/or use a specialized version for ints (or just have all 3 versions: median, median-int, and median-approx).
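As a rough illustration of the specialized-int option, the sketch below keeps a frequency map instead of the whole seq, so memory grows with the number of distinct values, not the number of inputs. The name median-int comes from the list above, but this body is only an assumed implementation, and it returns the lower median when the count is even.

```clojure
;; Hypothetical sketch of an exact integer median as a reducing function.

(defn median-int
  ([] {})                                   ; init: value -> count
  ([freqs]                                  ; completion: walk counts in key order
   (let [n      (reduce + (vals freqs))
         target (quot (inc n) 2)]           ; 1-based rank of the lower median
     (loop [[[v c] & more] (sort-by key freqs)
            seen 0]
       (when (some? v)
         (if (>= (+ seen c) target)
           v
           (recur more (+ seen c)))))))
  ([freqs x] (update freqs x (fnil inc 0)))) ; step: bump the count for x

(transduce identity median-int [5 1 3 3])
;; => 3
```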