Streaming approximations to stats --everything via estimation algorithms

BurntSushi / xsv

A fast CSV command line toolkit written in Rust.

The Unlicense

stats --everything buffers the whole input in memory, so it (and the flags it implies) is sensibly off by default. However, those numbers are still useful for understanding a data set, especially cardinality (e.g. is a field an "enum", with only a small number of distinct values?). There are algorithms such as HyperLogLog (for --cardinality) and Q-Digest (for --median and, possibly, --mode) that give approximate answers in a single pass, are mergeable (so they suit parallel processing), and use sublinear memory (O(n^small) or better, where n is the size of the stream). Could xsv stats use these to provide additional insight into a data set?
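To make the cardinality idea concrete, here is a minimal HyperLogLog sketch in Rust using only the standard library. It is an illustration under stated assumptions, not a proposal for how xsv should implement it: the `HyperLogLog` type, its methods, and the `approx_cardinality` helper are hypothetical names, and the precision of 14 bits and the use of `DefaultHasher` are arbitrary choices. It uses the classic raw estimator with the small-range (linear counting) correction and omits later bias-correction refinements.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical HyperLogLog sketch (not part of xsv).
/// Memory use is fixed at 2^p one-byte registers, regardless of stream size.
pub struct HyperLogLog {
    p: u32,
    registers: Vec<u8>,
}

impl HyperLogLog {
    /// `p` is the number of hash bits used to pick a register (assumed range 4..=16).
    pub fn new(p: u32) -> Self {
        assert!((4..=16).contains(&p));
        HyperLogLog { p, registers: vec![0; 1 << p] }
    }

    /// Record one value from the stream.
    pub fn add<T: Hash + ?Sized>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let h = hasher.finish();
        // The top p bits select the register.
        let idx = (h >> (64 - self.p)) as usize;
        // Rank = position of the first 1 bit in the remaining 64 - p bits.
        let rest = h << self.p;
        let rank = (rest.leading_zeros().min(64 - self.p) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    /// Merge another sketch built with the same precision (e.g. from another thread).
    pub fn merge(&mut self, other: &HyperLogLog) {
        assert_eq!(self.p, other.p);
        for (a, b) in self.registers.iter_mut().zip(&other.registers) {
            *a = (*a).max(*b);
        }
    }

    /// Approximate number of distinct values added so far.
    pub fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let alpha = match self.registers.len() {
            16 => 0.673,
            32 => 0.697,
            64 => 0.709,
            _ => 0.7213 / (1.0 + 1.079 / m),
        };
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            // Small-range correction: fall back to linear counting.
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}

/// Hypothetical helper: approximate distinct count of one column's values.
pub fn approx_cardinality<'a, I: IntoIterator<Item = &'a str>>(values: I) -> f64 {
    let mut hll = HyperLogLog::new(14); // 16384 registers, about 16 KiB
    for v in values {
        hll.add(v);
    }
    hll.estimate()
}
```

With p = 14 the expected relative error is roughly 1.04 / sqrt(2^14), a little under 1%. Because `merge` is a per-register maximum, sketches built over separate chunks of a file combine into the same sketch a single pass would have produced, which is the property that makes this usable for parallel processing.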

Streaming approximations to stats --everything via estimation algorithms #144