BurntSushi / xsv

A fast CSV command line toolkit written in Rust.
The Unlicense
10.28k stars 316 forks

Streaming approximations to stats --everything via estimation algorithms #144

Open huonw opened 5 years ago

huonw commented 5 years ago

stats --everything buffers the entire input in memory, so it (and the flags it implies) is sensibly not on by default. However, those numbers are still useful for understanding a data set, especially cardinality (e.g. is a field an "enum", with only a small number of distinct values?). There are algorithms such as HyperLogLog (for --cardinality) and Q-Digest (for --median and, possibly, --mode) that give approximate answers in a single pass and are mergeable (so they work with parallel processing), with O(n^small) (or better) memory use, where n is the size of the stream. Could xsv stats use these to provide additional insight into a data set?
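To make the cardinality idea concrete, here is a minimal HyperLogLog sketch in Rust. It is an illustration only, not code from xsv: the names Hll and P are hypothetical, and it assumes the standard library's DefaultHasher is an acceptable hash. The algorithm behind DefaultHasher is unspecified and may change between Rust releases, so a real implementation would likely pin a specific hash function. With 2^12 one-byte registers the sketch uses 4 KiB per column, and the usual standard error of about 1.04 / sqrt(4096), roughly 1.6%, applies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical parameters: 2^P registers of one byte each (4 KiB total).
const P: u32 = 12;
const M: usize = 1 << P;

/// Minimal HyperLogLog cardinality sketch (illustrative, not xsv code).
#[derive(Clone, PartialEq, Debug)]
pub struct Hll {
    registers: Vec<u8>,
}

impl Hll {
    pub fn new() -> Self {
        Hll { registers: vec![0; M] }
    }

    /// Record one field value; duplicates leave the sketch unchanged.
    pub fn insert<T: Hash + ?Sized>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let x = hasher.finish();
        // The top P bits select a register.
        let idx = (x >> (64 - P)) as usize;
        // The remaining bits, left-aligned; rank is the 1-based position
        // of the first set bit, capped when all remaining bits are zero.
        let rest = x << P;
        let rank = (rest.leading_zeros().min(64 - P) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    /// Merge another sketch into this one (register-wise maximum), which is
    /// what makes per-chunk sketches combinable after parallel processing.
    pub fn merge(&mut self, other: &Hll) {
        for (a, b) in self.registers.iter_mut().zip(&other.registers) {
            *a = (*a).max(*b);
        }
    }

    /// Approximate number of distinct values inserted so far.
    pub fn estimate(&self) -> f64 {
        let m = M as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self
            .registers
            .iter()
            .map(|&r| 2f64.powi(-(r as i32)))
            .sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            // Small-range correction: fall back to linear counting.
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

Each column would keep one such sketch, call insert on every field it sees, and report estimate() at the end. Sketches built over separate chunks of the input can be combined with merge, and the result is identical to a sketch built over the whole input in one pass.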

BurntSushi commented 5 years ago

That seems plausible to me, although I think the current default mode will use an amount of memory proportional to the largest record, so whether I'm OK with this being the default or not probably depends on what the value of small is. Also, I think it would be good to note in the output somehow (perhaps in the header field) that it is an approximation.