Open huonw opened 6 years ago
That seems plausible to me, although I think the current default mode will use an amount of memory proportional to the largest record, so whether I'm OK with this being the default or not probably depends on what the value of `small` is. Also, I think it would be good to note in the output somehow (perhaps in the header field) that it is an approximation.
`stats --everything` will buffer everything in memory, and thus it (and its implied flags) sensibly isn't on by default. However, those numbers are still useful for understanding a data set (especially cardinality, e.g. is a field an "enum": a small number of distinct values). There are algorithms such as HyperLogLog (for `--cardinality`) and Q-Digest (for `--median` and, possibly, `--mode`) that give approximate answers in a single pass and are mergeable (for parallel working), with O(n^small) (or better) memory use (where n is the size of the stream). Could `xsv stats` use these to provide additional insight into a data set?
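To make the single-pass, mergeable idea concrete, here is a minimal HyperLogLog sketch in Rust using only the standard library. It is an illustration under stated assumptions, not xsv's code: the `Hll` type, the register count `P`, and the choice of `DefaultHasher` are all hypothetical, and a real implementation would probably use a tuned crate and a faster hash. Q-Digest for the median is not shown.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Assumption for illustration: 2^12 = 4096 one-byte registers, which gives
// a standard error of roughly 1.04 / sqrt(4096), i.e. about 1.6%.
const P: u32 = 12;
const M: usize = 1 << P;

/// Hypothetical fixed-size cardinality sketch (HyperLogLog).
#[derive(Clone)]
pub struct Hll {
    registers: Vec<u8>,
}

impl Hll {
    pub fn new() -> Self {
        Hll { registers: vec![0; M] }
    }

    /// Record one value; memory use does not grow with the stream.
    pub fn insert<T: Hash + ?Sized>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let h = hasher.finish();
        // The top P bits select a register.
        let idx = (h >> (64 - P)) as usize;
        // The remaining bits, left-aligned; rank is the 1-based position
        // of their first set bit, capped when they are all zero.
        let rest = h << P;
        let rank = (rest.leading_zeros().min(64 - P) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    /// Merge a sketch built over another chunk of the data
    /// (register-wise maximum), e.g. from a parallel worker.
    pub fn merge(&mut self, other: &Hll) {
        for (a, b) in self.registers.iter_mut().zip(&other.registers) {
            *a = (*a).max(*b);
        }
    }

    /// Approximate number of distinct values inserted so far.
    pub fn estimate(&self) -> f64 {
        let m = M as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self
            .registers
            .iter()
            .map(|&r| 2f64.powi(-(r as i32)))
            .sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            // Small-range correction: fall back to linear counting.
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

With these parameters each field would need 4 KiB of registers regardless of how many rows are read, and per-chunk sketches can be combined with `merge` before calling `estimate`. Since the result is an estimate, it would fit the suggestion above to mark it as approximate in the output.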