acowley / Frames

Data frames for tabular data.

"Add a summarize class" - foldl-statistics? #88

Open axman6 opened 7 years ago

axman6 commented 7 years ago

I was having a look at the Frames-notes.org file, and saw you're after some summary functionality for Frames. I created the foldl-statistics package for just this sort of use case: with the fastLMVSK Fold, you can compute the length, mean, variance, skewness and kurtosis of a numeric column (provided the values can be converted to Doubles) in one pass.
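To make the "one pass" idea concrete, here is a minimal sketch that uses only base. It is not the foldl-statistics implementation: it swaps in Welford's online update, covers only count, mean, and population variance (plus min and max), and leaves out skewness and kurtosis. The names `Summary`, `summarize`, and `variance` are hypothetical, chosen for illustration.

```haskell
import Data.List (foldl')

-- | Running summary of a numeric column.
-- Hypothetical type for illustration; not part of foldl-statistics or Frames.
data Summary = Summary
  { sCount :: !Int     -- number of values seen so far
  , sMean  :: !Double  -- running mean
  , sM2    :: !Double  -- running sum of squared deviations from the mean
  , sMin   :: !Double  -- smallest value seen (+Infinity when empty)
  , sMax   :: !Double  -- largest value seen (-Infinity when empty)
  } deriving Show

emptySummary :: Summary
emptySummary = Summary 0 0 0 (1 / 0) (-1 / 0)

-- | One step of Welford's online algorithm, extended with min and max.
step :: Summary -> Double -> Summary
step (Summary n mean m2 lo hi) x =
  let n'    = n + 1
      delta = x - mean
      mean' = mean + delta / fromIntegral n'
      m2'   = m2 + delta * (x - mean')
  in Summary n' mean' m2' (min lo x) (max hi x)

-- | Summarize a column in a single left-to-right pass.
summarize :: [Double] -> Summary
summarize = foldl' step emptySummary

-- | Population variance; Nothing for an empty column.
variance :: Summary -> Maybe Double
variance s
  | sCount s == 0 = Nothing
  | otherwise     = Just (sM2 s / fromIntegral (sCount s))
```

Because the whole state lives in one strict record, the same `step` function and `emptySummary` value could presumably be wrapped in a `Control.Foldl.Fold` and combined with other folds, which is the shape the package already provides.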

min and max are also provided. median is much harder, since an exact one requires storing every value in the column. The same goes for counting unique values, though both have approximations that might be suitable.

acowley commented 7 years ago

Hey, that sounds perfect!

I think the reason that effort stalled is some form of decision fatigue. For Text columns, it would be nice to detect if every row has one of a small set of values (corresponding to something like an enumerated type). We can do that by folding a Maybe (Set Text) down the column, turning it to Nothing when its size crosses some threshold. But we need to pick a default threshold.
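The bounded-set fold described above can be sketched as follows, assuming only base and containers. The threshold is left as a parameter rather than given a default, since that choice is exactly the open question. The function is written for any `Ord` element so the same code would apply to a Text or an Int column; `stepDistinct` and `smallDomain` are hypothetical names.

```haskell
import Data.List (foldl')
import Data.Set (Set)
import qualified Data.Set as Set

-- | Add one value to the set of distinct values seen so far.
-- Once more than `limit` distinct values have appeared, give up and
-- stay at Nothing for the rest of the column.
stepDistinct :: Ord a => Int -> Maybe (Set a) -> a -> Maybe (Set a)
stepDistinct _     Nothing  _ = Nothing
stepDistinct limit (Just s) x =
  let s' = Set.insert x s
  in if Set.size s' > limit then Nothing else Just s'

-- | Fold a column down to its set of distinct values, or Nothing if the
-- column has more than `limit` distinct values.
smallDomain :: Ord a => Int -> [a] -> Maybe (Set a)
smallDomain limit = foldl' (stepDistinct limit) (Just Set.empty)
```

For example, `smallDomain 3 ["a", "b", "a", "c"]` yields a `Just` holding the three distinct values, while `smallDomain 2 ["a", "b", "c"]` yields `Nothing`. The fold does not stop early once it reaches `Nothing`, but every remaining step is constant time, so memory stays bounded by the threshold.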

For Bool columns, maybe it makes sense to track the ratio of True to False values. For Int, we could compute statistics as we would for Double, or we could treat them as potentially coming from a small set where the Ints are some kind of tag. In that case, we could do something like the approach outlined for Text, where we fold a Maybe (Set Int) down the column.
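For the Bool case, a minimal sketch using only base might count both values in one pass and derive the ratio afterwards. This version reports the fraction of True values among all rows rather than a literal True-to-False ratio, to avoid dividing by zero when a column has no False values; that substitution, and the names `BoolCounts`, `countBools`, and `trueFraction`, are assumptions made for illustration.

```haskell
import Data.List (foldl')

-- | Running counts for a Bool column (hypothetical type for illustration).
data BoolCounts = BoolCounts
  { trues  :: !Int  -- number of True values seen
  , falses :: !Int  -- number of False values seen
  } deriving (Eq, Show)

-- | Count True and False values in a single pass.
countBools :: [Bool] -> BoolCounts
countBools = foldl' step (BoolCounts 0 0)
  where
    step (BoolCounts t f) b
      | b         = BoolCounts (t + 1) f
      | otherwise = BoolCounts t (f + 1)

-- | Fraction of rows that are True; Nothing for an empty column.
trueFraction :: BoolCounts -> Maybe Double
trueFraction (BoolCounts t f)
  | t + f == 0 = Nothing
  | otherwise  = Just (fromIntegral t / fromIntegral (t + f))
```

Keeping the raw counts rather than a running ratio means the summary can still report either form at the end.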

These things are definitely doable, but they will require a bit of a push to wire together. The fastLMVSK fold you suggest sounds like a great fit for Double.