BurntSushi / xsv

A fast CSV command line toolkit written in Rust.
The Unlicense
10.4k stars, 324 forks

Add MAD measure #116

Open holtgrewe opened 6 years ago

holtgrewe commented 6 years ago

The MAD is a robust alternative to the standard deviation. It would be nice to have alongside stddev.

BurntSushi commented 6 years ago

Sorry, but this feature request is incomplete. Please:

xsv cannot be in the business of adding every statistical measure, so each one needs to be vetted individually. The stats computed today are ubiquitous. MAD is not.

holtgrewe commented 6 years ago

The median absolute deviation (cf. Wikipedia) is a robust alternative to the standard deviation for measuring the variability of a sample. In spirit, it is comparable to the median.

Where the arithmetic mean is the sum of the sample values divided by the sample count, the median is the value with the "center rank". As a result, the median is more robust to outliers (the typical example is the mean net worth of a room of 100 people when one of them is Bill Gates).

Similarly, the standard deviation is based on the differences between the sample values and the mean (again, outliers such as Bill Gates' net worth will greatly skew the value). In comparison, the median absolute deviation is computed by taking the list of absolute differences between the sample values and the median, sorting them, and then taking the center rank value.
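To make the computation described above concrete, here is a minimal sketch in Rust. It is not taken from xsv and does not reflect how xsv computes its statistics. The function names `median`, `mad`, and `stddev` are hypothetical, and the sketch assumes the whole column fits in memory and contains no NaN values.

```rust
/// Median of a slice, or None for empty input.
/// For an even count, this sketch averages the two center values.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    // total_cmp gives a total order on f64; NaN handling is out of scope here.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    } else {
        Some(sorted[mid])
    }
}

/// Median absolute deviation: the median of |x - median(x)|.
fn mad(values: &[f64]) -> Option<f64> {
    let center = median(values)?;
    let deviations: Vec<f64> = values.iter().map(|x| (x - center).abs()).collect();
    median(&deviations)
}

/// Population standard deviation, included only for comparison.
fn stddev(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let variance = values.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    Some(variance.sqrt())
}
```

For the sample 1, 2, 3, 4, 100, the median is 3 and the absolute deviations are 2, 1, 0, 1, 97, so `mad` returns 1. The population standard deviation of the same sample is about 39, driven almost entirely by the single outlier.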

E.g., in quantitative biology, one use case is robustness against outliers in microarray analysis, for example outliers stemming from artifacts. One might want a measure of the variability of intensity measurements. You can think of this as a grayscale picture in which each pixel has an intensity between 0.0 and 1.0. Some pixels might be set close to 1.0 because of technical artifacts, while the overall level is around 0.1. Here, the MAD would describe the variability of the "majority" of the pixels, similar to the median robustly describing "an average pixel".

Of course, one alternative would be to trim the data by cutting away the top and bottom 10%, but that argument could also be made against the median.

In terms of being ubiquitous, I would offer

What do you think?