fortran-lang / stdlib

Fortran Standard Library
https://stdlib.fortran-lang.org
MIT License
1.06k stars 164 forks source link

Robust statistics: trimmed mean and interquartile range #379

Open Beliavsky opened 3 years ago

Beliavsky commented 3 years ago

Since sorting and I assume ranking will be part of stdlib, here is some related functionality to consider. In another issue I mentioned the median. The median is a special case of a trimmed mean, where you trim the top N and bottom N values for a vector with 2N+1 or 2N+2 elements and compute the mean of the remaining 1 or 2 values. Trimmed means are a commonly used robust statistic -- one computes the mean after removing top N and bottom N. So a function

function trimmed_mean(x,ntrim_high,ntrim_low) result(xmean)
real, intent(in) :: x
integer, intent(in) :: ntrim_high,ntrim_low
real :: xmean ! mean after removing ntrim_high and ntrim_low values
end function trimmed_mean

could be created. In quantitative finance, insurance, and probably other domains, you often want to know what the mean of N largest or smallest observations were. A complement to the function above would be

function extreme_mean(x,nhigh,nlow) result(xmean)
real, intent(in) :: x
integer, intent(in), optional :: nhigh,nlow
real :: xmean ! mean of nhigh and nlow values
end function extreme_mean

Usually only one of nhigh and nlow would be specified by the user, but the case where both are specified can also be handled. In quantitative finance, the mean of the N lowest returns of a sample is an estimator of the Conditional Value at Risk.

The most commonly computed average is just the sample mean, which is in stdlib. Quantities such as the trimmed mean are less trivial to program and would perhaps provide a greater benefit by being in stdlib.

The descriptive statistics section has var to compute variance, which is the square of the standard deviation. A robust alternative to standard deviation to measure the spread of data is the interquartile range (IQR) (the difference of 75th and 25th percentiles). Having percentiles and IQR in stdlib would be nice. If you ask R to summarize 1000 normal variates, it says

> x = rnorm(1000)
> summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.84938 -0.65930  0.06237  0.06395  0.75097  3.22734 

So statisticians think percentile measures are as important as the standard deviation to understand spread.

jvdp1 commented 3 years ago

Thank you for opening this issue. Your propositions are indeed in the scope of stdlib, and IMO would be nice and useful additions (at least for my personal usage). Do all these functions rely on sorting algorithms? Or could some of them already be implemented (if yes, these could be nice first issues)?

Beliavsky commented 1 year ago

Fast Computation of Trimmed Means by Beliakov has a fast algorithm for trimmed means, which is implemented in C and has an associated R package.

Beliavsky commented 12 months ago

The interface of scipy.stats.trim_mean could be emulated:

scipy.stats.trim_mean(a, proportiontocut, axis=0)

Return mean of array after trimming distribution from both tails.

If proportiontocut = 0.1, slices off ‘leftmost’ and ‘rightmost’ 10% of scores. The input is sorted before slicing. Slices off less if proportion results in a non-integer slice index (i.e., conservatively slices off proportiontocut ).

jvdp1 commented 11 months ago

Would such a (trivial) implementation suitable for you?

function trimmean_sp_2(x, prop, mask) result(res)
 real, intent(in) :: x(:,:)
 real, intent(in) :: prop
 logical, intent(in), optional :: mask(:,:)
 real :: res

 integer :: n
 integer :: cutoff
 real, allocatable :: x_(:)

 if(prop.gt.0.5)then
  res = 0 !or NaN ?
  return
 endif

 if(present(mask))then
  n = count(mask)
  x_ = pack(x, mask)
  error stop
 else
  n = size(x)
  x_ = reshape(x, [n])
 endif

 call sort(x_)

 cutoff = int(n * prop)

 res = mean(x_(1+cutoff : n-cutoff))

end function

The implementation with the dim argument should be possible too.