Open Beliavsky opened 3 years ago
Thank you for opening this issue. Your propositions are indeed in the scope of stdlib, and IMO would be nice and useful additions (at least for my personal usage). Do all these functions rely on sorting algorithms? Or could some of them already be implemented (if yes, these could be nice first issues)?
Fast Computation of Trimmed Means by Beliakov has a fast algorithm for trimmed means, which is implemented in C and has an associated R package.
The interface of scipy.stats.trim_mean could be emulated:
scipy.stats.trim_mean(a, proportiontocut, axis=0)
Return mean of array after trimming distribution from both tails.
If proportiontocut = 0.1, slices off ‘leftmost’ and ‘rightmost’ 10% of scores. The input is sorted before slicing. Slices off less if proportion results in a non-integer slice index (i.e., conservatively slices off proportiontocut ).
Would such a (trivial) implementation suitable for you?
function trimmean_sp_2(x, prop, mask) result(res)
real, intent(in) :: x(:,:)
real, intent(in) :: prop
logical, intent(in), optional :: mask(:,:)
real :: res
integer :: n
integer :: cutoff
real, allocatable :: x_(:)
if(prop.gt.0.5)then
res = 0 !or NaN ?
return
endif
if(present(mask))then
n = count(mask)
x_ = pack(x, mask)
error stop
else
n = size(x)
x_ = reshape(x, [n])
endif
call sort(x_)
cutoff = int(n * prop)
res = mean(x_(1+cutoff : n-cutoff))
end function
The implementation with the dim
argument should be possible too.
Since sorting and I assume ranking will be part of stdlib, here is some related functionality to consider. In another issue I mentioned the median. The median is a special case of a trimmed mean, where you trim the top N and bottom N values for a vector with 2N+1 or 2N+2 elements and compute the mean of the remaining 1 or 2 values. Trimmed means are a commonly used robust statistic -- one computes the mean after removing top N and bottom N. So a function
could be created. In quantitative finance, insurance, and probably other domains, you often want to know what the mean of N largest or smallest observations were. A complement to the function above would be
Usually only one of nhigh and nlow would be specified by the user, but the case where both are specified can also be handled. In quantitative finance, the mean of the N lowest returns of a sample is an estimator of the Conditional Value at Risk.
The most commonly computed average is just the sample mean, which is in stdlib. Quantities such as the trimmed mean are less trivial to program and would perhaps provide a greater benefit by being in stdlib.
The descriptive statistics section has var to compute variance, which is the square of the standard deviation. A robust alternative to standard deviation to measure the spread of data is the interquartile range (IQR) (the difference of 75th and 25th percentiles). Having percentiles and IQR in stdlib would be nice. If you ask R to summarize 1000 normal variates, it says
So statisticians think percentile measures are as important as the standard deviation to understand spread.