Proposal for descriptive statistics

jvdp1 commented 4 years ago

Overview

It would be nice to have a module in stdlib that provides functions for computing means,variances, medians, ... of vectors, and of rows (columns) of 2D-arrays (at least). E.g.,

real :: res
real, allocatable :: res1(:)
real :: vector(5), mat(6,5)
...
res = mean(vector)
res1 = mean(mat) !returns the mean of each row of mat
res1 = mean(mat, dim = 2) !returns the mean of each column of mat

The same could be implemented for variance, median, ... So the API of all functions would be (almost) the same.

API

Let 's discuss the API of only mean for a vector first, and then for an array.

For a vector:

function mean_sp_sp(vector) result(res)
    real(sp), intent(in) :: vector(:)
    real(sp) :: res
end function

For a 2D array:

function mean_sp_sp(mat, dim) result(res)
    real(sp), intent(in) :: mat(:,:)
    integer, intent(in), optional :: dim !dim = 1 or dim = 2 
    real(sp), allocatable :: res(:)
end function

If dim = 1, it returns the mean of each row (so res(1:size(mat,1))). If dim = 2, it returns the mean of each column (so res(1:size(mat,2))).

Here (generated manually with fypp) is an example for mean in stdlib.

The same API could be used for variance, median, cumulative sum, geometric mean, ...

Should we support arrays of rank > 2? E.g., what would return mean(mat(:,:,:,:), dim =3)?

Should we use functions or subroutine (and overload `=`)?:

The result of the procedure would be of the same kind as the input, and (implicit) conversion would be performed by the user. Functions could then be used.

Alternatively: For real arrays, procedures would return a result of the same kind, or of a lower kind, of the argument (e.g., a mean of a dp array would return the result in sp or dp). All computations inside the procedure would be performed in the same kind as the input array, and the result would be converted just before the function returns the result. For integer arrays, procedures would return a result of a real kind (e.g., a mean of a int64 array would return the result in sp, dp, or qp). All computations inside the procedure would be performed in the same kind as the result.

Implementation

Probably most of us have some implementations. @leonfoks has also an implementation for 1D array on Github. I would think about a module called stdlib_experimental_stat.f90 and multiple submodules (one per stat, e.g., stdlib_experimental_stat_mean.f90, that contains all functions related with that stat). The first PR would contain only one stat, e.g. mean to facilitate the discussion.

Currently in `stdlib`

mean (mean) variance (var) central moment (moment)

Possible additional functions

standard deviation (std) median (median) mode (mode)

Others

covariance (cov) correlation (corr)

Other languages

Matlab Numpy Octave R

milancurcic commented 4 years ago

And the fix is to put an explicit cast to real(4) in the code, then it is clear to everybody.

Well, this is only a fix for clarity of the code, right?

If we really wanted to fix the possible loss of precision, shouldn't we use a real64 for n to accommodate very large arrays?

$ cat huge.f90 
use iso_fortran_env
print *, huge(1_int32), huge(1_int64)
print *, real(huge(1_int64), kind=real32)
print *, real(huge(1_int64), kind=real64)
end
$ gfortran -Wall huge.f90 && ./a.out 
huge.f90:3:14:

 print *, real(huge(1_int64), kind=real32)
              1
Warning: Change of value in conversion from ‘INTEGER(8)’ to ‘REAL(4)’ at (1) [-Wconversion]
huge.f90:4:14:

 print *, real(huge(1_int64), kind=real64)
              1
Warning: Change of value in conversion from ‘INTEGER(8)’ to ‘REAL(8)’ at (1) [-Wconversion]
  2147483647  9223372036854775807
   9.22337204E+18
   9.2233720368547758E+018

jvdp1 commented 4 years ago

If we really wanted to fix the possible loss of precision, shouldn't we use a real64 for n to accommodate very large arrays?

The loss of precision would appear at another stage, because the n is used as the denominator in the result of the function (that is real(int32) in this case), right? The kind=int64 in the intrinsic size is used to avoid issues with arrays of size that does not fit in int32 (which can be easily reached, especially when multiple dimensions are used).

certik commented 4 years ago

Well, this is only a fix for clarity of the code, right?

Yes. That we have thought about the issue and we "know what we are doing". That it is not an oversight.

wclodius2 commented 4 years ago

The conversion to real32 has a precision of 2-24, and so has a round off error of about 2-25. It is rare to have a precision this high for statistical measurements, for a standard deviation of 0.1% it would require about 230 measurements, i.e., (1/(2-25/2-10)2, but I suppose for some of the fundamental constants it would be important.

jvdp1 commented 4 years ago

So the last digit is now 4 instead of 0. But I assume this is such a common operation (32bit integer to 32bit real) that the compiler does not warn by default (you only lose "a little" of accuracy),

Indeed. Such operations are mentioned by gfortran with the flag -Wconversion-extra (and there are many of them in stdlib )

fortran-lang / stdlib