Open jvdp1 opened 4 years ago
> And the fix is to put an explicit cast to `real(4)` in the code, then it is clear to everybody.

Well, this is only a fix for clarity of the code, right?
If we really wanted to fix the possible loss of precision, shouldn't we use a `real64` for `n` to accommodate very large arrays?
$ cat huge.f90
use iso_fortran_env
print *, huge(1_int32), huge(1_int64)
print *, real(huge(1_int64), kind=real32)
print *, real(huge(1_int64), kind=real64)
end
$ gfortran -Wall huge.f90 && ./a.out
huge.f90:3:14:
 print *, real(huge(1_int64), kind=real32)
              1
Warning: Change of value in conversion from ‘INTEGER(8)’ to ‘REAL(4)’ at (1) [-Wconversion]
huge.f90:4:14:
 print *, real(huge(1_int64), kind=real64)
              1
Warning: Change of value in conversion from ‘INTEGER(8)’ to ‘REAL(8)’ at (1) [-Wconversion]
2147483647 9223372036854775807
9.22337204E+18
9.2233720368547758E+018
> If we really wanted to fix the possible loss of precision, shouldn't we use a `real64` for `n` to accommodate very large arrays?

The loss of precision would appear at another stage, because the `n` is used as the denominator in the result of the function (that is real(int32) in this case), right?
The `kind=int64` in the intrinsic `size` is used to avoid issues with arrays whose size does not fit in `int32` (which can be easily reached, especially when multiple dimensions are used).
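To make the point about element counts concrete, here is a minimal sketch. The extents 46341 x 46341 are a hypothetical example (no array is allocated); they were chosen because their product is just above `huge(1_int32)`. The program name and the printed labels are illustrative only.

```fortran
program size_kind_sketch
  use iso_fortran_env, only: int32, int64, real32, real64
  implicit none
  integer(int64) :: n

  ! Hypothetical element count of a 46341 x 46341 array, computed in int64.
  n = 46341_int64 * 46341_int64   ! 2147488281

  ! The count is larger than the largest 32-bit integer (2147483647).
  print *, 'exceeds int32:      ', n > int(huge(1_int32), int64)       ! expected: T

  ! real32 has a 24-bit significand, so the conversion changes the value.
  print *, 'exact in real32:    ', int(real(n, real32), int64) == n    ! expected: F

  ! real64 has a 53-bit significand, so this count is represented exactly.
  print *, 'exact in real64:    ', int(real(n, real64), int64) == n    ! expected: T
end program size_kind_sketch
```

This only illustrates the arithmetic; whether `n` should be converted to `real32` or `real64` inside a given procedure is the design question discussed in the thread.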
> Well, this is only a fix for clarity of the code, right?

Yes. That we have thought about the issue and we "know what we are doing". That it is not an oversight.
The conversion to `real32` has a precision of $2^{-24}$, and so has a round-off error of about $2^{-25}$. It is rare to have a precision this high for statistical measurements: for a standard deviation of 0.1% it would require about $2^{30}$ measurements, i.e., $\left(1/(2^{-25}/2^{-10})\right)^{2}$, but I suppose for some of the fundamental constants it would be important.
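The estimate above can be spelled out step by step. This is a sketch under two assumptions that the comment does not state explicitly: that the relevant quantity is the standard error of the mean, $\sigma/\sqrt{N}$, and that 0.1% is approximated by $2^{-10}$.

```latex
\frac{\sigma}{\sqrt{N}} \approx 2^{-25}, \qquad \sigma \approx 2^{-10}
\;\Longrightarrow\;
N \approx \left(\frac{2^{-10}}{2^{-25}}\right)^{2}
  = \left(2^{15}\right)^{2}
  = 2^{30} \approx 1.07 \times 10^{9}
```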
> So the last digit is now 4 instead of 0. But I assume this is such a common operation (32bit integer to 32bit real) that the compiler does not warn by default (you only lose "a little" of accuracy),

Indeed. Such operations are reported by gfortran with the flag `-Wconversion-extra` (and there are many of them in `stdlib`).
Overview
It would be nice to have a module in `stdlib` that provides functions for computing means, variances, medians, ... of vectors, and of rows (columns) of 2D arrays (at least), e.g., for `mean`. The same could be implemented for variance, median, ..., so the API of all functions would be (almost) the same.
API
Let's discuss the API of only `mean`, for a vector first and then for an array.

For a vector:

For a 2D array:

If `dim = 1`, it returns the mean of each row (so `res(1:size(mat,1))`). If `dim = 2`, it returns the mean of each column (so `res(1:size(mat,2))`).

Here (generated manually with `fypp`) is an example for `mean` in `stdlib`.

The same API could be used for variance, median, cumulative sum, geometric mean, ...
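A minimal sketch of the `mean` API described above, for a vector and for a 2D array with a `dim` argument. The module and procedure names (`mean_sketch`, `mean_vector`, `mean_matrix`) are hypothetical, only `real64` input is covered, and `dim` is a required argument here (the proposal may have intended otherwise). It is not the `stdlib` implementation.

```fortran
module mean_sketch
  use iso_fortran_env, only: int64, real64
  implicit none
  private
  public :: mean

  ! Generic name; the specific procedure is selected by the rank of the argument.
  interface mean
    module procedure mean_vector
    module procedure mean_matrix
  end interface mean

contains

  ! Mean of all elements of a vector.
  pure function mean_vector(vector) result(res)
    real(real64), intent(in) :: vector(:)
    real(real64) :: res
    res = sum(vector) / real(size(vector, kind=int64), real64)
  end function mean_vector

  ! Mean of each row (dim = 1) or of each column (dim = 2), as worded in the text.
  ! Assumption: the caller passes dim = 1 or dim = 2; no validation is done.
  pure function mean_matrix(mat, dim) result(res)
    real(real64), intent(in) :: mat(:,:)
    integer, intent(in) :: dim
    real(real64) :: res(size(mat, dim))
    if (dim == 1) then
      ! One value per row: reduce over the second dimension.
      res = sum(mat, dim=2) / real(size(mat, 2, kind=int64), real64)
    else
      ! One value per column: reduce over the first dimension.
      res = sum(mat, dim=1) / real(size(mat, 1, kind=int64), real64)
    end if
  end function mean_matrix

end module mean_sketch

program mean_demo
  use iso_fortran_env, only: real64
  use mean_sketch, only: mean
  implicit none
  real(real64) :: mat(2, 3)

  ! Column-major fill: the rows are (1, 3, 5) and (2, 4, 6).
  mat = reshape(real([1, 2, 3, 4, 5, 6], real64), [2, 3])

  print *, mean(real([1, 2, 3], real64))   ! 2.0
  print *, mean(mat, dim=1)                ! one value per row:    3.0, 4.0
  print *, mean(mat, dim=2)                ! one value per column: 1.5, 3.5, 5.5
end program mean_demo
```

The sketch follows the row/column wording of the proposal. The intrinsic `sum(mat, dim=1)` instead reduces along the first dimension and returns `size(mat,2)` values, which is why the sketch calls `sum` with the other `dim`.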
- Should we support arrays of rank > 2? E.g., what would `mean(mat(:,:,:,:), dim=3)` return?
- Should we use functions or subroutines (and overload `=`)?
  - The result of the procedure would be of the same kind as the input, and (implicit) conversion would be performed by the user. Functions could then be used.
  - Alternatively: For real arrays, procedures would return a result of the same kind as the argument, or of a lower kind (e.g., a mean of a `dp` array would return the result in `sp` or `dp`). All computations inside the procedure would be performed in the same kind as the input array, and the result would be converted just before the function returns. For integer arrays, procedures would return a result of a real kind (e.g., a mean of an `int64` array would return the result in `sp`, `dp`, or `qp`). All computations inside the procedure would be performed in the same kind as the result.

Implementation
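As one possible reading of the integer-array alternative discussed above, here is a minimal sketch of a mean of an `int64` vector returned in `dp`, with every operation carried out in the kind of the result. The names `mean_int_sketch` and `mean_int64_dp` are hypothetical, and `dp` is assumed to mean `real64`.

```fortran
module mean_int_sketch
  use iso_fortran_env, only: int64, real64
  implicit none
  private
  public :: mean_int64_dp

  integer, parameter :: dp = real64   ! assumed meaning of "dp" in the text

contains

  ! Mean of an int64 vector, returned in dp. The elements are converted to dp
  ! before summing, so the sum and the division are both done in dp.
  pure function mean_int64_dp(x) result(res)
    integer(int64), intent(in) :: x(:)
    real(dp) :: res
    res = sum(real(x, dp)) / real(size(x, kind=int64), dp)
  end function mean_int64_dp

end module mean_int_sketch

program mean_int_demo
  use iso_fortran_env, only: int64
  use mean_int_sketch, only: mean_int64_dp
  implicit none
  integer(int64) :: x(2)

  ! Two values whose sum would not fit in int64 if it were accumulated as an integer.
  x = huge(1_int64)
  print *, mean_int64_dp(x)   ! finite, about 9.22E+18
end program mean_int_demo
```

Converting before summing is one design choice among several; it avoids integer overflow in the accumulation at the cost of rounding each element to `dp`.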
Probably most of us have some implementations. @leonfoks also has an implementation for 1D arrays on GitHub. I would think about a module called `stdlib_experimental_stat.f90` and multiple submodules (one per stat, e.g., `stdlib_experimental_stat_mean.f90`, that contains all functions related to that stat). The first PR would contain only one stat, e.g. `mean`, to facilitate the discussion.

Currently in `stdlib`
- mean (`mean`)
- variance (`var`)
- central moment (`moment`)

Possible additional functions
- standard deviation (`std`)
- median (`median`)
- mode (`mode`)

Others
- covariance (`cov`)
- correlation (`corr`)

Other languages
- Matlab
- Numpy
- Octave
- R