Open jvdp1 opened 4 years ago
Let's talk about the API of mean, as you suggested. This looks good:
function mean_sp_sp(vector) result(res)
real(sp), intent(in) :: vector(:)
real(sp) :: res
end function
I assume the user-facing function will be just mean.
Regarding the 2D version:
function mean_sp_sp(mat, dim) result(res)
real(sp), intent(in) :: mat(:,:)
integer, intent(in), optional :: dim !dim = 1 or dim = 2
real(sp), allocatable :: res(:)
end function
I would not use an allocatable result. For speed reasons, the lowest-level API should assume the user knows the length.
Regarding the dim parameter name, we have to be careful of its meaning. Let's first talk about the intrinsic sum function that's already available and try to understand how that works:
- Fortran sum: dim is an integer that excludes that dimension from the sum
- NumPy sum: axis is either an integer or an array of integers; in both cases it includes those dimensions in the sum
- Matlab sum: dim is an integer that includes the dimension in the sum
So NumPy's axis as well as Matlab's dim seem to be equivalent. Fortran, on the other hand, chose the opposite convention of using dim to exclude a dimension.
Your mean above uses Matlab's convention for mean. NumPy seems to be using a different convention. Not sure which one to choose here.
For me, the Matlab and Fortran dim option for sum behave in the same way (e.g., sum(mat(1:4,1:2), dim=2) will return an array of dimension 4). However, the default behaviour is different between Fortran (returns the sum of the whole array) and Matlab (returns the sum for dim = 1). After checking, my proposition does not follow these rules of dim.
So I would propose to use the same behaviour as Fortran sum (with the same name dim). It would also be in agreement with the function size and its behaviour with dim.
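A quick illustration of the Fortran behaviour (a hypothetical 4x2 array a):
program check_sum_dim
    implicit none
    real :: a(4,2)
    a = reshape([1., 2., 3., 4., 5., 6., 7., 8.], [4, 2])
    print *, sum(a)         ! whole-array sum (the Fortran default): 36.
    print *, sum(a, dim=1)  ! reduces dimension 1 -> 2 values: 10. 26.
    print *, sum(a, dim=2)  ! reduces dimension 2 -> 4 values: 6. 8. 10. 12.
end program check_sum_dim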
I would not use allocatable result. For speed reasons, the lowest level API should assume the user knows the length.
I agree for speed, but I think it would be very inconvenient. Should the API be something like this:
function mean_sp_sp(mat, n, dim) result(res)
    real(sp), intent(in) :: mat(:,:)
    integer, intent(in) :: n
    integer, intent(in), optional :: dim !dim = 1 or dim = 2
    real(sp) :: res(n)
end function
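A hypothetical call site (fragment only, not runnable as-is) showing the inconvenience: the caller has to work out and pass the result length, size(mat)/size(mat, dim), and declare the result with a matching size:
! Fragment with assumed declarations: mat is 4x2, so reducing along dim=2
! leaves a result of length 4 that the caller must size correctly.
real(sp) :: mat(4,2), avg(4)
avg = mean(mat, size(mat)/size(mat, 2), dim=2)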
Regarding using an allocatable result, I think we can adopt a rule of thumb not to use allocatable unless necessary to make it work, but the priority should be ease of use. If an assumed-size result requires a more complex API, we should use an allocatable result.
Fortran sum: the dim is an integer that excludes that dimension from the sum
I'm not sure what you mean by "includes" or "excludes", but in case of Fortran, the sum is performed along dim:
real :: a(2,3,4) = 1
print *, shape(sum(a, 2))
end
outputs:
2 4
This behavior is both Fortrannic and intuitive. It's consistent with numpy; I don't know about Matlab. It seems to me to be the only reasonable behavior.
A notable difference between Fortran's sum(x, dim) and Python's np.sum(x, axis) (I don't know about Matlab) is that Fortran always reduces the rank by 1 (dim must be a scalar). With numpy, axis can be a list or a tuple to perform reduction along multiple axes.
stdlib's generic mean could accept both a scalar or a rank-1 array for dim:
- If scalar, it behaves the same way as for sum and others;
- If array, it invokes the scalar version iteratively to reduce multiple dimensions in a single call (however, perhaps this precludes an assumed-size result).
A minor nit-pick about the API: I suggest that we don't insinuate a vector/matrix as input, as they're linear algebra-specific and imply rank in the name. Instead, I suggest using a more general name, considering we should support any number of dimensions. For arrays I just like plain x:
function mean_sp_sp(x) result(res)
real(sp), intent(in) :: x(:)
real(sp) :: res
end function
For a 2-d array, passing the result size as an input is an unacceptable API in my opinion. Is there any other way we could do assumed-size result for reducing a 2-d (or multi-d) array?
Another important point of discussion: should the mean of an integer array be an integer or a real?
I believe the mean of an integer array should be dp. The result can be converted if needed. NumPy returns a float from mean(int array).
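A toy illustration of what is lost with an integer result (a hypothetical two-element array):
program int_mean_motivation
    implicit none
    integer :: v(2) = [1, 2]
    ! With an integer result, the fractional part is truncated: 3/2 = 1.
    print *, sum(v) / size(v)
    ! With a dp result, the mean is 1.5 as expected.
    print *, real(sum(v), kind(1.0d0)) / size(v)
end program int_mean_motivation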
A notable difference between Fortran's sum(x, dim) and Python's np.sum(x, axis) (I don't know about Matlab) is that Fortran always reduces the rank by 1 (dim must be a scalar). With numpy, axis can be a list or a tuple to perform reduction along multiple axes. stdlib's generic mean could accept both a scalar or a rank-1 array for dim:
- If scalar, it behaves the same way as for sum and others;
- If array, it invokes the scalar version iteratively to reduce multiple dimensions in a single call (however, perhaps this precludes an assumed-size result).
I would suggest implementing the same behaviour as Fortran sum first. We can see how to implement the Python behaviour later.
A minor nit-pick about the API: I suggest that we don't insinuate a vector/matrix as input, as they're linear algebra-specific and imply rank in the name. Instead, I suggest using a more general name, considering we should support any number of dimensions. For arrays I just like plain x:
function mean_sp_sp(x) result(res)
    real(sp), intent(in) :: x(:)
    real(sp) :: res
end function
Good to know.
For a 2-d array, passing the result size as an input is an unacceptable API in my opinion. Is there any other way we could do assumed-size result for reducing a 2-d (or multi-d) array?
I agree. Below is a possible workaround for 2D arrays (I just tried it; this implementation gives the same behaviour as Fortran sum):
interface mean
    module function mean_1_sp_sp(x) result(res)
        real(sp), intent(in) :: x(:)
        real(sp) :: res
    end function
    ...
    module function mean_2_all_sp_sp(x) result(res)
        real(sp), intent(in) :: x(:,:)
        real(sp) :: res
    end function mean_2_all_sp_sp
    ...
    module function mean_2_sp_sp(x, dim) result(res)
        real(sp), intent(in) :: x(:,:)
        integer, intent(in) :: dim
        real(sp) :: res(size(x)/size(x, dim))
    end function mean_2_sp_sp
    ...
end interface
Another important point of discussion: Should a mean of an integer array be an integer or a real?
IMHO, it should always be a real (for both integer and real arrays).
Minor comment: I think this last iteration is better than using an optional dim argument. Let the interface handle the lookup of function names. I believe optional arguments incur a runtime cost, thus leading to a slowdown?
I don't think we can use optional because the result is a scalar if dim is provided and an array otherwise, and we need to specify that at compile time.
@milancurcic this interface doesn't use optional (nor allocatable); the result is a scalar if dim is NOT provided, and an array otherwise; the interface handles that:
interface mean
    module function mean_1_sp_sp(x) result(res)
        real(sp), intent(in) :: x(:)
        real(sp) :: res
    end function
    ...
    module function mean_2_all_sp_sp(x) result(res)
        real(sp), intent(in) :: x(:,:)
        real(sp) :: res
    end function mean_2_all_sp_sp
    ...
    module function mean_2_sp_sp(x, dim) result(res)
        real(sp), intent(in) :: x(:,:)
        integer, intent(in) :: dim
        real(sp) :: res(size(x)/size(x, dim))
    end function mean_2_sp_sp
    ...
end interface
This has the same behaviour as Fortran sum for 1D and 2D arrays.
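For reference, the implementation behind that interface could be as simple as the following sketch (names assumed from the interface above and from the module layout proposed in this issue; not necessarily the actual stdlib code):
submodule (stdlib_experimental_stat) stdlib_experimental_stat_mean
contains
    ! Sketch: reduce along dim with the intrinsic sum, then divide by the
    ! number of elements along that dimension.
    module function mean_2_sp_sp(x, dim) result(res)
        real(sp), intent(in) :: x(:,:)
        integer, intent(in) :: dim
        real(sp) :: res(size(x)/size(x, dim))
        res = sum(x, dim) / real(size(x, dim), sp)
    end function mean_2_sp_sp
end submodule stdlib_experimental_stat_mean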
@jvdp1 Yes, I was responding to @leonfoks regarding optional. I think your interface is the way to go.
I couldn't find a viable solution for 3 or higher dimensional input arrays though.
@jvdp1 Yes, I was responding to @leonfoks regarding optional. I think your interface is the way to go.
Sorry @milancurcic for the misunderstanding
I couldn't find a viable solution for 3 or higher dimensional input arrays though.
Me neither. Any idea how it is implemented for Fortran sum?
If the community doesn't find a solution (I don't believe that :) ), should we first implement something for 1D and 2D arrays, and see later how to do it for >2D arrays (as for loadtxt and savetxt)?
For integer arrays, the API could look like:
module function mean_1_int8_dp(x) result(res)
integer(int8), intent(in) :: x(:)
real(dp) :: res
end function mean_1_int8_dp
If the array is integer, the result will always be dp.
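A minimal sketch of what the body could look like (assumed, not the actual implementation): convert to dp once, then divide by the number of elements.
module function mean_1_int8_dp(x) result(res)
    integer(int8), intent(in) :: x(:)
    real(dp) :: res
    ! Summing in dp (rather than in int8) avoids integer overflow;
    ! the division by the element count is then a dp operation.
    res = sum(real(x, dp)) / real(size(x), dp)
end function mean_1_int8_dp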
@milancurcic @certik @leonfoks I found a workaround (using the Fortran merge; edit: simpler version than the initial one) for 3D arrays (without using a function with an allocatable array) when dim is always a scalar. I am not sure about the performance of this workaround, but it allows the function mean to have the same behaviour as the Fortran sum, and it could be easily extended to ranks >3.
Function: mean(x) or mean(x, dim) (for 1D, 2D, 3D arrays)
As it is now implemented here, it returns the mean of the array elements of 1D, 2D, 3D arrays.
If dim is absent, a scalar with the mean of all elements in x is returned.
If dim is present, an array of rank n-1, where n equals the rank of x, and a shape similar to that of x with dimension dim dropped, is returned. The returned array contains the mean of the elements of x along dimension dim.
The result is of the same type as the array for real arrays, or is dp for integer arrays.
Issue: I tried to use pure functions. While it compiled well with a manual Makefile, CMake 3.16.1 does not like submodules + pure functions. Am I alone with this issue?
API for dp
interface mean
module function mean_1_dp_dp(x) result(res)
real(dp), intent(in) :: x(:)
real(dp) :: res
end function mean_1_dp_dp
module function mean_1_int8_dp(x) result(res)
integer(int8), intent(in) :: x(:)
real(dp) :: res
end function mean_1_int8_dp
module function mean_2_all_dp_dp(x) result(res)
real(dp), intent(in) :: x(:,:)
real(dp) :: res
end function mean_2_all_dp_dp
module function mean_2_all_int8_dp(x) result(res)
integer(int8), intent(in) :: x(:,:)
real(dp) :: res
end function mean_2_all_int8_dp
module function mean_2_dp_dp(x, dim) result(res)
real(dp), intent(in) :: x(:,:)
integer, intent(in) :: dim
real(dp) :: res(size(x)/size(x, dim))
end function mean_2_dp_dp
module function mean_2_int8_dp(x, dim) result(res)
integer(int8), intent(in) :: x(:,:)
integer, intent(in) :: dim
real(dp) :: res(size(x)/size(x, dim))
end function mean_2_int8_dp
module function mean_3_all_dp_dp(x) result(res)
real(dp), intent(in) :: x(:,:,:)
real(dp) :: res
end function mean_3_all_dp_dp
module function mean_3_all_int8_dp(x) result(res)
integer(int8), intent(in) :: x(:,:,:)
real(dp) :: res
end function mean_3_all_int8_dp
module function mean_3_dp_dp(x, dim) result(res)
real(dp), intent(in) :: x(:,:,:)
integer, intent(in) :: dim
real(dp) :: res( &
merge(size(x,1),size(x,2),mask = 1 < dim, &
merge(size(x,2),size(x,3),mask = 2 < dim )
end function mean_3_dp_dp
module function mean_3_int8_dp(x, dim) result(res)
integer(int8), intent(in) :: x(:,:,:)
integer, intent(in) :: dim
real(dp) :: res( &
merge(size(x,1),size(x,2),mask = 1 < dim, &
merge(size(x,2),size(x,3),mask = 2 < dim )
end function mean_3_int8_dp
end interface
@jvdp1 Great! This is exactly the solution I was trying to find yesterday but couldn't figure out. Yes, it looks like it will expand nicely to as many dims as we need. We can do as many as 15, though I never worked with more than 5-d arrays. There's a typo in your interface (missing parentheses):
module function mean_3_dp_dp(x, dim) result(res)
real(dp), intent(in) :: x(:,:,:)
integer, intent(in) :: dim
real(dp) :: res(merge(size(x, 1), size(x, 2), mask = 1 < dim), &
merge(size(x, 2), size(x, 3), mask = 2 < dim))
end function mean_3_dp_dp
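A quick standalone check of what those two nested merge expressions evaluate to (a hypothetical 2x3x4 array; each pair should be the shape of x with dimension dim dropped):
program check_merge_shape
    implicit none
    real :: x(2,3,4)
    integer :: dim
    do dim = 1, 3
        print '(a,i0,a,2(1x,i0))', 'dim=', dim, ' -> result shape:', &
            merge(size(x, 1), size(x, 2), 1 < dim), &
            merge(size(x, 2), size(x, 3), 2 < dim)
    end do
end program check_merge_shape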
I am not too concerned about performance here. Arguments to merge are all scalar, so this should reduce internally to if-branches at compile time.
Also, for stdlib I'd like to stress and re-iterate: ease of use and a nice API should take priority over performance. Let's worry about making a great API first, then, if needed, work on performance within the constraints of a great API design.
Yes, it looks like it will expand nicely to as many dims as we need. We can do as many as 15, though I never worked with more than 5-d arrays.
I use fypp to generate the files. I will try to extend them to 15 dimensions.
There's a typo in your interface (missing parentheses):
Sorry. Too fast copy-paste...
I am not too concerned about performance here. Arguments to merge are all scalar, so this should reduce internally to if-branches at compile time.
Me neither. For such functions, I prefer functionality over efficiency. If I need efficiency, I would probably implement it myself anyway.
Also, for stdlib I'd like to stress and re-iterate: ease of use and a nice API should take priority over performance. Let's worry about making a great API first, then, if needed, work on performance within the constraints of a great API design.
Currently the proposed API for mean is the same as Fortran sum, maxval, minval (if we ignore the mask argument).
Would it be possible to implement a mask in mean?
I agree to try to stick to Fortran conventions. Regarding performance, I would say great API and great performance are equal --- we can sacrifice a little bit of one to get a lot of the other, on a case by case basis. We should try not to sacrifice a lot of either.
I think everybody agrees on the following API for mean, which follows Fortran conventions and is similar to Fortran sum, minval, maxval, ...:
result = mean(x)
result = mean(x, dim)
with x being an array of type integer or real, and dim a scalar with a value in the range from 1 to n (with n <= 15), where n equals the rank of x.
The same Fortran conventions should be used for other similar functions (median, variance, standard deviation, geometric mean, ...).
For performance, the current implementation might be good with merge. I tried to use pure functions, but it seems there is an issue with CMake and submodules. do concurrent could then be used.
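For illustration, something along these lines could work (a sketch only, with an assumed helper name rather than the proposed stdlib interface): the mean along dim = 1 of a 2D dp array, with do concurrent over the remaining dimension.
pure function mean_dim1_sketch(x) result(res)
    real(dp), intent(in) :: x(:,:)
    real(dp) :: res(size(x, 2))
    integer :: j
    ! Each column mean is independent, so the loop iterations may run in any
    ! order (or in parallel, depending on the compiler).
    do concurrent (j = 1:size(x, 2))
        res(j) = sum(x(:, j)) / real(size(x, 1), dp)
    end do
end function mean_dim1_sketch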
@certik @milancurcic @leonfoks @ivan-pi @scivision Should I write a spec and submit a PR? Or do we need another iteration on the API?
Great work on this interface!
My only concern is: what will be the default behavior if the user-passed dim is larger than the rank of the array? Should the code simply fail?
do concurrent could then be used.
Could do concurrent modify the calculation sequence, meaning slightly different round-off errors upon each evaluation? (e.g. when compiled with the -qopenmp flag in Intel Fortran)
My only concern is: what will be the default behavior if the user-passed dim is larger than the rank of the array? Should the code simply fail?
I just added a
call error_stop("ERROR (mean): wrong dimension")
The issue is that the functions cannot be pure anymore.
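Concretely, the check could sit like this (a sketch, dp shown; calling error_stop is what prevents the function from being pure):
module function mean_2_dp_dp(x, dim) result(res)
    real(dp), intent(in) :: x(:,:)
    integer, intent(in) :: dim
    real(dp) :: res(size(x)/size(x, dim))
    ! Guard against an invalid dim at run time before computing the result.
    if (dim < 1 .or. dim > 2) call error_stop("ERROR (mean): wrong dimension")
    res = sum(x, dim) / real(size(x, dim), dp)
end function mean_2_dp_dp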
Could do concurrent modify the calculation sequence, meaning slightly different round-off errors upon each evaluation? (e.g. when compiled with the -qopenmp flag in Intel Fortran).
I think so (at least if auto-parallelisation or OpenMP is used).
Returns the mean of all the elements of ARRAY, or of the elements of ARRAY along dimension DIM.
RESULT = mean(ARRAY)
RESULT = mean(ARRAY, DIM)
ARRAY: Must be an array of type INTEGER or REAL.
DIM (optional): Must be a scalar of type INTEGER with a value in the range from 1 to n, where n equals the rank of ARRAY.
If ARRAY is of type REAL, the result is of the same type as ARRAY. If ARRAY is of type INTEGER, the result is of type double precision.
If DIM is absent, a scalar with the mean of all elements in ARRAY is returned. Otherwise, an array of rank n-1, where n equals the rank of ARRAY, and a shape similar to that of ARRAY with dimension DIM dropped is returned.
program test
use stdlib_experimental_stat, only: mean
implicit none
real :: x(1:6) = (/ 1., 2., 3., 4., 5., 6. /)
print *, mean(x) !returns 3.5
print *, mean( reshape(x, (/ 2, 3 /) )) !returns 3.5
print *, mean( reshape(x, (/ 2, 3 /) ), 1) !returns (/ 1.5, 3.5, 5.5 /)
end program
@certik @milancurcic @nncarlson Is such a specification document (in Markdown) desired alongside the module?
If this API for mean is acceptable, and also the document, I will create a PR for discussing its implementation (style, efficiency, ...). I implemented it up to 15 dimensions (with fypp).
When the implementation is OK, the implementation of a mask should be easy.
This mean function may also be used as a reference for other descriptive statistics (e.g., variance).
Thanks @jvdp1!
Is such a specification document (in Markdown) desired alongside the module?
I'd say yes.
Editorial nit-picks:
- array and dim instead of ARRAY and DIM;
- [2, 3] instead of (/2, 3/). This should go into the style guide as well;
- stats instead of stat, which can be confused for "status".
@jvdp1 thank you for starting this!
I suggest we put it alongside the module for now. Later on, I would like to have some automatic mechanism to parse these semantically (i.e., the tool would understand the sections as well as perhaps even the Fortran code) and produce nice online and PDF documentation.
See the draft PR #153 with a proposal for the k-th order central moment of an array.
Question: should such a function only support central moment, or should it also support absolute central moment, raw moment, and absolute raw moment?
See the draft PR #153 with a proposal for the k-th order central moment of an array.
A function moment() was implemented, and provided only the k-th central moment.
Based on the discussion in #153, the PR #161 proposes a function moment that provides raw and central moments. The means must be provided by the user. Go to #161 for more discussion.
In the mailing lists of the ISO C++ Study Group 19 (Machine Learning) I found the following proposal for simple statistical functions: https://docs.google.com/document/d/1VAgcyvL1riMdGz7tQIT9eTtSSfV3CoCEMWKk8GvVuFY/edit
The proposed functions include mean, median, mode, population_stddev, sample_stddev, population_var and sample_var.
@certik Do you know if the ISO Fortran committee also has such working groups for different application fields?
The Fortran committee has a group for HPC. I think that is the only application field --- the committee does not have many people, but as the committee grows, it could have more working groups in the future. CCing @sblionel, he might have some thoughts on this.
I don't have an adequate background in statistics to know if people writing statistics applications in Fortran would find it useful to have these functions in the language rather than in a separate library, nor even if that is a sizeable body of users. Fortran's emphasis is on scientific and engineering programming, of which HPC is closely related. I could certainly see these in some sort of stdlib - what would be the advantage of making them intrinsics? Would programmers find that these simple interfaces are adequate for their needs, or are they likely to look for something with more options?
The committee members largely have engineering/science backgrounds, so we don't have a lot of experience with other application fields, and this complicates our ability to judge usefulness of features aimed at other fields. We're always looking for new members, especially if they add diversity to the committee background. Unfortunately, the structure of standards work, especially in the US (but also in some other countries) places financial burdens on those who wish to contribute. J3 (US) works around this by naming people as alternates, which costs nothing, but alternates can't vote if their principal is present.
It's certainly something worth thinking about, but my preference would be to have it be in stdlib first and see how well it is accepted.
These are functions we use often (or even daily) in my field (i.e. quantitative genetics), and therefore we re-implement these functions quite often (or we switch to Octave/R/Julia to compute means, variances, regression coefficients, R2, ...).
Having them as intrinsics could be useful, or at least in stdlib (e.g. as here, with many options).
For me, the main advantage of having them as intrinsics would be to not have to link a simple program to a library out of which I just need the var or moment functions.
Thanks @sblionel for your answer. I do not see a need to have these available as intrinsic procedures, but I do believe having them in a library such as this one can ease the experience for (new) Fortran users. An off-topic question: how does membership in national committees work? Browsing through the documents on the WG5 website, I had the feeling the national committee used to play an important role in driving Fortran development.
In Alan Miller's Fortran Software there are many statistical functions, indicating that Fortran was used in this field. I also have a copy of the book "Programming for the social sciences: Algorithms and Fortran77 Coding" (from 1986), which discusses simple statistical functions. The book "Introduction to Computational Economics Using Fortran" also rolls its own versions of these functions.
Today, I am sure the majority of programmers prefer interpreted languages (Python, Julia, Matlab/Octave, R) for such work, or even spreadsheet programs (Excel, GraphPad, Origin, SPSS, etc.).
Re interpreted languages: I guess that is true for interactive use where you want to explore the data, but if you run into large amounts of information (say remote sensing images), a compiled language comes in quite handy. I think a comprehensive set of statistical functions would be most welcome. I have scanned Alan Millers' website for ideas myself :). And a lot of his software is rather more advanced than a mere mean value or other descriptive statistics.
Another great source of algorithms is the work of Michel Olagnon. A bit of googling gave me this URL: http://www.fortran-2000.com/rank/ (and similar ones). Michel used to be an active contributor to comp.lang.fortran.
Regards,
Arjen
An off-topic question: how does membership in national committees work? Browsing through the documents on the WG5 website, I had the feeling the national committee used to play an important role in driving Fortran development.
Briefly, each National Body has its own rules for membership. Typically, you must live in that country or be employed by a company with offices in that country. Each country has its own rules for fees and intellectual property. For Fortran specifically, WG5 (ISO/IEC) delegates the development of the standard to the US NB (PL22.3 aka J3). WG5 sets the feature list and votes on drafts. Practically speaking, there are several non-J3 members who regularly participate in the development work. The only NBs that actively participate in WG5 are Canada, Germany, Japan, UK and US.
Re interpreted languages: with both stdlib and LFortran fully developed in a few years, I expect the experience with Fortran can be very similar to that with Julia or Python in terms of interactive usage.
In testing my builds, I am having trouble compiling the statistical modules using Makefile.manual. On my machine Makefile.manual is invoking gfortran, I suspect version 10.2. stdlib_stats_moment and stdlib_stats_var are taking forever to compile. I am not having this slowdown using CMake, which I believe also invokes gfortran. Using Makefile.manual, gfortran is also issuing a number of warnings that I suspect indicate problems for large arrays. Examples of the warnings are:
stdlib_stats_moment.f90:26261:12:
n = count(mask, kind = int64)
1
Warning: Possible change of value in conversion from INTEGER(8) to REAL(4) at (1) [-Wconversion]
and
stdlib_stats_moment.f90:1312:12:
n = size(x, kind = int64)
1
Warning: Possible change of value in conversion from INTEGER(8) to REAL(4) at (1) [-Wconversion]
I don't think there is any advantage to invoking count and size with kind=int64 if the results are assigned to a variable of kind int32.
FWIW the command line for the Makefile.manual invocation of gfortran is
gfortran -Wall -Wextra -Wimplicit-interface -fPIC -g -fcheck=all -c stdlib_stats_moment.f90
That looks to me like n is declared as a real, which is likely an error (although I haven't looked at the code).
You're right, I misread it.
That looks to me like n is declared as a real, which is likely an error (although I haven't looked at the code).
n is the denominator of (possibly) multiple operations (divisions) between reals (sp, dp, or qp). Therefore, storing n as a real requires only one conversion.
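Schematically (a fragment with assumed surrounding declarations, not the actual stdlib code):
! n is converted from int64 to real(sp) once, then reused as the denominator
! of the subsequent divisions.
real(sp) :: n
n = size(x, kind = int64)
center = sum(x) / n
res = sum((x - center)**order) / n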
In testing my builds, I am having trouble compiling the statistical modules using Makefile.manual. On my machine Makefile.manual is invoking gfortran, I suspect version 10.2. stdlib_stats_moment and stdlib_stats_var are taking forever to compile.
There are hundreds of functions to compile in both submodules. Limiting the RANK to 4 might reduce the compilation time.
n is the denominator of (possibly) multiple operations (divisions) between reals (sp,dp, or qp). Therefore, storing n as real requires only 1 conversion.
That makes sense. At some point, as low hanging fruit for somebody, I'd recommend putting in the explicit conversion just to declutter the warning messages at least.
At some point, as low hanging fruit for somebody, I'd recommend putting in the explicit conversion just to declutter the warning messages at least.
At the time of implementation, I recommended letting the compiler use mixed-mode arithmetic and not doing explicit conversions. I won't dig out the specific comment and thread, but I vaguely remember that this wasn't a universally preferred opinion and we just went with it.
I personally don't appreciate some of gfortran's overly paranoid warnings about correct use of the language. In this specific example, it's okay; it could be useful to a user to know that there is implicit conversion happening. A more grave example is when gfortran warns you about trying to allocate a string on assignment (triggers -Wuninitialized). This is why I prefer carefully disabling some warnings rather than adding unnecessary explicit stuff to the code just to make the compiler happy.
But it's not only about the compiler. The code may be easier to understand with an explicit real() conversion. It's just that I am so used to mixed-mode arithmetic that redundant real() calls only add noise. But that's just me. If people prefer explicit, although unnecessary, type conversions in the code, I agree with the change.
I personally (I think) prefer explicit conversions in mixed-mode arithmetic; it helps to prevent unintended loss of accuracy.
@certik There's no loss of accuracy. The compiler will correctly promote the type as per the language rules. This is purely about explicit and verbose vs. implicit and concise.
$ cat mixed_mode1.f90
integer :: a = 3
real :: b = 1.234, c
c = a * b
print *, c
end
$ cat mixed_mode2.f90
integer :: a = 3
real :: b = 1.234, c
c = real(a) * b
print *, c
end
$ gfortran -S mixed_mode1.f90
$ gfortran -S mixed_mode2.f90
$ diff mixed_mode1.s mixed_mode2.s
1c1
< .file "mixed_mode1.f90"
---
> .file "mixed_mode2.f90"
5c5
< .string "mixed_mode1.f90"
---
> .string "mixed_mode2.f90"
@milancurcic I don't think there is any warning in the example you posted:
integer :: a = 3
real :: b = 1.234, c
c = a * b
print *, c
end
Such usage is fine and indeed there is no loss of accuracy. However in this example:
integer :: a = 3, c
real :: b = 1.234
c = a * b
print *, c
end
You get a warning and I personally like this warning, because you lose accuracy, and it might not have been intended by me:
$ gfortran -Wall a.f90
a.f90:3:4:
c = a * b
1
Warning: Possible change of value in conversion from REAL(4) to INTEGER(4) at (1) [-Wconversion]
Okay, yours is a better example. But explicit cast still doesn't help you:
$ cat mixed_mode3.f90
integer :: a = 3, c
real :: b = 1.234
c = real(a) * b
print *, c
end
$ gfortran -Wall mixed_mode3.f90
mixed_mode3.f90:3:4:
c = real(a) * b
1
Warning: Possible change of value in conversion from REAL(4) to INTEGER(4) at (1) [-Wconversion]
How would you use an explicit cast in this example to get rid of the warning?
Sorry, I don't think this is a good example either.
The relevant example is one upthread:
n = size(x, kind = int64)
1
Warning: Possible change of value in conversion from INTEGER(8) to REAL(4) at (1) [-Wconversion]
where you can put that inside of a real() as Brad suggested.
My point is that adding a real() here won't do anything for precision, but only for readability.
The example I posted gets fixed by an explicit cast to an int (saying, as a user, to the compiler that I am explicitly truncating the real to an integer):
integer :: a = 3, c
real :: b = 1.234
c = int(a * b)
print *, c
end
In the example you posted:
n = size(x, kind = int64)
The issue is that n must be real(4) (btw, shouldn't that be an integer?). The warning comes from the fact that a 64-bit integer can't always fit into a 32-bit real, and thus you lose accuracy. I.e., this works (no warning, no loss of accuracy):
integer(4) :: a = 3
real(4) :: b = 1.234
b = a
print *, b
end
but this produces a warning (and there is a possible loss of accuracy if the integer was large enough):
integer(8) :: a = 3
real(4) :: b = 1.234
b = a
print *, b
end
As a user, I personally like that, as it almost always (for me) means I made a mistake and didn't realize there is a loss of accuracy in the conversion. If I want it, I can always type it explicitly, then it is clear to both the (human) reader as well as the compiler what is intended.
Good point, I didn't consider all the possibilities, especially the one of the integer being too large. Jeremie explained here why n is real.
It looks like the warning worked as expected: you didn't realize that there is a possible loss of accuracy. :) And the fix is to put an explicit cast to real(4) in the code, then it is clear to everybody.
Update: however, the compiler warning should have been worded differently: it should say that there is a possible loss of accuracy because the integer(8) might not always fit into real(4).
It looks like the warning worked as expected: you didn't realize that there is a possible loss of accuracy. :) And the fix is to put an explicit cast to real(4) in the code, then it is clear to everybody.
Thank you for the explanations. I will have a look and open a PR with explicit casts for these several warnings.
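i.e., something along these lines, taking the line from the warning above (sp assumed as the variable's kind):
! The explicit cast documents the intended int64 -> real(sp) conversion
! and silences -Wconversion.
n = real(size(x, kind = int64), sp)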
Now, to be completely precise, you lose accuracy (in principle) any time you assign an integer to a real, even of the same size. Here is an example:
integer(4) :: a = 1234567890
real(4) :: b = 1.234
b = a
print *, a
print *, b
end
Which does not warn, but prints:
1234567890
1.23456794E+09
So the last digit is now 4 instead of 0. But I assume this is such a common operation (32-bit integer to 32-bit real) that the compiler does not warn by default (you only lose "a little" accuracy), and only warns if you assign a 64-bit integer to a 32-bit real, as you might lose a lot (half the digits of the integer number in some cases, I think).
Overview
It would be nice to have a module in stdlib that provides functions for computing means, variances, medians, ... of vectors, and of rows (or columns) of 2D arrays (at least); see the mean example below. The same could be implemented for variance, median, ..., so the API of all functions would be (almost) the same.
API
Let's discuss the API of only mean, for a vector first, and then for an array.
For a vector:
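A sketch, matching the interface quoted at the top of this thread (sp shown; the other kinds would be analogous):
function mean_sp_sp(vector) result(res)
    real(sp), intent(in) :: vector(:)
    real(sp) :: res
end function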
For a 2D array:
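A sketch along the lines of the 2D version quoted at the top of the thread (the optional dim and the allocatable result are exactly what is debated in the comments):
function mean_sp_sp(mat, dim) result(res)
    real(sp), intent(in) :: mat(:,:)
    integer, intent(in), optional :: dim !dim = 1 or dim = 2
    real(sp), allocatable :: res(:)
end function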
If dim = 1, it returns the mean of each row (so res(1:size(mat,1))). If dim = 2, it returns the mean of each column (so res(1:size(mat,2))).
Here (generated manually with fypp) is an example for mean in stdlib. The same API could be used for variance, median, cumulative sum, geometric mean, ...
Should we support arrays of rank > 2? E.g., what would mean(mat(:,:,:,:), dim=3) return?
Should we use functions or subroutines (and overload =)?
The result of the procedure would be of the same kind as the input, and (implicit) conversion would be performed by the user. Functions could then be used.
Alternatively: for real arrays, procedures would return a result of the same kind, or of a lower kind, than the argument (e.g., the mean of a dp array would return the result in sp or dp). All computations inside the procedure would be performed in the same kind as the input array, and the result would be converted just before the function returns. For integer arrays, procedures would return a result of a real kind (e.g., the mean of an int64 array would return the result in sp, dp, or qp). All computations inside the procedure would be performed in the same kind as the result.
Implementation
Probably most of us have some implementations. @leonfoks also has an implementation for 1D arrays on GitHub. I would think about a module called stdlib_experimental_stat.f90 and multiple submodules (one per stat, e.g., stdlib_experimental_stat_mean.f90, that contains all functions related to that stat). The first PR would contain only one stat, e.g. mean, to facilitate the discussion.
Currently in stdlib
- mean (mean)
- variance (var)
- central moment (moment)
Possible additional functions
- standard deviation (std)
- median (median)
- mode (mode)
Others
- covariance (cov)
- correlation (corr)
Other languages
Matlab, Numpy, Octave, R