Open milancurcic opened 4 years ago
@urbanjost Thanks for the example! Your approach is quite elegant and much shorter than the preprocessed version, I agree!
One big disadvantage of any class(*) based solution is, however, that it prohibits extension by consumers of the library. Assume a routine like your anyscalar_to_double() function is part of the library. A consumer of the library would not be able to extend it to additional data types without changing the library itself. If we instead have routines with specific types hidden behind a generic interface, a consumer can add further routines with additional data types to that generic name in their own code, without having to touch the standard library itself.
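As a sketch of this point (module and procedure names here are hypothetical, not part of stdlib), a consumer can attach a new specific procedure to a library's generic name without modifying the library:

```fortran
! Library side: specific routines hidden behind a generic name.
! All names (lib_conv, to_double, ...) are illustrative only.
module lib_conv
  implicit none
  interface to_double
    module procedure int32_to_double, real32_to_double
  end interface
contains
  pure function int32_to_double(x) result(r)
    integer, intent(in) :: x
    double precision :: r
    r = dble(x)
  end function
  pure function real32_to_double(x) result(r)
    real, intent(in) :: x
    double precision :: r
    r = dble(x)
  end function
end module

! Consumer side: extend the same generic name with an additional
! specific procedure, without touching the library source.
module my_ext
  use lib_conv, only: to_double
  implicit none
  interface to_double
    module procedure complex_to_double
  end interface
contains
  pure function complex_to_double(x) result(r)
    complex, intent(in) :: x
    double precision :: r
    r = dble(real(x))   ! take the real part, purely for illustration
  end function
end module
```

With a class(*) dispatch routine, by contrast, adding a new accepted type means editing the select type block inside the library itself.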
Thanks @urbanjost for the class example. I like class and use it for some specific things. However, I would argue against it for the implementation of mean, for mainly three reasons (in addition to @aradi's one):

- Using class, we would still need different functions for the real arrays, because mean must return a result of the same type as the input for real arrays. It is only when integer arrays are provided to mean that dp results are returned. The current behaviour of mean is in agreement with the intrinsic sum function, and I believe it is a nice feature.
- I think we also need to focus a bit on efficiency inside stdlib; the user does not care about the readability of the code or how things are implemented in the library. In the case of mean, I would say we still must focus on efficiency.
- The current implementation is probably more portable across compilers than class, which may not be supported by all compilers (especially older ones).

Finally, I think class could be useful for the subroutines loadtxt and savetxt, where preprocessing is also used to repeat the code for different kinds.
Well, the main argument against implementing mean(x) this way is that the meat of the computation is
sum(x)/real(size(x))
which is ALREADY rank and type independent, because the involved intrinsics are already generic. Generally, writing a trivial one-line function is bad programming practice: just write the line in the spot where the function would be referenced. A more sensible example might be helpful.
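For concreteness, this is roughly the trivial wrapper being objected to (a sketch, not the actual stdlib implementation, which is generated for many kinds and ranks):

```fortran
! Sketch of the one-line wrapper under discussion: the whole body is a
! single generic expression, so a caller could just write it inline.
pure function mean_1d(x) result(res)
  real, intent(in) :: x(:)
  real :: res
  res = sum(x) / real(size(x))
end function mean_1d
```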
Cheers, Bill
Bill Long longb@cray.com Principal Engineer, Fortran Technical Support & voice: 651-605-9024 Bioinformatics Software Development fax: 651-605-9143 Cray, a Hewlett Packard Enterprise company/ 2131 Lindau Lane/ Suite 1000/ Bloomington, MN 55425
Well, the main argument against implementing mean(x) this way is that the meat of the computation is...
@longb I am not sure which way you meant. From your comment, I understand you are against the mean function because it is a simple one. I first implemented it with loops, which were removed during review (mainly to improve the clarity of the code, and that was a good thing IMO).
Otherwise, the main goal is to develop a module with multiple statistical functions (mean, variance, median, ...) inside stdlib. A module about statistics without mean would make little sense to me. Would it?
I think that this submodule could serve as an example for other functions that will be included in stdlib_experimental_stats (it also helped spot some issues, like how to return NaN), like computing the variance with Welford's method:
function variance(x) result(res)
   real, intent(in) :: x(:)
   real :: res
   integer :: i
   real :: M, S, delta
   M = 0.
   S = 0.
   do i = 1, size(x)
      delta = x(i) - M
      M = M + delta/real(i)
      S = S + delta*(x(i) - M)
   enddo
   res = S/real(size(x)-1)
end function
Is this a more sensible example, assuming that it would be extended to have an API compatible with sum (i.e., with dim and mask included)?
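As a rough sketch of what a sum-compatible API might look like (the signature and behaviour here are hypothetical, not the final stdlib spec), an optional mask argument could be folded into the same single pass:

```fortran
! Hypothetical sketch: Welford variance with an optional mask,
! mirroring sum's optional mask argument. Not the stdlib spec.
function variance_masked(x, mask) result(res)
  real, intent(in) :: x(:)
  logical, intent(in), optional :: mask(:)
  real :: res
  integer :: i, n
  real :: M, S, delta
  M = 0.
  S = 0.
  n = 0
  do i = 1, size(x)
    if (present(mask)) then
      if (.not. mask(i)) cycle   ! skip masked-out elements
    end if
    n = n + 1
    delta = x(i) - M
    M = M + delta/real(n)
    S = S + delta*(x(i) - M)
  end do
  res = S/real(n - 1)   ! assumes at least two unmasked elements
end function variance_masked
```

A dim argument would additionally require rank-specific versions, which is where the preprocessing discussed above comes in.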
Indeed, the only reason we started with mean was because it was the simplest. More complicated functions will follow soon. Finally, yes, it could easily be computed by hand using sum and size, but other libraries also have such functions, such as NumPy's mean or Matlab's mean (for examples in other languages see #113).
Comparison with Python or Matlab does not seem relevant. In both cases, native code is very slow, and they have to resort to libraries for any reasonable performance. Most compiled languages don’t have native support for arrays. Fortran does.
Interesting example, though the code for Welford's method does not vectorize and would not perform well on modern processors. I tried a simple comparison of the direct code and your Welford example:
program test_var
   use, intrinsic :: iso_fortran_env, only: int64, real64
   implicit none
   real :: x(1000), v
   integer(int64) :: t1, t2, t3
   real(real64) :: rate

   call random_number(x)

   call system_clock(t1, count_rate=rate)
   v = variance(x)
   call system_clock(t2)
   print *, v, (t2-t1)/rate, " sec for Welford method"

   call system_clock(t2)
   v = var(x)
   call system_clock(t3)
   print *, v, (t3-t2)/rate, " sec for simple method"

contains

   function var(x) result(res)
      real, intent(in) :: x(:)
      real :: res
      real :: M, S
      M = sum(x)/real(size(x))
      S = sum((x - M)**2)
      res = S/real(size(x)-1)
   end function var

   function variance(x) result(res)
      real, intent(in) :: x(:)
      real :: res
      integer :: i
      real :: M, S, delta
      M = 0.
      S = 0.
      do i = 1, size(x)
         delta = x(i) - M
         M = M + delta/real(i)
         S = S + delta*(x(i) - M)
      enddo
      res = S/real(size(x)-1)
   end function variance

end program test_var

The output on a generic Intel processor that has SSE registers:

ftn test.f90
./a.out
 8.192101866E-2, 7.15724721261053416E-6 sec for Welford method
 8.192101866E-2, 7.80853517877739328E-7 sec for simple method
As you can see, the “simple” method is faster.
Cheers, Bill
Comparison with Python or Matlab does not seem relevant. In both cases, native code is very slow, and they have to resort to libraries for any reasonable performance.
They do not do it for performance reasons, but for convenience. If you look at NumPy's implementation, it is no faster than doing it by hand in NumPy. So performance is not the reason why NumPy has this API. Rather, I suspect, the reason is that users like to have such a function.
@longb The comparison to Python and MATLAB is quite relevant because it informs our API design. We share much of the target audience with those and some other languages.
Welford's method takes care of two things: possible numerical instability, and being a single-pass algorithm. It's a choose-your-poison scenario, as with most robust computation algorithms.
Regardless of the examples we use and the implementation we finally decide on for a generally applicable stdlib, I also think a lot of people will appreciate a centralized statistics module. More and more people are using Python and Matlab over Fortran; following their direction at this point is absolutely the way to go, and it saves us a lot of effort too!

Discussion about individual algorithms is key to developing the best choice in each case. If the choice of any one algorithm is a bottleneck in a user's code, hopefully they give us some feedback, but they are also free to roll their own version of that algorithm.
If a statistic or any kind of numerical algorithm is to make it into the stdlib, I think it should be highly optimized or not implemented at all (since, as Bill Long said, Fortran has enough intrinsics not to need a function for the average). If the algorithm is sub-optimal and does not use the best vectorization possible, then nobody will use it and it will be dead. The whole point of using Fortran is that the compiler can optimize things, but many of these optimizations are not possible if the calculations are hidden behind calls to libraries. Simple procedures (such as mean) particularly "feel" that penalty; more complicated ones, not so much. So maybe that could be a criterion for what makes it into the numerical part of stdlib: is the procedure simple enough that calling it would introduce a significant penalty? After all, I believe what makes it into stdlib should in the end be considered exemplary, the best way of doing things.
Sorry for the spam, one more thought. There was an argument some time ago along the lines of "procedures should not be standardized/implemented if they are simple to write". While I generally disagree with that statement, I think numerical algorithms are where some exception needs to be made, because Fortran excels in numerical computation (unlike in everything else) and its syntax allows writing most things in the easiest and most efficient way. So I would dare to argue that the numerical part should be the least focus of the stdlib, because it is the part that least needs "fixing". What do you think?
To answer that, we need to have benchmarks.
And as a user, I feel the compilers are not doing their job if they can't inline functions like mean(). That is one of the motivations why I started LFortran, to try to fix such deficiencies. This is long term, obviously.
The idea of all the efforts that we are doing (J3 GitHub, stdlib, fpm, as well as new compilers) is to fix all the things that have been bothering us about Fortran. It's not going to get fixed overnight, but it will get fixed eventually.
Even simple routines should consider robustness, not just speed, and in a perfect world should avoid pitfalls users are not always considering, such as data conditioning. Are you more impressed by a compiler whose SUM() returns the value of OUTLIER() when it comes first in the array, or one that generates the last value calculated no matter where the OUTLIER() value is placed in the ARR() array?
program testit
   integer, parameter :: elements = 10000000
   real :: arr(elements)
   real :: summary
   real :: outlier = huge(summary)/10.0**30
   integer :: i
   arr = 1.0
   arr(1) = outlier
   write(*,*) 'biggest=', arr(1)
   summary = 0.0
   do i = 1, elements
      summary = summary + arr(i)
   enddo
   write(*,*) 'loop sum biggest first=', summary
   write(*,*) 'biggest first sum=', sum(arr)
   summary = 0.0
   arr(1) = 1.0
   arr(elements) = outlier
   do i = 1, elements
      summary = summary + arr(i)
   enddo
   write(*,*) 'loop sum biggest last=', summary
   write(*,*) 'biggest last sum=', sum(arr)
end program testit
Do you get this:
biggest= 340282336.
loop sum biggest first= 340282336.
biggest first sum= 340282336.
loop sum biggest last= 350282336.
biggest last sum= 350282336.
or is "biggest first sum" 350282336. ?
This is an artificial case, but it reflects errors that often happen in real-world problems. So I can easily imagine a MEAN() function that is a lot more substantial than a replacement for a single line; and a MASK ability is very useful but can run into similar issues, e.g. a mask that says X.ne.REAL_VALUE. Does speed outweigh robustness in stdlib? I have to admit I have seen fewer intrinsic implementations that condition data lately. Is the cost of such checks so high that we do not want them for "typical" datasets?
There can be different versions of sum; e.g., Julia has a regular sum and also sum_kbn, a much slower but more accurate summation algorithm.

(None of my production applications depend on the order of summation, but for some users that can be helpful, and in that case they can call such a different method.)
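For reference, a compensated summation in Fortran looks roughly like the sketch below; it reduces the order-dependence of the result at the cost of extra floating-point operations. (This is the classic Kahan scheme; Julia's sum_kbn uses the related Kahan–Babuška–Neumaier variant.)

```fortran
! Sketch of Kahan compensated summation: a running compensation term c
! recovers low-order bits lost when a small term is added to a large sum.
pure function kahan_sum(x) result(s)
  real, intent(in) :: x(:)
  real :: s, c, y, t
  integer :: i
  s = 0.
  c = 0.
  do i = 1, size(x)
    y = x(i) - c        ! correct the next term by the compensation
    t = s + y           ! large sum + corrected small term
    c = (t - s) - y     ! recover what was lost in that addition
    s = t
  end do
end function kahan_sum
```

Note that aggressive optimization flags that allow floating-point reassociation can defeat the compensation, so such a routine would need to be compiled with care.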
Anyway, this is not related to the style guide. Please open separate issues for that. And even better, let's start discussing and submitting PRs for all this functionality, so that we can get towards our goal sooner.
As you can see, the “simple” method is faster.
If a statistic or any kind of numerical algorithm is to make it in the stdlib, I think it should be highly optimized or not implemented at all (since, as Bill Long said, Fortran has enough intrinsic to not need a function for average). If the algorithm is sub-optimal and not using the best vectorization possible then nobody will use it and it will be dead.
In a general way, IMO the algorithms must first be robust and accurate (at least as much as the intrinsic functions; see comment) for the vast majority of the cases found in the community (and not for a personal/specific case). What is the point of a fast and very efficient algorithm that returns a wrong answer in some general cases? E.g., the "simple" method in comment is faster (while I think most of you could optimize the other one), but in many of my real cases this "faster" function may return a wrong answer. So there is a trade-off between accuracy/robustness and efficiency, and this may vary depending on the real scenarios. We need to weigh the advantages and disadvantages of both (e.g., I would be happy if the faster method is implemented, but the spec should mention its limitations if its robustness is below that of the intrinsic functions).
Finally, I think stdlib should try to answer the needs of the community, which probably includes a large number of "average" Fortran users, and not the needs of the current stdlib developers (who most likely already have something like stdlib for their own use). "Average" Fortran users would probably prefer (a real case in my field):
integer(int8) :: x(:,:)
print*, mean(x, x < 3) !dp result
over
integer(int8) :: x(:,:)
print*, sum(real(x,dp), x < 3) / real(count(x < 3, kind = int64), dp)
If you are interested in the discussion about descriptive statistics and want to participate in it, please see #113 and #128. For discussion on a possible trade-off between efficiency and robustness/accuracy, see #134.
I am sorry to have opened a discussion about the utility/usefulness/efficiency of specific procedures (mean, variance); it was not my aim. I hope this issue will restart where it was, i.e., here.
Should we start a Wiki page for the style guide? I believe we have already converged on some common practices in the current stdlib modules.

Edit: I forgot we already have STYLE_GUIDE.md. I will create a pull request once I have something.
I've been using the Google C++ Style Guide lately and found it very helpful for maintaining clean code and refreshing my memory on best practices. The table of contents is very practical for quickly locating the desired language feature.
A few similar (project-specific) Fortran style guides are:

- @certik's Fortran Best Practices
- DFTB+ Developers Fortran style guide
- CP2K Coding Conventions
- Object Modeling System Fortran 90/95 Coding Conventions from Colorado State University
- Fortran Coding Standards for JULES (UK Met Office)
- Fortran coding standard for FCM (UK Met Office)

Other inspiration:

- Boost Library Requirements and Guidelines
- PEP 8 -- Style Guide for Python Code
Interesting lists. Particularly the CP2K and UK Met Office sites, which actively advocate avoiding F2003 as being "too new". This points out an important issue with coding style documents: they need to incorporate a scheme for constant updating as the underlying language changes. (It is also helpful to understand how CP2K gets to be so bug-ridden.)
Cheers, Bill
Bill Long longb@hpe.com Engineer/Master , Fortran Technical Support & voice: 651-605-9024 Bioinformatics Software Development fax: 651-605-9143 Hewlett Packard Enterprise/ 2131 Lindau Lane/ Suite 1000/ Bloomington, MN 55425
What's the current consensus on the space in an end-block keyword, e.g. enddo or end do?

I think the consensus is to use end do, and simply to use the amount of spacing already used in a given file, for consistency.
I have just found a few unnecessary semicolons at the end of statements in stdlib_ascii.fypp, for example:

https://github.com/fortran-lang/stdlib/blob/22c1be0b8c47c81223c1f84c854cc582b6528a27/src/stdlib_ascii.fypp#L146

This may be a style issue for those who use both Fortran and C-like languages, and hopefully it can be avoided by the use of linters in the future.
Use this issue to discuss the code style for the stdlib.
The most widely supported elements of style will eventually be merged into the Style Guide for contributors.