fortran-lang / stdlib

Fortran Standard Library
https://stdlib.fortran-lang.org
MIT License
1.08k stars 167 forks source link

Proposal for statistical distributions #234

Open Jim-215-Fisher opened 4 years ago

Jim-215-Fisher commented 4 years ago

Besides common descriptive statistics, we need standard modules for various continuous statistical distribution (e.g., gamma distribution) and discrete distribution (e.g., bernoulli distribution). These statistical distributions will be very useful to various computer simulation techniques. Even though these functions are available in Scipy package for python, I think it is worthwhile to have in stdlib with pure Fortran. There are plenty of source codes on the net, we just need to convert them.

epagone commented 4 years ago

Prior art.

jvdp1 commented 4 years ago

I agree with @Jim-215-Fisher 's comment. Thank you for the suggestion.

To add to the prior art mentioned by @epagone:

Jim-215-Fisher commented 4 years ago

Should stdlib be based on GSL through fgsl?

certik commented 4 years ago

@Jim-215-Fisher Unfortunately GSL is GPL licensed, which prohibits its inclusion into stdlib which is MIT licensed.

ivan-pi commented 4 years ago

Don't forget the Software from Alan J. Miller: https://wp.csiro.au/alanmiller/random.html (a mirror exists at https://jblevins.org/mirror/amiller/)

arjenmarkus commented 4 years ago

i agree with that: it is an extensive collection provided by a well-known professional in the field. It may require some reorganisation, as it is not organised in modules and the like, but it is certainly worth our while.

Op di 22 sep. 2020 om 12:59 schreef Ivan notifications@github.com:

Don't forget the Software from Alan J. Miller: https://wp.csiro.au/alanmiller/random.html (a mirror exists at https://jblevins.org/mirror/amiller/)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-696649490, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YRZSJTTC675SELMLNWLSHB7Q3ANCNFSM4RUXIQFQ .

jvdp1 commented 4 years ago

i agree with that: it is an extensive collection provided by a well-known professional in the field. It may require some reorganisation, as it is not organised in modules and the like, but it is certainly worth our while. Op di 22 sep. 2020 om 12:59 schreef Ivan notifications@github.com: Don't forget the Software from Alan J. Miller: https://wp.csiro.au/alanmiller/random.html (a mirror exists at https://jblevins.org/mirror/amiller/) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#234 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YRZSJTTC675SELMLNWLSHB7Q3ANCNFSM4RUXIQFQ .

Indeed, I agree. However, it is unclear to me what the license is for these files.

arjenmarkus commented 4 years ago

It says that the code written by Alan Miller is in the public domain, I will ask Jason what this means exactly.

Op di 22 sep. 2020 om 13:07 schreef Jeremie Vandenplas < notifications@github.com>:

i agree with that: it is an extensive collection provided by a well-known professional in the field. It may require some reorganisation, as it is not organised in modules and the like, but it is certainly worth our while. Op di 22 sep. 2020 om 12:59 schreef Ivan notifications@github.com: … <#m-2168024105408715737> Don't forget the Software from Alan J. Miller: https://wp.csiro.au/alanmiller/random.html (a mirror exists at https://jblevins.org/mirror/amiller/) — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#234 (comment) https://github.com/fortran-lang/stdlib/issues/234#issuecomment-696649490>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YRZSJTTC675SELMLNWLSHB7Q3ANCNFSM4RUXIQFQ .

Indeed, I agree. However, it is unclear to me what the license is for these files.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-696652720, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR2XPP2KHV2B46IZE3LSHCANJANCNFSM4RUXIQFQ .

Jim-215-Fisher commented 4 years ago

If it is public domain, then there is no copyrights.

Jim-215-Fisher commented 4 years ago

In terms of license, as long as there are mathematic formulae, I think license issue should not be a concern.

Based on comments so far, it looks like majority agreed to have statistical distributions in stdlib. So the next question is what kind of API should be. We can either define a data type for each distribution and various type-bound procedures, or define various procedures for each type of distribution directly in the stdlib module. For example, normal distribution one could have:

module stdlib type :: norm_dist_t real :: x real :: loc real :: scale

contains generic :: pdf => norm_dist_pdf generic :: cdf => norm_dist_cdf . . . end type norm_dist_t

contains {procedure definition} . . . end module stdlib

or directly in stdlib module:

module stdlib function norm_dist_pdf(x, loc, scale) end function norm_dist_pdf . . . end module stdlib

The first one is object oriented, data and procedure are encapsulated. The second one is more traditional like intrinsic function call familiar to users. Which one is better?

arjenmarkus commented 4 years ago

In these matters it is not likely that there is a "best" solution. It is a matter of taste, I would say. But I am leaning towards a procedural style in this. Typical use of this functionality:

I do see the attractiveness of an object-oriented interface, especially if you want to examine several different data sets against the same distribution, but it feels indirect in many cases. So I would prefer a functional/procedural interface, at least for the moment. An OO interface can be added later.

Op wo 23 sep. 2020 om 00:56 schreef Jing notifications@github.com:

In terms of license, as long as there are mathematic formulae, I think license issue should not be a concern.

Based on comments so far, it looks like majority agreed to have statistical distributions in stdlib. So the next question is what kind of API should be. We can either define a data type for each distribution and various type-bound procedures, or define various procedures for each type of distribution directly in the stdlib module. For example, normal distribution one could have:

module stdlib type :: norm_dist_t real :: x real :: loc real :: scale

contains generic :: pdf => norm_dist_pdf generic :: cdf => norm_dist_cdf . . . end type norm_dist_t

contains {procedure definition} . . . end module stdlib

or directly in stdlib module:

module stdlib function norm_dist_pdf(x, loc, scale) end function norm_dist_pdf . . . end module stdlib

The first one is object oriented, data and procedure are encapsulated. The second one is more traditional like intrinsic function call familiar to users. Which one is better?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-697025016, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR74DUUVPIAE2UBA4WDSHETTHANCNFSM4RUXIQFQ .

jvdp1 commented 4 years ago

My preference is a procedural style. This choice "procedural" vs "OOP" style has been already discussed in previous issues (I can't find them back), and, if I remember wel, procedural style should be prefered over OOP. Of course similar functionality in a OOP style could be always proposed on top of the procedural ones.

Jim-215-Fisher commented 4 years ago

Yeh, I have checked the style guide and can't find any recommendation. Anyway, procedure style is the Fortran way. I will go ahead to implement a small module.

certik commented 4 years ago

@Jim-215-Fisher we should put it into the style guide, great idea. Here are some links where it was discussed in the past:

Jim-215-Fisher commented 4 years ago

@certik Thanks for the links. BTW, is there a table/list/link showing status of each stdlib module/proposal?

certik commented 4 years ago

The latest status is in the open issue and PR for a given proposal. We don't have a nice table summarizing it.

On Wed, Sep 23, 2020, at 7:19 PM, Jing wrote:

@certik https://github.com/certik Thanks for the links. BTW, is there a table/list/link showing status of each stdlib module/proposal?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-698055250, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAFAWGOUJQG5L5OALIM5ITSHKNDHANCNFSM4RUXIQFQ.

jrblevin commented 4 years ago

It says that the code written by Alan Miller is in the public domain, I will ask Jason what this means exactly.

Arjen, thanks for writing to ask about the licensing. When I took over hosting Alan Miller's files no license was stated. So I asked him about that and he told me he intended his code to be public domain. So, his work can be incorporated into libraries, such as this one, with other licenses.

arjenmarkus commented 4 years ago

Hi Jason,

great, this work deserves widespread use. And having it available via the Fortran Wiki and hopefully at some point via the standard library (or something similar) will make it much easier.

Op vr 25 sep. 2020 om 16:09 schreef Jason Blevins <notifications@github.com

:

It says that the code written by Alan Miller is in the public domain, I will ask Jason what this means exactly.

Arjen, thanks for writing to ask about the licensing. When I took over hosting Alan Miller's files no license was stated. So I asked him about that and he told me he intended his code to be public domain. So, his work can be incorporated into libraries, such as this one, with other licenses.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-698950724, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR5H3ALI5OV5IHOSLOLSHSP73ANCNFSM4RUXIQFQ .

jvdp1 commented 4 years ago

Thank oyu @jrblevin and @arjenmarkus for these explanations. These codes would be a good start for this proposal IMO.

Jim-215-Fisher commented 4 years ago

In terms of Alan Miller's code, should we use his code/module directly, or reorganize it according to current style?

arjenmarkus commented 4 years ago

I would say we need to reorganize it - make sure things are consistent within the standard library. That will help people to understand how to use it and to avoid certain types of mistakes.

Op zo 27 sep. 2020 om 02:22 schreef Jing notifications@github.com:

In terms of Alan Miller's code, should we use his code/module directly, or reorganize it according to current style?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fortran-lang/stdlib/issues/234#issuecomment-699564501, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6YR3AE3OZ7AA3OV5VG3LSH2ATZANCNFSM4RUXIQFQ .

jvdp1 commented 4 years ago

I agree with @arjenmarkus . We will need to re-organize the code (and probably re-write some parts) to be consistent with stdlib style guide. It is anyway a great start.

It seems that the people involved in this thread would agree to use Alan Miller 's code as a starting point, and that a procedural approach should be used (in agreement with several other discussions). If I may propose, @Jim-215-Fisher would you be interested to start a proposal for API?

Jim-215-Fisher commented 4 years ago

Yes, I am working on it, hopefully send PR soon.

David-Duffy commented 4 years ago

Hi. A certain number of the routines from Alan Miller's site are from other sources - for example, the Applied Statistics routines have a funny licence (no commercial use), which tripped me up when I used them for an R package. The Rmath library might be a good model for what working statisticians want (esp in terms of numerics).

Jim-215-Fisher commented 4 years ago

The initial implementation of this proposal is ready for PR. Anyone interested is welcome to review it. I have implemented three distributions at the moment, uniform, normal and binomial distribution random number generators and pdf, cdf functions. Will implement more distributions and functions after collecting comments and suggestions. The code is in my branch https://github.com/Jim-215-Fisher/stdlib/tree/stats_distribution

jvdp1 commented 4 years ago

The initial implementation of this proposal is ready for PR. Anyone interested is welcome to review it. I have implemented three distributions at the moment, uniform, normal and binomial distribution random number generators and pdf, cdf functions. Will implement more distributions and functions after collecting comments and suggestions. The code is in my branch https://github.com/Jim-215-Fisher/stdlib/tree/stats_distribution

Thank you @Jim-215-Fisher . Can you open a PR? It will be easier for reviewing and discussing the API and code.

esterjo commented 3 years ago

I'd like to call out this RNG library, which I use regularly. There is a Fortran wrapper to the version producing 32bit random values online but I think it's worth implementing the whole thing. I use this library with C++. There the API is that one first instantiates an RNG object (say pcg), and a distribution object (say normal_dist) and calls like normal_dist(pcg) return a normal variate using this particular RNG. I find this API very smooth and flexible. Allows switching out RNGs. But I have no problem with a strictly procedural implementation either.

Also, looking at some of the implementations in the links, I see that a few of them generate 2 uniform random numbers to return only 1 variate. See for example, the Box-Muller implementation in GSL, where they explain their choice. This is may be a bit wasteful as two covariates might be returned instead of 1, when the algorithm produces 2 independent covariates. Saving one of the covariates computed using a SAVE attribute or simply filling entire arrays at once may be a better implementation.

ivan-pi commented 3 years ago

I'd like to call out this RNG library, which I use regularly. There is a Fortran wrapper to the version producing 32bit random values online but I think it's worth implementing the whole thing. I use this library with C++. There the API is that one first instantiates an RNG object (say pcg), and a distribution object (say _normaldist) and calls like _normaldist(pcg) return a normal variate using this particular RNG. I find this API very smooth and flexible. Allows switching out RNGs. But I have no problem with a strictly procedural implementation either.

I was intrigued by the simplicity of this generator and attempted a quick Fortran version. I haven't verified it's correctness yet, but the timings are promising (roughly the same order as the intrinsic random_number subroutines in gfortran and Intel fortran). I have compared the bit sequences of the random integers from this version, and the Fortran wrapper of the C version, and they are equal. The conversion from signed integers to double is still problematic. Portability might also be an issue.

I would suggest preserving the discussion of a random number generator object under issue https://github.com/fortran-lang/stdlib/issues/135.

esterjo commented 3 years ago

Just for reference here is libstdc++'s implementation of the gaussian variate generator, which is a very standard implementation of Marsaglia's algorithm with each call saving one of the two variates generated for the next call: https://gcc.gnu.org/onlinedocs/libstdc++/latest-doxygen/a15735_source.html#l01783

David-Duffy commented 3 years ago

Just looking at Thomas, D. B., Luk. W., Leong, P. H. W., and Villasenor, J. D. 2007. Gaussian random number generators. ACM Comput. Surv. 39, 4, Article 11 (October 2007) http://doi.acm.org/10.1145/1287620.1287622

In my code, I use the ratio of uniforms (2 uniforms per single gaussian) algo, but this review suggests a Ziggurat algorithm as accurate (ie especially producing the correct number of extreme deviates), uncorrelated, and 5 times faster than Box-Muller). I have a vague memory of Fortran code for this somewhere.

David-Duffy commented 3 years ago

There is a useful discussion of implementing the ziggurat gaussian RNG for numpy here: https://github.com/numpy/numpy/issues/2047

epagone commented 3 years ago

In my code, I use the ratio of uniforms (2 uniforms per single gaussian) algo, but this review suggests a Ziggurat algorithm as accurate (ie especially producing the correct number of extreme deviates), uncorrelated, and 5 times faster than Box-Muller). I have a vague memory of Fortran code for this somewhere.

Maybe this one?

esterjo commented 3 years ago

@David-Duffy you're absolutely right. It appears Ziggurat is the fastest option for gaussians, and doesn't seem to be worst statistically from what I've been reading.

There's a paper on a generalized version to unimodal distributions with unbounded support. According to the authors performance is better than, for example, what is used by GCC's implementation of random.h or in Boost's random.h. The paper gives thorough pseudo-code too, which I appreciate.

EDIT: However, on GPU the reverse seems to hold, because "...in a GPU, the performance of the slow route will apply to all threads in a warp, even if only one thread uses the route. If the warp size is 32 and the probability of taking the slow route is 2 percent, then the probability of any warp taking the slow route is (1 - 0.02)^32, which is 47 percent! So, because of thread batching, the assumptions designed into the ziggurat are violated, and the performance advantage is destroyed."

David-Duffy commented 3 years ago

"a generalized version" - yes, I had been looking at their C++ code, but dreading how how long it would take to fully test a Fortran port. One reason Boost etc are using time-tested algorithms rather than bleeding edge,

rryley commented 3 years ago

The R documents provide a listing of the available probability distributions. Considering another, independent request for better interop with R, it might save some effort to follow the R conventions as much as possible.

Here is the list: https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Probability-distributions

Is there another discussion on what should be in the statistical methods portion of the Fortran Standard Library? The GNU Scientific Library stats section is limited. I also think the R stats package provides a good model of what should be included, but that might be too ambitious. If you have R installed, open up the REPL and type:

library(help="stats") for a list of what is available.

jvdp1 commented 3 years ago

Is there another discussion on what should be in the statistical methods portion of the Fortran Standard Library?

@rryley Different PRs were already submitted regarding probability distributions (#271, #272, #273, #26, #278, #286, ...). It seems some of them cover the R probability distributions.

peteroupc commented 3 years ago

I want to draw attention in this issue thread to algorithms for geometric and binomial variates published from 2013 through 2015, by Bringmann and colleagues as well as Farach-Colton/Tsai, which were designed especially for small p parameters (geometric) or large numbers of trials (binomial).

See my notes on samplers for both distributions.

REFERENCES: