JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.92k stars 5.49k forks source link

improve RNG #67

Closed JeffBezanson closed 13 years ago

JeffBezanson commented 13 years ago

We already use mersenne twister, but SFMT/dSFMT is supposed to be faster:

http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html

We also need better seeding.

The work items are

StefanKarpinski commented 13 years ago

For distributed simulations, it seems to me that what you want is to be able to deterministically parameterize an entire run by a single seed value. So for parallel computing, we'd want to ensure that you can do that. I'm not sure what the best way to do that is, however.

JeffBezanson commented 13 years ago

Especially interesting in the case of elastic parallelism!

ViralBShah commented 13 years ago

How would we do parallel number generation? Just use different seeds to start off, and thus use different parts of the sequence of an RNG. The issue with some of these methods is that the series of numbers generated change with a change in the number of processors - not to mention elastic parallelism.

ViralBShah commented 13 years ago

Based on the thread on the mailing list, I don't think it is a good idea to seed an RNG from /dev/urandom as the default. It should be a seeding option. Also, it would be nice to have an RNG to generate random numbers from /dev/urandom altogether.

Random numbers quality testsuite: http://www.iro.umontreal.ca/~simardr/testu01/tu01.html

StefanKarpinski commented 13 years ago

There are two use-cases for random number generation:

  1. I need a deterministic, reproducible, seed-based series of pseudorandom numbers, e.g. for simulation.
  2. I want actual random behavior but have to settle for pseudorandom behavior with a good random seed.

In the former case, you need to explicitly seed the your random number generator so what the generator automatically seeds itself with is irrelevant. In the latter case, you want the default to be to use a good source of entropy — like /dev/urandom — rather than some crappy default like using the current time.

Do you have some better source of entropy that you have in mind for automatically seeding the random number generator when an explicit seed isn't given? Or are you objecting to automatically seeding the generator at all? Personally, if I just want to generate a bunch of random numbers in an ad hoc manner, I really don't want to be bothered to generate a seed — I just want to use it.

ViralBShah commented 13 years ago

In no system (C, matlab, etc.) is a random number seeded with a random seed of any type by default.

-viral

On Jun 22, 2011, at 12:48 PM, StefanKarpinski wrote:

There are two use-cases for random number generation:

  1. I need a deterministic, reproducible, seed-based series of pseudorandom numbers, e.g. for simulation.
  2. I want actual random behavior but have to settle for pseudorandom behavior with a good random seed.

In the former case, you need to explicitly seed the your random number generator so what the generator automatically seeds itself with is irrelevant. In the latter case, you want the default to be to use a good source of entropy — like /dev/urandom — rather than some crappy default like using the current time.

Do you have some better source of entropy that you have in mind for automatically seeding the random number generator when an explicit seed isn't given? Or are you objecting to automatically seeding the generator at all? Personally, if I just want to generate a bunch of random numbers in an ad hoc manner, I really don't want to be bothered to generate a seed — I just want to use it.

Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/67#issuecomment-1416254

ViralBShah commented 13 years ago

The counter-argument to the "I just want to use it" here is that "I just don't want to be surprised". Seeding an RNG by default has the property that it surprises you. Even saving the random seed automatically is not good enough, because the same .j file will run differently on a different computer where the saved seed is not available.

-viral

On Jun 22, 2011, at 12:48 PM, StefanKarpinski wrote:

There are two use-cases for random number generation:

  1. I need a deterministic, reproducible, seed-based series of pseudorandom numbers, e.g. for simulation.
  2. I want actual random behavior but have to settle for pseudorandom behavior with a good random seed.

In the former case, you need to explicitly seed the your random number generator so what the generator automatically seeds itself with is irrelevant. In the latter case, you want the default to be to use a good source of entropy — like /dev/urandom — rather than some crappy default like using the current time.

Do you have some better source of entropy that you have in mind for automatically seeding the random number generator when an explicit seed isn't given? Or are you objecting to automatically seeding the generator at all? Personally, if I just want to generate a bunch of random numbers in an ad hoc manner, I really don't want to be bothered to generate a seed — I just want to use it.

Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/67#issuecomment-1416254

StefanKarpinski commented 13 years ago

In no system (C, matlab, etc.) is a random number seeded with a random seed of any type by default.

Maybe not in C or Matlab (I checked this), but consider:

$ perl -le 'print rand'
0.481756120528193
$ perl -le 'print rand'
0.305858596301736
$ ruby -le 'print rand'
0.533189391261214
$ ruby -le 'print rand'
0.222633436955373
$ python -c 'import random; print random.random()'
0.177964855511
bash-3.2$ python -c 'import random; print random.random()'
0.191501075101
bash-3.2$ 

All three of the most user-friendly mainstream languages around automatically seed your RNG for you. So the question here is do we emulate C and Matlab or Python, Perl and Ruby. The choice seems pretty clear to me. Honestly, the only viable well-behaved options are:

The latter seems unnecessarily pedantic and annoying. How does seeding an RNG by default cause surprise?

ViralBShah commented 13 years ago

Ok, I didn't know about perl, ruby and python. Neither of these are serious numerical languages, but at least there is precedent.

Seeding an RNG by default differently on every run is surprising because it generates slightly different solutions every time. So, to get predictable answers, I now have to set a seed. When developing, one wants to know if the variations are due to changes made in the code, and possibly a bug, or due to a different stream of random numbers.

-viral

On Jun 23, 2011, at 12:27 AM, StefanKarpinski wrote:

In no system (C, matlab, etc.) is a random number seeded with a random seed of any type by default.

Maybe not in C or Matlab (I checked this), but consider:

$ perl -le 'print rand'
0.481756120528193
$ perl -le 'print rand'
0.305858596301736
$ ruby -le 'print rand'
0.533189391261214
$ ruby -le 'print rand'
0.222633436955373
$ python -c 'import random; print random.random()'
0.177964855511
bash-3.2$ python -c 'import random; print random.random()'
0.191501075101
bash-3.2$ 

So the question here is do we emulate C and Matlab or Python, Perl and Ruby. The choice seems pretty clear to me. Honestly, the only viable well-behaved options are:

  • Automatically provide a good, random seed (i.e. use /dev/urandom)
  • Throw an error if random numbers are used without having seeded first

The latter seems unnecessarily pedantic and annoying. How does seeding an RNG by default cause surprise?

Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/67#issuecomment-1420235

JeffBezanson commented 13 years ago

You're right that predictability is needed for debugging, but that can never be achieved automatically --- to reproduce a run you still have to either re-seed manually or restart the system. If you're modifying your code and re-running it without re-seeding, you're going to get different answers every time in every system, unless you restart matlab between runs, which people don't do.

If you start N julia processes to do a simulation, and they all start with the same seed, you'll be doing the exact same thing on each processor without realizing it. It might take a while to notice this. It seems this would hurt the naive user. The expert user will manage their own seeding no matter what we do.

ViralBShah commented 13 years ago

Are you referring to N julia processes started on the command line, or through julia. If they are part of a "Julia parallel machine" or some such thing, then julia will do the right thing. But, yes, if you are just running a bunch of disconnected julias, then one would have to seed.

I am getting a little convinced.

-viral

On Jun 23, 2011, at 7:34 AM, JeffBezanson wrote:

You're right that predictability is needed for debugging, but that can never be achieved automatically --- to reproduce a run you still have to either re-seed manually or restart the system. If you're modifying your code and re-running it without re-seeding, you're going to get different answers every time in every system, unless you restart matlab between runs, which people don't do.

If you start N julia processes to do a simulation, and they all start with the same seed, you'll be doing the exact same thing on each processor without realizing it. It might take a while to notice this. It seems this would hurt the naive user. The expert user will manage their own seeding no matter what we do.

Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/67#issuecomment-1422539

ViralBShah commented 13 years ago

Any idea what R does?

-viral

On Jun 23, 2011, at 12:27 AM, StefanKarpinski wrote:

In no system (C, matlab, etc.) is a random number seeded with a random seed of any type by default.

Maybe not in C or Matlab (I checked this), but consider:

$ perl -le 'print rand'
0.481756120528193
$ perl -le 'print rand'
0.305858596301736
$ ruby -le 'print rand'
0.533189391261214
$ ruby -le 'print rand'
0.222633436955373
$ python -c 'import random; print random.random()'
0.177964855511
bash-3.2$ python -c 'import random; print random.random()'
0.191501075101
bash-3.2$ 

So the question here is do we emulate C and Matlab or Python, Perl and Ruby. The choice seems pretty clear to me. Honestly, the only viable well-behaved options are:

  • Automatically provide a good, random seed (i.e. use /dev/urandom)
  • Throw an error if random numbers are used without having seeded first

The latter seems unnecessarily pedantic and annoying. How does seeding an RNG by default cause surprise?

Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/67#issuecomment-1420235

JeffBezanson commented 13 years ago

Copying R here is a good idea. Stats people swear by it. Then again, R doesn't have the parallelism issue.

ViralBShah commented 13 years ago

Yes, that's what I thought too - but just a useful data point.

-viral

On Jun 23, 2011, at 7:44 AM, JeffBezanson wrote:

Copying R here is a good idea. Stats people swear by it. Then again, R doesn't have the parallelism issue.

Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/67#issuecomment-1422575

ViralBShah commented 13 years ago

How portable is /dev/urandom. I believe all commonly used unix-y systems should have it today. Not sure what Android/iOS and embedded systems are like, but we have much more to worry about to target those platforms..

-viral

On Jun 23, 2011, at 7:44 AM, JeffBezanson wrote:

Copying R here is a good idea. Stats people swear by it. Then again, R doesn't have the parallelism issue.

Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/67#issuecomment-1422575

ViralBShah commented 13 years ago

Well, this page tells us all. Apparently there is also the Entropy Gathering Daemon for computers without /dev/urandom. EGD Sounds like something that might end the universe or something.

http://en.wikipedia.org/wiki//dev/random

I really think we ought to build the Testu01 testsuite into julia.

-viral

On Jun 23, 2011, at 7:44 AM, JeffBezanson wrote:

Copying R here is a good idea. Stats people swear by it. Then again, R doesn't have the parallelism issue.

Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/67#issuecomment-1422575

StefanKarpinski commented 13 years ago

Bam:

$ echo 'runif(1)' | R
> runif(1)
[1] 0.9168643
> 
$ echo 'runif(1)' | R
> runif(1)
[1] 0.3805377
> 

So R does the same thing as Python, Perl, and Ruby.

StefanKarpinski commented 13 years ago

/dev/urandom looks pretty damned universal. But that's not really the point: the point is to automatically choose use the best practices instead of making the naive user do something non-default in order to adhere to best practices.

I'm not sure how useful testu01 really would be: it's a test of algorithms, not of implementations as such. We could, however, run our implementation through it since if there's a defect in the implementation, it would probably detect that defect as a deviation from apparent randomness. If it can't detect the defect, one could argue that it's not really a defect since the resulting RNG still appears random.

StefanKarpinski commented 13 years ago

One interesting thing I thought about (based on that best practices in bioinformatics paper I sent out) was combining one of the cheap, simple non-linear RNGs like the JKISS one described with MT to get something that has the same period as MT, is slightly more expensive, but is non-linear. But I'm a complete amateur in this area, so this might be an incredibly dumb idea.

StefanKarpinski commented 13 years ago

Apparently there is also the Entropy Gathering Daemon for computers without /dev/urandom. EGD Sounds like something that might end the universe or something.

Entropy Gathering Daemon does sound incredibly ominous. I feel like that should be the evil monster in a kids' book.

ViralBShah commented 13 years ago

The device is universal, but the implementations differ a fair bit across OSes as per the Wikipedia page. Also, people have shown quality of /dev/random to be poor on some platforms. We could go with it anyways, knowing that it is still likely to be better than other options.

-viral

On Jun 23, 2011, at 8:58 AM, StefanKarpinski wrote:

/dev/urandom looks pretty damned universal. But that's not really the point: the point is to automatically choose use the best practices instead of making the naive user do something non-default in order to adhere to best practices.

I'm not sure how useful testu01 really would be: it's a test of algorithms, not of implementations as such. We could, however, run our implementation through it since if there's a defect in the implementation, it would probably detect that defect as a deviation from apparent randomness. If it can't detect the defect, one could argue that it's not really a defect since the resulting RNG still appears random.

Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/67#issuecomment-1422789

StefanKarpinski commented 13 years ago

We can implement the best practice on each platform. AFAIK, /dev/urandom is fine on both Linux and OS X. On Windows, we'll use that system random bits API and we can use EGD elsewhere or whatever.

ViralBShah commented 13 years ago

The RNG is also used by flisp. Is it ok to remove all the random stuff from flisp? I can then move it all out of support and into external. My proposal is to have external/random, which includes mt19937ar.c, mt19937-64.c, dsfmt-2.1, and our wrappers that are in random.c. All of these can then be compiled into librandom.so. rand_double() and rand_float() then will get called through ccall.

We can then experiment with various ways to generate random numbers.

If rand_double() and rand_float() are called through ccall, rather than as builtins, will this greatly affect performance? It seems that this library approach is much cleaner.

JeffBezanson commented 13 years ago

rand_double and rand_float are already called through ccall. This is the best approach, because the compiler can inline the ccall into callers of rand() and then there's no overhead at all.

I like your approach, let's do it. It's fine to remove the random stuff from flisp, I'll do that part.

ViralBShah commented 13 years ago

commit 24fc7ed9004549cadabce48de2a9703f28f770fb enables the use of DSFMT. Closing this one. A new issue can be opened later when we want to use /dev/urandom and save the seed.

JeffBezanson commented 13 years ago

Now we don't have the ability to seed with a source of entropy anymore. Maybe using the 64-bit system time isn't great, but at least it's something. At the very least we need to expose dSFMT's functions for seeding with more bits than 32. Also I see that mt19937ar.c is still there; is it used for anything? And what is mt19937-64?

ViralBShah commented 13 years ago

All still in transition phase. Will get rid of mt19937ar.c and mt19937-64 (64-bit version of mt19937) once everything works. Just need to expose a couple more dSFMT interfaces. Will try to implement /dev/urandom as well.

-viral

On Jul 9, 2011, at 2:54 AM, JeffBezanson wrote:

Now we don't have the ability to seed with a source of entropy anymore. Maybe using the 64-bit system time isn't great, but at least it's something. At the very least we need to expose dSFMT's functions for seeding with more bits than 32. Also I see that mt19937ar.c is still there; is it used for anything? And what is mt19937-64?

Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/67#issuecomment-1535519

ViralBShah commented 13 years ago

All of this is done in commit c4aa6c772b4840ce16ecd0bec300888efd558338

-viral

On Jul 9, 2011, at 2:54 AM, JeffBezanson wrote:

Now we don't have the ability to seed with a source of entropy anymore. Maybe using the 64-bit system time isn't great, but at least it's something. At the very least we need to expose dSFMT's functions for seeding with more bits than 32. Also I see that mt19937ar.c is still there; is it used for anything? And what is mt19937-64?

Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/67#issuecomment-1535519