azmfaridee / mothur

This is GSoC2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into mother's main codebase.
https://github.com/mothur/mothur
GNU General Public License v3.0
3 stars 1 forks source link

Choosing a Good Random Number Generator (RNG) a.k.a "Treading Into the Mucky Waters of RNG" #18

Open azmfaridee opened 12 years ago

azmfaridee commented 12 years ago

@mothur-westcott Our Regularized Random Forest Algorithm relies heavily on the use of RNG, I was wondering how good would be the default C++ RNG rand() function with respect to our needs. Most people say that this one is good overall, but for an application that relies heavily on RNG, it's not that good, as it's prone to repetitiveness; a matter can have adverse affect on our algorithm implementation.

Here is a good article telling about default rand() function and it's probable pitfalls: http://www.eternallyconfuzzled.com/arts/jsw_art_rand.aspx

The best Pseudo RNG available there is probably Mersenne Twister. While this is by default provided with python, php or other high level languages, this is not the case for C++. If we need this, we might as well look into libraries like mtwist or boost.

Btw, I guess this is a common problem, are there any other commands in mothur that relies heavily on a RNG? If so, how did you solve that problem?

mothur-westcott commented 12 years ago

We do not want to add special libraries to mothur because it requires our users to have the libraries installed to run mothur and we don't want to add more dependencies. We have several commands that require randomization and thus far have relied on rand(). The mothurOut class has a random number generator function. It relies on rand(), and uses RAND_MAX() to get a random number over a range. Not the greatest, but it has worked well enough for our uses so far.

//code in mothurOut
int random = (int) ((float)(highest+1) * (float)(rand()) / ((float)RAND_MAX+1.0));

The article talked about a seeded random number, you could try implementing that and looking at the distributions. If they look better than the mothurOut's random number generation, then you could use that. Or if you have the time, you could look at implementing the Mersenne Twister algorithm in mothur. What are your thoughts?

kdiverson commented 12 years ago

Looking at the pseudocode on the wiki page it doesn't look too hard to implement, at least for generating ints. Since I know you don't have a lot of time, why don't you just used the rand() function or the one in mothurOut for now and if there's time add in the MT algorithm later. It could be useful for other projects going forward. In my limited usage of rand() I haven't had any problems with it but it's good to keep these things in mind.

azmfaridee commented 12 years ago

The problem with RNG is that it's difficult to judge them under normal circumstances, because technically they always seem to work, but not sufficiently good enough. Even an RNG that does not have a sufficiently high period or uniformity will give results of our algorithm but but just not good enough.

I'm not 100% sure but I think glibc uses LCG (Linear congruential generator) for PRNG (generating Pseudo Random Number Generator)

Here are the source code for glib implementation of RNG

http://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/rand.c;hb=HEAD http://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/random.c;hb=HEAD http://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/random_r.c;hb=HEAD http://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/rand_r.c;hb=HEAD

LCG is quite good but there are some pitfalls too. Excerpts taken from the link http://en.wikipedia.org/wiki/Linear_congruential_generator#Advantages_and_disadvantages_of_LCGs says something like this:

LCGs tend to exhibit some severe defects. For instance, if an LCG is used to choose points in an n-dimensional space, the points will lie on, at most, m1/n hyperplanes (Marsaglia's Theorem, developed by George Marsaglia). This is due to serial correlation between successive values of the sequence Xn. The spectral test, which is a simple test of an LCG's quality, is based on this fact.

A further problem of LCGs is that the lower-order bits of the generated sequence have a far shorter period than the sequence as a whole if m is set to a power of 2. In general, the nth least significant digit in the base b representation of the output sequence, where bk = m for some integer k, repeats with at most period bn.

Our work is also dealing with multidimensional data (each feature can be thought of as a dimension) and we might need a good RNG with uniformity to make sure that all the features are given exact amount of important when selecting them using random index.

So it might be good idea to implement a better algorithm like Mersenne Twister or WELL. And I also understand the fact that we do not want to induce dependency to external libraries, my intention is just using them as a reference implementation. Right now as I'm working in python which generates random numbers using Mersenne Twister, it might be interesting to see the results when we port this to C++.

As @kdiverson as already pointed out, out first priority is to get the algorithm up and running. So I'd focus on that for now. But I think a good RNG would be needed in the long run.

azmfaridee commented 12 years ago

I came across these two links

The fist link is a drop-in replacement of Java's Random() The second is a drop-in replacement of C++'s rand(). This two can serve as good reference implementations for us if we choose to implement a PSRNG ourselves in the long run.

mothur-westcott commented 12 years ago

Good find, :)