Improve Mersenne Twister performance

boostorg / compute

A C++ GPU Computing Library for OpenCL

http://boostorg.github.io/compute/

Boost Software License 1.0


Improve Mersenne Twister performance #81

Open etam opened 10 years ago

etam commented 10 years ago

On my machine, generating 1024*1024*32 random numbers takes about 10 seconds.

Here: http://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/mersenne-twister/ (complete downloads here: http://www.fixstars.com/en/opencl/book/sample/) you can find an implementation that does the same amount of work in about 0.35s.

kylelutz commented 10 years ago

Thanks for the report and the links. I'll try to find some time to take a look and update the code.

etam commented 10 years ago

Well, after digging deeper, I found that the implementation there is not the best solution. The "official" one (http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/index.html) should be used instead.

kylelutz commented 10 years ago

Interesting. We should be able to integrate their algorithm (though perhaps as another engine, mtgp_engine).

dacmot commented 7 years ago

I'm also very interested in this. On my machine (12 CPU cores + GTX 680), generating 10 MB of data with uniform_real_distribution and mersenne_twister_engine takes about 1.6 seconds, which is better than what @etam saw but still very slow. I wonder whether this has something to do with the fact that the mersenne_twister_engine code creates a second temporary vector and does two transforms instead of composing the scaling kernel. Is it even possible to compose kernels with boost::compute?
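To make the "two transforms versus a composed scaling step" idea concrete, here is a minimal host-side sketch in plain C++. It is not the Boost.Compute code and it does not run on a GPU. The function names (`scale_to_range`, `generate_two_pass`, `generate_fused`) are hypothetical, and `std::mt19937` stands in for the device engine. It only shows that the fused version produces the same values without the temporary buffer. Whether the extra pass actually explains the timings above is not established in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Map a 32-bit integer linearly onto the interval that starts at lo and has width hi - lo.
inline double scale_to_range(std::uint32_t x, double lo, double hi)
{
    return lo + (hi - lo) * (static_cast<double>(x) / 4294967296.0);
}

// Two passes: raw integers go into a temporary buffer, then a second pass scales them.
std::vector<double> generate_two_pass(std::size_t n, std::uint32_t seed, double lo, double hi)
{
    assert(lo < hi);
    std::mt19937 engine(seed);
    std::vector<std::uint32_t> raw(n); // the extra temporary buffer
    for (auto& r : raw) {
        r = static_cast<std::uint32_t>(engine());
    }
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = scale_to_range(raw[i], lo, hi);
    }
    return out;
}

// One pass: each value is scaled as soon as it is generated, so no temporary is needed.
std::vector<double> generate_fused(std::size_t n, std::uint32_t seed, double lo, double hi)
{
    assert(lo < hi);
    std::mt19937 engine(seed);
    std::vector<double> out(n);
    for (auto& v : out) {
        v = scale_to_range(static_cast<std::uint32_t>(engine()), lo, hi);
    }
    return out;
}
```

Both functions return identical sequences for the same seed. The fused one touches memory once per element instead of writing and re-reading an intermediate vector.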

Also, I was wondering what the developers' thoughts are on adding more engines. Searching for GPU random number generation, I stumbled on a few, including https://github.com/clMathLibraries/clRNG and http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-uniform.html. The MWC64X one is of particular interest to me, as I don't need an extremely long period and performance is much more important. Licenses are BSD.
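For reference, the core step of an MWC64X-style generator is small enough to sketch in a few lines. This is a host-side C++ illustration, not an engine from Boost.Compute or clRNG. The struct and function names are hypothetical, and the multiplier is the value I recall from the published MWC64X description, so treat it as an assumption and check it against the linked page before relying on it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical state layout for this sketch: x is the current value, c is the carry.
struct Mwc64xState {
    std::uint32_t x;
    std::uint32_t c;
};

// Multiplier as recalled from the MWC64X description (assumption, verify before use).
constexpr std::uint64_t kMwc64xA = 4294883355ULL;

// Return one 32-bit output and advance the state by one multiply-with-carry step.
inline std::uint32_t mwc64x_next(Mwc64xState& s)
{
    // The all-zero state maps to itself, so it must never be used as a seed.
    assert(s.x != 0 || s.c != 0);
    const std::uint32_t result = s.x ^ s.c;          // output mixes value and carry
    const std::uint64_t t = kMwc64xA * s.x + s.c;    // cannot overflow 64 bits
    s.x = static_cast<std::uint32_t>(t);             // low half becomes the new value
    s.c = static_cast<std::uint32_t>(t >> 32);       // high half becomes the new carry
    return result;
}
```

Each work-item would keep its own 64-bit state, which is part of why this kind of generator is attractive when speed matters more than period length. Giving each work-item a non-overlapping stream needs a skip-ahead step that this sketch leaves out.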

jszuppe commented 7 years ago

I'd recommend improving the current Philox implementation. Right now in Boost.Compute it's designed badly and has poor performance. It can be improved to achieve 200 - 350 GB/s (50 - 90 GSamples/s) on modern top GPUs (depending on the GPU and its architecture). It's a really simple RNG. You can also implement XORWOW, which should achieve similar or higher performance.

dacmot commented 7 years ago

There's a Philox implementation? I only see ThreeFry and a linear congruential engine along with the MT engine. Also, ThreeFry is not in the API overview and doesn't compile when used with uniform_real_distribution, since its generate() method doesn't take a scaling kernel.

jszuppe commented 7 years ago

Oh, sorry, my mistake, it is indeed ThreeFry. Nonetheless, it's fast. I think adding Philox to Boost.Compute is the best option for getting a fast random number generator, and it should not be that hard. Unfortunately, I haven't had enough free time lately to do it.