RevolutionAnalytics / RHadoop

RHadoop
https://github.com/RevolutionAnalytics/RHadoop/wiki

parallel random number generation #99

Open piccolbo opened 12 years ago

piccolbo commented 12 years ago

Parallel random number generation is an important part of making parallelization easier in a language like R. If we run

sapply(1:100, function(i) rnorm(1))

all 100 draws come from a single generator stream, so we have certain guarantees about the distribution of the resulting vector. But if we run

mapreduce(to.dfs(1:100), map = function(k, v) keyval(NULL, rnorm(1)))

the draws happen in separate tasks, each with its own generator state, and those guarantees no longer hold automatically. We need to switch to parallel random number generation, either transparently to the user or through an easy switch. Would unique seeding per task attempt do the trick?
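For comparison with per-task seeding, here is a minimal sketch of the stream-based approach using base R's `parallel` package, where each task gets its own L'Ecuyer-CMRG stream. The helper names `make_task_streams` and `draw_in_task` are hypothetical and not part of rmr, and the tasks are simulated locally with `vapply` rather than run through `mapreduce`; in a real job the stream would have to be shipped to each task somehow, which is an assumption this sketch does not address.

```r
library(parallel)

# Hypothetical helper: derive one L'Ecuyer-CMRG stream per task from a
# single master seed. Note that this switches the session's RNG kind.
make_task_streams <- function(n_tasks, seed = 42L) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  streams <- vector("list", n_tasks)
  s <- .Random.seed
  for (i in seq_len(n_tasks)) {
    streams[[i]] <- s
    s <- nextRNGStream(s)  # advance to the next independent stream
  }
  streams
}

# Hypothetical helper: what a single task would do with its stream.
draw_in_task <- function(stream, n = 1) {
  # Install the task's stream as the current RNG state, then draw.
  assign(".Random.seed", stream, envir = globalenv())
  rnorm(n)
}

# Simulate 100 tasks locally, one draw per task.
streams <- make_task_streams(100)
x <- vapply(streams, draw_in_task, numeric(1))
```

Because the streams are derived deterministically from the master seed, rerunning with the same seed reproduces the same draws no matter which task runs first or how often an attempt is retried.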

piccolbo commented 12 years ago

See also:
http://web.archive.org/web/20100530123745/http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/
http://cran.r-project.org/web/packages/doRNG/vignettes/doRNG.pdf