Open piccolbo opened 12 years ago
One thing that matters for making parallelization easier in a language like R is parallel random number generation. If we do a
sapply(1:100, function(i) rnorm(1))
we have certain guarantees about the distribution of the vector thus created. But if you do a
mapreduce(to.dfs(1:100), map = function(k, v) keyval(NULL, rnorm(1)))
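those guarantees no longer come for free, since the draws happen in separate tasks, each with its own generator state. We need to switch to parallel random number generation, maybe transparently to the user, maybe as an easy switch. Would unique seeding per task attempt do the trick?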
See also http://web.archive.org/web/20100530123745/http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/ and http://cran.r-project.org/web/packages/doRNG/vignettes/doRNG.pdf
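To make the question concrete, here is a minimal sketch of one possible approach, using only base R's `parallel` package and run locally rather than through `mapreduce`. It derives one L'Ecuyer-CMRG stream per task from a single seed with `nextRNGStream`. The helper names `make_streams` and `run_task` are hypothetical, and how a real map task would learn its task index is left open; this is an illustration of the idea, not a proposed implementation for the package.

```r
library(parallel)

# Hypothetical helper: derive one independent RNG stream per task from one seed.
# Note: this switches the session's RNG kind to L'Ecuyer-CMRG.
make_streams <- function(n_tasks, seed = 42) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  s <- .Random.seed
  streams <- vector("list", n_tasks)
  for (i in seq_len(n_tasks)) {
    streams[[i]] <- s
    s <- nextRNGStream(s)  # jump ahead to the next non-overlapping stream
  }
  streams
}

# Hypothetical stand-in for the body of a map task: install the task's
# stream as the current generator state, then draw.
run_task <- function(stream, n = 1) {
  assign(".Random.seed", stream, envir = globalenv())
  rnorm(n)
}

streams <- make_streams(100)
x <- vapply(streams, run_task, numeric(1))  # one draw per simulated task
y <- vapply(streams, run_task, numeric(1))  # re-running the tasks gives the same draws
```

Keying the stream on the task rather than on the task attempt means a retried or speculative attempt reproduces the same values, which may be preferable to a fresh seed per attempt if reproducibility matters.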