RevolutionAnalytics / rmr2

A package that allows R developers to use Hadoop MapReduce

HADOOP_CMD getting lost... #173

Open bhermont opened 9 years ago

bhermont commented 9 years ago

Hi Antonio,

I've run into a scenario where I call a mapreduce (rmr) job from a shell script inside a Mapper (this is how Oozie launches a Shell action).

Here is the flow: Oozie Launcher Job -> Launcher Map-only task, where the shell script (`Rscript myscript.r`) executes -> StreamJob -> Mappers / Reducers

myscript.r

```
Sys.setenv(JAVA_HOME="/usr/jdk64/jdk1.7.0_67")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_HOME="/usr/hdp/2.2.6.0-2800/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/hdp/2.2.6.0-2800/hadoop-mapreduce/hadoop-streaming.jar")
library("rhdfs")
library("rmr2")
hdfs.init()
library(Matrix)
```
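For reference, a minimal sketch of the scope of `Sys.setenv()`, using the same path as in myscript.r: it alters only the current R process's environment, which a locally spawned child inherits, but tasks Hadoop launches on other nodes do not, which would explain why the StreamJob mapper (log below) cannot see HADOOP_CMD.

```
# Sys.setenv() only alters this R process's environment.
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")   # same path as in myscript.r
Sys.getenv("HADOOP_CMD")                     # "/usr/bin/hadoop" in this process
system("echo $HADOOP_CMD")                   # inherited by a local child shell,
                                             # but not by remote Hadoop tasks
```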

## Logs

Launcher Job (the job that launches the Map-only task): on this map, the shell script is executed as a system call (`Rscript myscript.r`). Here HADOOP_CMD is set correctly, but an error from the mr function is logged (probably due to the Streaming Mapper error when an hdfs function is called from inside the mr function).

Launcher (Mapper) log:

```
Loading required package: methods
Loading required package: rJava
HADOOP_CMD=/usr/bin/hadoop
Be sure to run hdfs.init()
Please review your hadoop settings. See help(hadoop.settings)
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce,  :
  hadoop streaming failed with error code 15
Calls: getTags -> mapreduce -> mr
```

The rmr streaming starts a StreamJob, which can have Mappers and Reducers.

StreamJob (Mapper) log:

```
Log Type: stderr
Log Upload Time: Thu Aug 06 19:24:41 -0400 2015
Log Length: 2722
Loading objects:
Loading objects:
  backend.parameters
  combine
Please review your hadoop settings. See help(hadoop.settings)
  combine.file
  combine.line
  debug
  default.input.format
  default.output.format
  in.folder
  in.memory.combine
  input.format
  libs
  map
  map.file
  map.line
  out.folder
  output.format
  pkg.opts
  postamble
  preamble
  profile.nodes
  reduce
  reduce.file
  reduce.line
  rmr.global.env
  rmr.local.env
  save.env
  tempfile
  vectorized.reduce
  verbose
  work.dir
Loading required package: methods
Loading required package: rJava
Loading required package: rhdfs
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
  call: fun(libname, pkgname)
  error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Warning in FUN(X[[i]], ...) : can't load rhdfs
Loading required package: rmr2
Loading required package: Matrix
```

However, calling myscript.r from the command line works fine. Here is my question: should rmr propagate the environment variables in this case, or should it be the responsibility of the environment to provide the value of the HADOOP_CMD variable?
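For illustration, one direction I could imagine is to stop relying on propagation altogether: capture the needed value in the map function's closure and re-establish it inside the task. This is an untested sketch; it assumes rmr2 serializes the map function together with its enclosing environment (which the rmr.local.env / rmr.global.env entries in the log above suggest), and make_map and the input path are hypothetical names for illustration.

```
library(rmr2)

# Hypothetical closure factory: bake the needed value into the map function's
# environment so it travels with the serialized closure, rather than relying
# on the task inheriting the driver's environment.
make_map <- function(hadoop_cmd) {
  force(hadoop_cmd)
  function(k, v) {
    Sys.setenv(HADOOP_CMD = hadoop_cmd)  # re-create the variable inside the task
    keyval(k, v)                         # identity map, for illustration only
  }
}

# mapreduce(input = "/some/input",                       # hypothetical path
#           map   = make_map(Sys.getenv("HADOOP_CMD")))
```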
bhermont commented 9 years ago

One update: setting HADOOP_CMD in the environment (not from R) solved the "Environment variable HADOOP_CMD must be set before loading package rhdfs" issue, but the hadoop streaming error 15 persists.
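Since rhdfs checks the variable in .onLoad (see the StreamJob log above), a small guard like this sketch would at least make the missing export fail fast and visibly in the launcher log:

```
# Fail fast if the caller's environment did not provide HADOOP_CMD,
# since rhdfs requires it at load time.
if (!nzchar(Sys.getenv("HADOOP_CMD"))) {
  stop("HADOOP_CMD is not set; export it in the shell before starting R")
}
library(rhdfs)
hdfs.init()
```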

piccolbo commented 9 years ago

On the env vars: rmr doesn't meddle with them, unless you reach for the low-level escape hatch that is backend.options, in which case you can do anything and its opposite. The reason it doesn't try is that there is no reason to think that any of the settings in the user environment are appropriate for the cluster environment.

As for error 15, you are the first on the whole internet to report it, so it may require additional investigation. First, simplify your setup by detaching rhdfs and rJava before the first call to mapreduce. Second, show the logs from that latest experiment; it's important to see everything, not just to know that the error is the same.

As for starting a mapreduce job from a mapper, that's really beyond what is supported, not only by rmr but also by mapreduce itself, now or ever: tasks are supposed to have no side effects, or only idempotent ones, because they can be retried for any or no reason. I don't know how Oozie works, but I know that much about mapreduce.
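To make those two suggestions concrete, here is a sketch. The detach() calls are standard R; the escape hatch appears as the backend.parameters argument in the loaded objects in your log, and the rendering of cmdenv into hadoop streaming's -cmdenv flag is my assumption about how that list is passed through, so treat it as a starting point rather than a recipe.

```
# 1) Simplify: unload rhdfs first (it depends on rJava), then rJava,
#    before the first call to mapreduce.
detach("package:rhdfs", unload = TRUE)
detach("package:rJava", unload = TRUE)

# 2) Escape hatch (assumption: entries under $hadoop become extra flags on the
#    hadoop streaming command line, so cmdenv becomes -cmdenv NAME=value,
#    setting the variable inside each streaming task):
mapreduce(
  input = "/some/input",                         # hypothetical path
  map   = function(k, v) keyval(k, v),
  backend.parameters = list(
    hadoop = list(cmdenv = "HADOOP_CMD=/usr/bin/hadoop")
  )
)
```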