bhermont opened 9 years ago
One update: setting HADOOP_CMD in the environment (not from R) solved the "Environment variable HADOOP_CMD must be set before loading package rhdfs" issue, but the Hadoop streaming error 15 persists.
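For anyone hitting the same load-time check, a minimal sketch of what "setting HADOOP_CMD in the environment" means here: export the variables in the parent shell before R starts, so rhdfs can see them when the package loads. The paths are copied from the script later in this thread; adjust them for your install.

```shell
# rhdfs checks HADOOP_CMD when the package is loaded, so export it in the
# parent shell before launching R (paths are from this thread's HDP setup).
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/hdp/2.2.6.0-2800/hadoop-mapreduce/hadoop-streaming.jar

# Any R process started from this shell inherits both variables, e.g.:
# Rscript myscript.r
echo "$HADOOP_CMD"
```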
On the env vars: rmr doesn't meddle with them, unless you reach for the low-level escape hatch that is backend.options — then you can do anything, and its opposite. The reason it doesn't try is that there's no reason to think any of the settings in the user environment are appropriate for the cluster environment.

As for error 15, you are the first on the whole internet to report it, so it may require additional investigation. First, simplify your setup by detaching rhdfs and rJava before the first call to mapreduce. Second, post the logs from the latest experiment; it's important to see everything, not just to know that the error is the same.

As for starting a mapreduce job from a mapper, that's beyond what is supported, not only by rmr but by mapreduce itself, now or ever: tasks are supposed to have no side effects, or only idempotent ones, because they can be retried for any reason or no reason at all. I don't know how Oozie works, but I know that much about mapreduce.
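A sketch of the first suggestion above — detaching rhdfs and rJava before the first mapreduce call, then running a trivial job to see whether error 15 still appears. The toy job is an assumption for illustration, not taken from this thread, and uses the standard rmr2 API (to.dfs, mapreduce, keyval, from.dfs):

```r
# Unload rhdfs and rJava so only rmr2 is involved in the streaming call
# (a sketch of the debugging suggestion above, not a verified fix).
detach("package:rhdfs", unload = TRUE)
detach("package:rJava", unload = TRUE)

library(rmr2)

# Trivial job to reproduce the error with the simplified setup:
out <- mapreduce(input = to.dfs(1:10),
                 map = function(k, v) keyval(v, v^2))
from.dfs(out)
```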
Hi Antonio,
I've run into a scenario where I call a mapreduce (rmr) from a shell script inside a mapper (this is how Oozie launches a Shell action).
Here is the flow: Oozie launcher job -> launcher map-only task, where the shell script (Rscript myscript.r) executes -> StreamJob -> mappers/reducers
myscript.r:

```r
Sys.setenv(JAVA_HOME = "/usr/jdk64/jdk1.7.0_67")
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_HOME = "/usr/hdp/2.2.6.0-2800/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/hdp/2.2.6.0-2800/hadoop-mapreduce/hadoop-streaming.jar")
library("rhdfs")
library("rmr2")
hdfs.init()
library(Matrix)
```