RevolutionAnalytics / rmr2

A package that allows R developers to use Hadoop MapReduce

About error: pipeMapRed.waitOutputThreads and additional so file #164

Open ghost opened 9 years ago

ghost commented 9 years ago

When I ran the mapreduce function in rmr2, I encountered the error pipeMapRed.waitOutputThreads(): subprocess failed with code 127. My environment is Linux Mint 17.1 Rebecca, Hadoop 2.6.0 with a localhost setup, R 3.1.3 compiled by myself with Intel MKL and the Intel C/C++ compiler, and Oracle Java 1.8.40.

I dug into this error and found that the cause is shared libraries on the system not being loaded correctly. The same R code runs successfully when I use the original streaming files and ship the libraries with the hadoop command directly: `hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -files mapper.R,reducer.R,/opt/intel/composer_xe_2013_sp1/compiler/lib/intel64/libiomp5.so,/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so -mapper "mapper.R -m" -reducer "reducer.R -r" -input /user/hadoop/testData/* -output /user/hadoop/testData2-output`
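For anyone hitting the same symptom, a rough way to check it: exit status 127 is what a POSIX shell reports when a command cannot be found, and on Linux the dynamic loader typically exits with the same status when a required shared library cannot be loaded. The sketch below first reproduces the status with a deliberately nonexistent command (`no_such_command_xyz` is a made-up name), then lists unresolved libraries with `ldd`. It assumes R is on the PATH and that the real binary sits at `$(R RHOME)/bin/exec/R`, which may differ on other installs.

```shell
# Reproduce exit status 127 with a command that does not exist.
rc=0
sh -c 'no_such_command_xyz' 2>/dev/null || rc=$?
echo "exit status: $rc"

# Assumption: R is on the PATH and its binary is at $(R RHOME)/bin/exec/R.
# List shared libraries the loader cannot resolve for that binary.
if command -v R >/dev/null 2>&1; then
  r_bin="$(R RHOME)/bin/exec/R"
  if [ -x "$r_bin" ]; then
    ldd "$r_bin" | grep "not found" || echo "all shared libraries resolved"
  fi
fi
```

Note that the Hadoop task may run with a different library search path than an interactive shell, so a clean result here does not rule out the problem inside a streaming job.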

I tried adding `backend.parameters = list(hadoop = list(files = "/opt/intel/composer_xe_2013_sp1/compiler/lib/intel64/libiomp5.so", files = "/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so"))` to the mapreduce call, but that produced a different error. I speculate that this is because Hadoop streaming does not accept two or more -files options.

Therefore, I modified the original file R/streaming.R in the package before building it. I changed the files parameter in final.command to: `files = paste(collapse = ",", c(image.files, map.file, reduce.file, combine.file, "/opt/intel/composer_xe_2013_sp1/compiler/lib/intel64/libiomp5.so", "/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so"))`
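The same idea in shell form, as a minimal sketch: a single -files option takes one comma-separated list, which is what `paste(collapse = ",")` builds inside rmr2. The helper name `join_files` is hypothetical; the paths are the ones from this report.

```shell
# Join all arguments with commas, the form a single -files option expects.
join_files() {
  out=""
  for f in "$@"; do
    if [ -z "$out" ]; then
      out="$f"
    else
      out="$out,$f"
    fi
  done
  printf '%s\n' "$out"
}

files_arg=$(join_files mapper.R reducer.R \
  /opt/intel/composer_xe_2013_sp1/compiler/lib/intel64/libiomp5.so \
  /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so)

# Print the option as it would appear on the hadoop command line.
echo "-files $files_arg"
```

Appending to one list avoids passing -files twice, which is the situation suspected of causing the second error.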

This fixed the error pipeMapRed.waitOutputThreads(): subprocess failed with code 127. Would it be possible to add a new parameter to rmr2 that lets the user add files to this list? Or is there another way to solve this problem by changing the Hadoop environment?

piccolbo commented 9 years ago

There are some limitations related to the specific order of options that may be a problem here. In short, backend.parameters is safe for generic options such as -D, which is the one used most often. -files is not generic, so it needs to be in a certain order with respect to the generic ones, and there's only so much rmr2 can do to order them right without embedding the full knowledge of what is generic, plus a complete refactor of how the command line is put together right now (one would have to delay conversion to a string until the command line is fully specified). It's quite a bit of development and added, permanent complexity for a very specialized use case. The other thing is that -files is already used and it does accept a list of files, which suggests that specifying it twice may not be acceptable, but I am not 100% sure. If that's the case, allowing the user to specify additional -files arguments would require an even deeper refactor.