RevolutionAnalytics / rmr2

A package that allows R developers to use Hadoop MapReduce

ERROR streaming.StreamJob: Unrecognized option: -files #163

Closed by DMinedConsulting 9 years ago

DMinedConsulting commented 9 years ago

Hi,

I am going through the k-means clustering example on the 'example' page. When I run the final 'kmeans.mr()' part, I get the following error from Hadoop Streaming:

15/04/07 11:40:54 ERROR streaming.StreamJob: Unrecognized option: -files
Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
Options:
  -input DFS input file(s) for the Map step.
  (... Bunch of Options ...)
  -info Optional. Print detailed usage.
  -help Optional. Print help message.

Generic options supported are
  -conf specify an application configuration file
  -D use value for given property
  -fs <local|namenode:port> specify a namenode
  -jt <local|jobtracker:port> specify a job tracker
  -files specify comma separated files to be copied to the map reduce cluster
  -libjars specify comma separated jar files to include in the classpath.
  -archives specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

For more details about these options:
Use $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar -info
Try -help for more information
Streaming Command Failed!


Session Info:
R version 3.1.2 (2014-10-31)
Package Version: rmr2_3.3.0
Hadoop Streaming Jar: hadoop-streaming-2.5.0-cdh5.2.1.jar


Any idea on where to look to solve this issue?

Thank you,

piccolbo commented 9 years ago

Did you set the backend.parameters option? This looks like a malformed command line when invoking Hadoop Streaming. Sometimes people point to the wrong jar file; yours looks fine, at least from its name. Sometimes people set the backend.parameters option incorrectly. It must be said that streaming is pretty finicky about options: they must be in a certain order, if I remember correctly generic options first, then the rest. So users may specify something absolutely innocuous-looking, but once the command line is put together from all its different parts, the order of the options is not the correct one.

If you are comfortable with debugging, do a debug(rmr2:::rmr.stream), step through until the variable final.command is set, and let me know what you see in there. The issue was raised before at https://groups.google.com/forum/#!topic/rhadoop/t6XAQe3oETc but nothing happened after that. We don't particularly encourage people to use backend.parameters, but when they have to, it's normally for some memory options set with -D, which is a generic option.
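To make the ordering point concrete, here is a minimal base-R sketch that checks a streaming command line for generic options placed after a command-specific option. The helper name misplaced.generic.options and the two sample command lines are hypothetical and not part of rmr2. The list of generic options is copied from the usage message quoted in the report. The tokenization is a crude whitespace split, so it is only a rough aid for eyeballing a string such as the one held in final.command.

```r
# Generic options, as listed in the Hadoop usage message quoted in the report.
generic.options <- c("-conf", "-D", "-fs", "-jt", "-files", "-libjars", "-archives")

# Hypothetical helper (not part of rmr2): returns the generic options that
# appear after the first command-specific option of a streaming command line.
misplaced.generic.options <- function(command) {
  # Crude tokenization: split on whitespace, ignoring shell quoting.
  tokens <- strsplit(trimws(command), "[[:space:]]+")[[1]]
  # Only inspect what follows the streaming jar, if one is present.
  jar <- grep("streaming.*\\.jar$", tokens)
  if (length(jar) > 0) tokens <- tokens[-seq_len(jar[1])]
  is.option <- grepl("^-[A-Za-z]", tokens)
  # Treat both "-D key=value" and "-Dkey=value" as the generic -D option.
  is.generic <- tokens %in% generic.options | grepl("^-D", tokens)
  first.command.option <- which(is.option & !is.generic)
  if (length(first.command.option) == 0) return(character(0))
  late <- is.generic & seq_along(tokens) > first.command.option[1]
  tokens[late]
}

# Made-up command lines for illustration only.
good <- "hadoop jar hadoop-streaming.jar -D mapred.reduce.tasks=1 -files a.R -input in -output out -mapper m"
bad  <- "hadoop jar hadoop-streaming.jar -input in -output out -files a.R"

print(misplaced.generic.options(good))
print(misplaced.generic.options(bad))
```

An empty result does not prove the command is valid; it only means this particular ordering rule is not violated.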

DMinedConsulting commented 9 years ago

Hi Antonio,

Sorry for the very late reply. As it turned out, the package had not been installed on all nodes of the cluster. Installing it on every node solved the issue, and we did not have to modify backend.parameters.
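For anyone hitting the same problem, a quick check along these lines can be run with Rscript on each node. It is a minimal sketch using only base R. The helper name missing.packages is hypothetical, and the package vector is an assumption that should be extended with whatever else the job needs. Hadoop tasks may run under a different user with a different library path, so it is probably best to run the check as that user.

```r
# Hypothetical helper: returns the packages from 'pkgs' that cannot be
# loaded by the R installation running this script.
missing.packages <- function(pkgs) {
  loadable <- vapply(pkgs, requireNamespace, logical(1),
                     quietly = TRUE, USE.NAMES = FALSE)
  pkgs[!loadable]
}

# Assumption: only rmr2 itself is checked here; add other packages as needed.
required <- c("rmr2")
absent <- missing.packages(required)

if (length(absent) > 0) {
  cat("Missing on this node:", paste(absent, collapse = ", "), "\n")
} else {
  cat("All required packages are available on this node.\n")
}
```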

Thank you for your help, MM

piccolbo commented 9 years ago

Thanks for reporting the fix and closing the issue.