RevolutionAnalytics / rmr2

A package that allows R developers to use Hadoop MapReduce

from.dfs produces "file does not exist" error #161

Open kardes opened 9 years ago

kardes commented 9 years ago

Hi, I set up R and Hadoop using the Cloudera QuickStart VM, CDH 5.3.

R version 3.1.2; VirtualBox Manager 4.3.20 running on Mac OS X 10.7.5. I followed the blog http://www.r-bloggers.com/integration-of-r-rstudio-and-hadoop-in-a-virtualbox-cloudera-demo-vm-on-mac-os-x/ to set up R and Hadoop and turned off MR2/YARN; instead I am using MR1.

Everything seems to work fine except the from.dfs function.

I am using the simple example in R:

small.ints <- to.dfs(1:1000)
out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
df <- as.data.frame(from.dfs(out))

from.dfs produces the following error. If you could be of any help, I'd greatly appreciate it. Thank you very much. -EK

When I use it I get the error:

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:8020/user/cloudera/128432
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:8020/user/cloudera/422
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:8020/user/cloudera/122
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

piccolbo commented 9 years ago

can you enter

out()

and paste the output back here?

kardes commented 9 years ago

out()
[1] "/tmp/RtmpmQu2O7/file1b584440dee3"

piccolbo commented 9 years ago

Without closing the R session where you did the last step, try at the shell prompt:

hadoop jar $HADOOP_STREAMING dumptb /tmp/RtmpmQu2O7/file1b584440dee3

kardes commented 9 years ago

I opened a new terminal window (without closing the current one with the R session) and entered that line:

$ hadoop jar $HADOOP_STREAMING dumptb /tmp/RtmpmQu2O7/file1b584440dee3
Not a valid JAR: /home/cloudera/dumptb

piccolbo commented 9 years ago

Make sure HADOOP_STREAMING is set in that shell instance. It looks like it's empty.
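
If it helps, a minimal check-and-fix sketch for that shell (the jar path is the CDH MR1 one that appears later in this thread; substitute your own):

echo "$HADOOP_STREAMING"   # empty output means the variable is not set in this shell
export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.2.jar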

kardes commented 9 years ago

Could you please provide specific instructions on how to do it? So far, after

small.ints <- to.dfs(1:1000)
out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))

I check out()

out()
[1] "/tmp/Rtmp5Nt5L7/file25ed2392eeba"

Then I get out of R using Ctrl-Z and entering bg (putting R into the background). Then I enter

$ hadoop jar $HADOOP_STREAMING dumptb /tmp/Rtmp5Nt5L7/file25ed2392eeba

and get

Not a valid JAR: /home/cloudera/dumptb

thanks

piccolbo commented 9 years ago

That's part of installing rmr2. No HADOOP_STREAMING, no rmr2.
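
One way to locate the streaming jar, as a sketch (paths vary by distribution and version; this is the same find command used later in this thread):

find $HADOOP_HOME -name 'hadoop-streaming*.jar'          # prints candidate jar paths
export HADOOP_STREAMING=/path/to/hadoop-streaming.jar    # placeholder; use a real path printed by find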

kardes commented 9 years ago

Hi Antonio, I don't understand anything from the results when I search for hadoop streaming rmr2. Could you please point me to a resource where I can get this set up correctly? Thanks.

piccolbo commented 9 years ago

You are seriously telling me you do not understand a list of two files? Can you ask a more specific question?

On Fri, Mar 20, 2015 at 11:09 AM, kardes notifications@github.com wrote:

I obtain the following when I do a find:

$ find $HADOOP_HOME -name hadoop-streaming*.jar
/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.2.jar
/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar

kardes commented 9 years ago

sorry, of course not.

I still get the error at the top of this page (the original error). I tried the following:

Opened a terminal and entered:

$ echo "export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.2.jar" >> ~/.bashrc

Then in R, I entered:

small.ints <- to.dfs(1:1000)
out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
df <- as.data.frame(from.dfs(out))

In the last line above, I get the same error. I tried the following last, but I am not sure how to proceed. Please help! Thanks.

out()
[1] "/tmp/RtmpuZJz6S/file1c1778b1f9b"
^Z
[1]+ Stopped R
[cloudera@quickstart ~]$ bg
[1]+ R &
[cloudera@quickstart ~]$ hadoop jar $HADOOP_STREAMING dumptb /tmp/RtmpuZJz6S/file1c1778b1f9b
Exception in thread "main" java.io.FileNotFoundException: Path is not a file: /tmp/RtmpuZJz6S/file1c1778b1f9b/_logs
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:69)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1879)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1820)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1800)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1772)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:527)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:85)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:356)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1171)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1159)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1149)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:270)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:237)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:230)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1448)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:301)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:297)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at org.apache.hadoop.streaming.AutoInputFormat.getRecordReader(AutoInputFormat.java:56)
at org.apache.hadoop.streaming.DumpTypedBytes.dumpTypedBytes(DumpTypedBytes.java:102)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:83)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /tmp/RtmpuZJz6S/file1c1778b1f9b/_logs
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:69)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1879)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1820)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1800)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1772)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:527)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:85)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:356)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

at org.apache.hadoop.ipc.Client.call(Client.java:1411)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:246)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1169)
piccolbo commented 9 years ago

The last error you got is expected, because you did a dumptb on a directory, which is not allowed; you'd have to list that directory first and dump the files called part-*. I would like to be sure that the R session you are working in has the correct setting. You added an apparently correct line to .bashrc, but that's the wrong file, because it only affects interactive shells. You want to use .profile or .bash_profile. Then you need to reload it with . .profile or . .bash_profile, then restart R. Then you can do Sys.getenv("HADOOP_STREAMING") to make sure the setting has been picked up correctly, and then you should try the example again and see what happens. The devil is in the details, as they say.
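
Concretely, a sketch of that sequence (using the jar path from your find output; adjust if yours differs):

echo 'export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.2.jar' >> ~/.bash_profile
. ~/.bash_profile   # reload the profile in the current shell
R                   # restart R from this shell so it inherits the variable
# then, inside R, Sys.getenv("HADOOP_STREAMING") should print the jar path, not ""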

kardes commented 9 years ago

I did update the .bash_profile file upon your recommendation; I still get the error. Could you please let me know how to proceed? Thank you very much. Here is a snapshot of my shell.

[cloudera@quickstart ~]$ echo "export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.2.jar" >> ~/.bash_profile
[cloudera@quickstart ~]$ source ~/.bash_profile
[cloudera@quickstart ~]$ R

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Sys.getenv("HADOOP_STREAMING")
[1] "/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.2.jar"
library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
small.ints <- to.dfs(1:1000)
15/03/20 15:00:29 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/03/20 15:00:29 INFO compress.CodecPool: Got brand-new compressor [.deflate]
out <- mapreduce(input = small.ints, map = function(k, v) keyval(v, v^2))
packageJobJar: [/tmp/RtmpT6ad7h/rmr-local-env14e469666bf9, /tmp/RtmpT6ad7h/rmr-global-env14e431d70a48, /tmp/RtmpT6ad7h/rmr-streaming-map14e4610d911a, /tmp/hadoop-cloudera/hadoop-unjar1796188924023766754/] [] /tmp/streamjob8004571112410021052.jar tmpDir=null
15/03/20 15:00:33 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/03/20 15:00:33 INFO mapred.FileInputFormat: Total input paths to process : 1
15/03/20 15:00:34 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-hdfs/cache/cloudera/mapred/local]
15/03/20 15:00:34 INFO streaming.StreamJob: Running job: job_201503201448_0001
15/03/20 15:00:34 INFO streaming.StreamJob: To kill this job, run:
15/03/20 15:00:34 INFO streaming.StreamJob: /usr/lib/hadoop-0.20-mapreduce/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201503201448_0001
15/03/20 15:00:34 INFO streaming.StreamJob: Tracking URL: http://quickstart.cloudera:50030/jobdetails.jsp?jobid=job_201503201448_0001
15/03/20 15:00:35 INFO streaming.StreamJob: map 0% reduce 0%
15/03/20 15:00:57 INFO streaming.StreamJob: map 100% reduce 0%
15/03/20 15:01:02 INFO streaming.StreamJob: map 100% reduce 100%
15/03/20 15:01:02 INFO streaming.StreamJob: Job complete: job_201503201448_0001
15/03/20 15:01:02 INFO streaming.StreamJob: Output: /tmp/RtmpT6ad7h/file14e456031345
df <- as.data.frame(from.dfs(out))
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:8020/user/cloudera/0
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:8020/user/cloudera/128464
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:8020/user/cloudera/30402
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:8020/user/cloudera/122
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

piccolbo commented 9 years ago

I am at a loss. Those are not files that rmr2 manipulates, at least not explicitly. The only thing I can think of is that there are two streaming jars packaged with CDH and you are using the wrong one. If you are using YARN, you need to use the other one. That's done by setting HADOOP_STREAMING to the other path returned by the find command that you shared a few messages back. No clue how this could be related to the error, but it's something we want to be sure isn't in the way.
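
For reference, a sketch of switching jars; the MR2/YARN path below is a typical CDH location and is an assumption, not something confirmed in this thread, so verify it exists on your VM first:

# MR1 (what you have now):
export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.2.jar
# MR2/YARN (typical CDH location; verify before using):
# export HADOOP_STREAMING=/usr/lib/hadoop-mapreduce/hadoop-streaming.jar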

kardes commented 9 years ago

I am not using YARN. I am using MR1. That's why I am doing:

echo "export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.2.jar" >> ~/.bash_profile

piccolbo commented 9 years ago

You understand that since I can't reproduce this, unless you can give me access to a test system, you'll have to debug it yourself, but I will try to help. My current thinking is that your streaming installation has a problem which is outside rmr2's control and not something we can fix, but at least I would be able to tell you to go bug the cloudera guys with some good argument to do so. To repro the error outside R:

  1. Run your mapreduce job.
  2. Follow the console output until a line like the following:

15/03/20 15:01:02 INFO streaming.StreamJob: Output: /tmp/RtmpT6ad7h/

  3. Make a note of that path (which will be different for every run).
  4. Open a shell without closing the current R session.
  5. List that directory, with something like

hdfs dfs -ls <path>

  6. It should contain several files named part-*. Pick one, say <file1>.
  7. Now try to dump its contents (see the sketch after this list):

hadoop jar $HADOOP_STREAMING dumptb <path>/<file1>

<> brackets mean "replace with actual value".

It should fail exactly the way the from.dfs function fails. If that's the case, you have something to report to cloudera. Otherwise, we need to debug from.dfs more closely. Thanks
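
For example, with the output path from the session above, the sequence would look something like this (part-00000 is illustrative; use whatever file names the listing actually shows):

hdfs dfs -ls /tmp/RtmpT6ad7h/file14e456031345                                     # list the job output directory
hadoop jar $HADOOP_STREAMING dumptb /tmp/RtmpT6ad7h/file14e456031345/part-00000   # dump one part file, not the directory or _logs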

kardes commented 9 years ago

Actually, I have done this before, and what I get when I run hadoop jar $HADOOP_STREAMING dumptb <path>/<file1> is that some meaningless figures/shapes/rectangles appear on the screen, and I do not get the error that from.dfs() produces.


kardes commented 9 years ago

I don't know. I am very new to all this, and maybe I made a mistake setting things up; I am not sure. I can try to set everything up from scratch, but I couldn't find a good blog that describes an R + Hadoop setup using recent versions of CDH.


piccolbo commented 9 years ago

Sorry, I forgot to add a redirection to that command, so you got the contents of a binary file in the console. As meaningless as it looked, it was probably just fine. Try this to be absolutely sure:

hadoop jar $HADOOP_STREAMING dumptb <path>/<file1> > /tmp/dumptb.out

(the last greater-than sign is to be entered as-is, no substitutions; <> brackets mean "replace with actual value")
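For concreteness, here is the same check driven from R (a sketch only; the output directory is an illustrative value taken from an earlier run, so substitute the path your own job prints):

out.dir <- "/tmp/RtmpUzjoWy/file1c7dbd916e7"    # illustrative; take it from the "Output:" log line
system(paste("hdfs dfs -ls", out.dir))          # should list _SUCCESS, _logs and part-* files
part <- file.path(out.dir, "part-00000")        # pick any part file
system(paste("hadoop jar $HADOOP_STREAMING dumptb", part, "> /tmp/dumptb.out"))
file.info("/tmp/dumptb.out")$size               # positive size and no exception means streaming is fine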

The other thing is that if this succeeds, it's more of an rmr2 problem (my plate). So try this:

debug(from.dfs)

from.dfs(out)

Step until you reach the dumptb function definition, then step once more:

debug(dumptb)

c

You are now in dumptb, a Very Simple Function.

Please print the contents of src and also

paste(hadoop.streaming(), "dumptb", rmr.normalize.path(x), ">>", rmr.normalize.path(dest))

and paste the results here. The idea is that the dumptb function does almost exactly what you typed at the command line, and that worked; so there must be some difference either in the command entered or in the environment in which it is executed. Thanks for your patience and cooperation.
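Put together, the session described above would look roughly like this at the R prompt (a sketch; out is the mapreduce() result from the earlier example):

debug(from.dfs)
from.dfs(out)
# step with n until from.dfs reaches its internal call to dumptb, then:
debug(dumptb)
c        # continue; execution stops again inside dumptb
src      # the paths dumptb is asked to dump -- these should be the part-* files
dest     # the local temp file the typedbytes output is appended to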


piccolbo commented 9 years ago

Correction: that cmd should read

paste(hadoop.streaming(), "dumptb", rmr.normalize.path(src[[1]]), ">>", rmr.normalize.path(dest))
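If everything is wired correctly, that paste() should assemble a single shell command of the same shape as the one that just worked at the prompt. A sketch of the expected value, with illustrative paths, assuming hadoop.streaming() expands to the "hadoop jar <streaming jar>" prefix:

# illustrative expected result of the corrected paste() call:
# [1] "hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.2.jar dumptb /tmp/RtmpUzjoWy/file1c7dbd916e7/part-00000 >> /tmp/RtmpUzjoWy/file1c7dXXXX"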


piccolbo commented 9 years ago

On Wed, Mar 25, 2015 at 3:29 PM, kardes notifications@github.com wrote:

I did the following:

[cloudera@quickstart ~]$ hadoop fs -ls /tmp/RtmpUzjoWy/file1c7dbd916e7

Found 4 items
-rw-r--r--   1 cloudera supergroup          0 2015-03-25 15:18 /tmp/RtmpUzjoWy/file1c7dbd916e7/_SUCCESS
drwxrwxrwx   - cloudera supergroup          0 2015-03-25 15:17 /tmp/RtmpUzjoWy/file1c7dbd916e7/_logs
-rw-r--r--   1 cloudera supergroup        422 2015-03-25 15:17 /tmp/RtmpUzjoWy/file1c7dbd916e7/part-00000
-rw-r--r--   1 cloudera supergroup        122 2015-03-25 15:17 /tmp/RtmpUzjoWy/file1c7dbd916e7/part-00001

and then

[cloudera@quickstart ~]$ hadoop jar $HADOOP_STREAMING dumptb /tmp/RtmpUzjoWy/file1c7dbd916e7 / part-00001 > /tmp/dumptb.out
Exception in thread "main" java.io.FileNotFoundException: Path is not a file: /tmp/RtmpUzjoWy/file1c7dbd916e7/_logs

There are two spaces too many in this path: /tmp/RtmpUzjoWy/file1c7dbd916e7 / part-00001 See them? Around the /? Please remove them and try again.


So I am getting the error in this case, even though I am doing everything as you suggested. How should I proceed? Thanks for your help.


kardes commented 9 years ago

After running the mapreduce job, I did the following:

[cloudera@quickstart ~]$ hadoop fs -ls /tmp/RtmpUzjoWy/file1c7dbd916e7

Found 4 items
-rw-r--r--   1 cloudera supergroup          0 2015-03-25 15:18 /tmp/RtmpUzjoWy/file1c7dbd916e7/_SUCCESS
drwxrwxrwx   - cloudera supergroup          0 2015-03-25 15:17 /tmp/RtmpUzjoWy/file1c7dbd916e7/_logs
-rw-r--r--   1 cloudera supergroup        422 2015-03-25 15:17 /tmp/RtmpUzjoWy/file1c7dbd916e7/part-00000
-rw-r--r--   1 cloudera supergroup        122 2015-03-25 15:17 /tmp/RtmpUzjoWy/file1c7dbd916e7/part-00001

and then

[cloudera@quickstart ~]$ hadoop jar $HADOOP_STREAMING dumptb /tmp/RtmpUzjoWy/file1c7dbd916e7/part-00001 > /tmp/dumptb.out
[cloudera@quickstart ~]$

So I did not get an error in this case, and I continued:

Browse[2]> debug(dumptb) Browse[2]> c debugging in: dumptb(part.list(fname), tmp) debug: { lapply(src, function(x) system(paste(hadoop.streaming(), "dumptb", x, ">>", dest))) } Browse[2]> src [1] "0" "128429" "422" "122"

Browse[2]> paste(hadoop.streaming(), "dumptb", rmr.normalize.path(src[[1]]), ">>", rmr.normalize.path(dest))
Error in paste(hadoop.streaming(), "dumptb", rmr.normalize.path(src[[1]]),  :
  could not find function "rmr.normalize.path"
Browse[2]>
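Two things stand out in that transcript, as an aside. First, the values in src ("0", "128429", "422", "122") look like the size column of the hadoop fs -ls listing above (422 and 122 match exactly) rather than HDFS paths, which would explain dumptb being pointed at files that do not exist. Second, rmr.normalize.path is not found, which suggests these debugging steps refer to a newer rmr2 than the one installed. A small hedged check for the latter, in plain base R:

# TRUE only if the installed rmr2 defines rmr.normalize.path in its namespace
exists("rmr.normalize.path", envir = asNamespace("rmr2"), inherits = FALSE)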

Please let me know how to proceed. Thanks for your time.

piccolbo commented 9 years ago

I've got it: part.list is failing, probably because of a problem with hdfs.ls. Please run:

rmr2:::hdfs.ls(out())

Please share what that returns, along with its class.

kardes commented 9 years ago

Browse[2]> rmr2:::hdfs.ls(out())
     [,1]         [,2] [,3]       [,4]         [,5]  [,6]         [,7]
[1,] "-rw-r--r--" "1"  "cloudera" "supergroup" "0"   "2015-03-25" "15:18"
[2,] "drwxrwxrwx" "-"  "cloudera" "supergroup" "0"   "2015-03-25" "15:17"
[3,] "-rw-r--r--" "1"  "cloudera" "supergroup" "422" "2015-03-25" "15:17"
[4,] "-rw-r--r--" "1"  "cloudera" "supergroup" "122" "2015-03-25" "15:17"
     [,8]
[1,] "/tmp/RtmpUzjoWy/file1c7dbd916e7/_SUCCESS"
[2,] "/tmp/RtmpUzjoWy/file1c7dbd916e7/_logs"
[3,] "/tmp/RtmpUzjoWy/file1c7dbd916e7/part-00000"
[4,] "/tmp/RtmpUzjoWy/file1c7dbd916e7/part-00001"
Browse[2]> str(rmr2:::hdfs.ls(out()))
 chr [1:4, 1:8] "-rw-r--r--" "drwxrwxrwx" "-rw-r--r--" "-rw-r--r--" ...
Browse[2]> class(rmr2:::hdfs.ls(out()))
[1] "matrix"
Browse[2]>

piccolbo commented 9 years ago

This is close to impossible. Please enter:

packageDescription("rmr2")

kardes commented 9 years ago

packageDescription("rmr2") Package: rmr2 Type: Package Title: R and Hadoop Streaming Connector Version: 2.0.2 Date: 2012-4-12 Author: Revolution Analytics Depends: R (>= 2.6.0), Rcpp, RJSONIO (>= 0.8-2), digest, functional, stringr, plyr Suggests: quickcheck Collate: basic.R keyval.R IO.R local.R streaming.R mapreduce.R extras.R ..... Maintainer: Revolution Analytics rhadoop@revolutionanalytics.com Description: Supports the map reduce programming model on top of hadoop streaming License: Apache License (== 2.0) Packaged: 2012-12-05 03:35:30 UTC; antonio Built: R 3.1.2; x86_64-redhat-linux-gnu; 2015-03-12 22:30:28 UTC; unix

-- File: /usr/lib64/R/library/rmr2/Meta/package.rds

piccolbo commented 9 years ago

Please upgrade to the latest version. Thanks

Antonio
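(rmr2 is not distributed on CRAN, so upgrading means installing a source tarball from the project's releases. A sketch, with the tarball file name and version as placeholders; check the project's Downloads page for the current release:)

# dependencies, taken from the DESCRIPTION shown above; newer releases may need more
install.packages(c("Rcpp", "RJSONIO", "digest", "functional", "stringr", "plyr"))
# hypothetical file name: substitute the tarball you actually downloaded
install.packages("rmr2_3.3.1.tar.gz", repos = NULL, type = "source")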

kardes commented 9 years ago

That did it, thanks!