RevolutionAnalytics / rhdfs

A package that allows R developers to use Hadoop HDFS

I cannot read the output of a mapreduce job #4

Open fonsoim opened 11 years ago

fonsoim commented 11 years ago

I cannot read the output of a mapreduce job.

The code:

data = to.dfs(1:10)
res = mapreduce(input = data, map = function(k, v) cbind(v, 2*v))
print(res())

[1] "/tmp/Rtmpr5Xv1g/file34916a6426bf"

And then....

from.dfs(res)

Exception in thread "main" java.io.FileNotFoundException: File does not exist: /tmp/Rtmpr5Xv1g/file34916a6426bf/_logs ... ...

Finally,

hdfs.ls("/tmp/Rtmpr5Xv1g/file34916a6426bf")

  permission owner  group      size modtime          file
1 -rw------- daniel supergroup    0 2013-05-13 18:24 /tmp/Rtmpr5Xv1g/file34916a6426bf/_SUCCESS
2 drwxrwxrwt daniel supergroup    0 2013-05-13 18:23 /tmp/Rtmpr5Xv1g/file34916a6426bf/_logs
3 -rw------- daniel supergroup  448 2013-05-13 18:24 /tmp/Rtmpr5Xv1g/file34916a6426bf/part-00000
4 -rw------- daniel supergroup  122 2013-05-13 18:23 /tmp/Rtmpr5Xv1g/file34916a6426bf/part-00001

I note that /tmp/Rtmpr5Xv1g/file34916a6426bf/_logs is a directory.

Why does the program try to read "_logs" as a file when it is a directory?

Thanks in advance

Alfonso

piccolbo commented 11 years ago

On Mon, May 13, 2013 at 9:34 AM, fonsoim notifications@github.com wrote:

I cannot read the output of a mapreduce job.

The code:

data = to.dfs(1:10)
res = mapreduce(input = data, map = function(k, v) cbind(v, 2*v))
print(res())

This is not documented and you are not supposed to do it; it could break in the next bugfix release. Any code using it should be considered incorrect, and posting it does the project a disservice. Just so you know.
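
For reference, the documented pattern from the rmr2 tutorial treats the return value of mapreduce() as an opaque handle and reads it back with from.dfs(); a minimal sketch:

library(rmr2)
data = to.dfs(1:10)
res = mapreduce(input = data, map = function(k, v) cbind(v, 2*v))
out = from.dfs(res)   # the documented way to retrieve job output
out$val               # from.dfs() returns a list with $key and $val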

[1] "/tmp/Rtmpr5Xv1g/file34916a6426bf"

And then....

from.dfs(res)

Can you post the output of traceback() called immediately after this call? What versions of rmr2 and Hadoop are you using?
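
For example, something along these lines would capture the requested information (a sketch; run in the same R session, immediately after the failure):

from.dfs(res)            # reproduce the error
traceback()              # must be the very next call, before any other expression
packageVersion("rmr2")   # report the rmr2 version

At the shell prompt, hadoop version reports the Hadoop version.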

Antonio

fonsoim commented 11 years ago

Sorry for submitting the same problem in different places.

I do not understand why I am not supposed to use this code. It is a simple example like the ones in https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md

The versions of rmr2 and Hadoop are 2.1.0 and 2.0.0, respectively.

The code:

data = to.dfs(1:10)
res = mapreduce(input = data, map = function(k, v) cbind(v, 2*v))
from.dfs(res)

The error:

from.dfs(res)
DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.

/usr/lib/hadoop-hdfs/bin/hdfs: line 24: /usr/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/usr/lib/hadoop-hdfs/bin/hdfs: line 130: cygpath: command not found
/usr/lib/hadoop-hdfs/bin/hdfs: line 162: exec: : not found
Exception in thread "main" java.io.FileNotFoundException: File does not exist: /tmp/RtmpzXyC7B/file34c6342d57ed/_logs
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1312)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1258)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1231)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1213)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:392)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:170)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44064)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:972)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:960)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:171)
    at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:138)
    at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:131)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1117)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:249)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:82)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:746)
    at org.apache.hadoop.streaming.AutoInputFormat.getRecordReader(AutoInputFormat.java:56)
    at org.apache.hadoop.streaming.DumpTypedBytes.dumpTypedBytes(DumpTypedBytes.java:102)
    at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:83)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /tmp/RtmpzXyC7B/file34c6342d57ed/_logs
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1312)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1258)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1231)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1213)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:392)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:170)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44064)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

    at org.apache.hadoop.ipc.Client.call(Client.java:1225)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    at $Proxy9.getBlockLocations(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    at $Proxy9.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:154)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:970)
    ... 19 more

$key
list()

$val
list()

piccolbo commented 11 years ago

On Tue, May 21, 2013 at 1:21 AM, fonsoim notifications@github.com wrote:

Sorry for submitting the same problem in different places.

I do not understand why I am not supposed to use this code. It is a simple example like the ones in https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md

Where did you get the res() call, which exposes the internal representation of a big data object? Not from me.

The versions of rmr2 and Hadoop are 2.1.0 and 2.0.0, respectively.

How about the OS? Are you running Windows? If so, unfortunately it's not supported yet. If you are on Linux, let's do this experiment. In R, call

to.dfs(1:10, output = "/tmp/ls-test")

At the shell prompt try

hadoop dfs -ls /tmp/ls-test

The first two errors that you get point to Hadoop problems independent of R, and this little experiment will help confirm that.
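
As an extra check of the read path, from.dfs() can be pointed directly at the test file (a sketch, assuming from.dfs() accepts a path as well as a big data object):

from.dfs("/tmp/ls-test")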

Antonio

The code:

data = to.dfs(1:10)
res = mapreduce(input = data, map = function(k, v) cbind(v, 2*v))
from.dfs(res)

The error:

from.dfs(res)
DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.

/usr/lib/hadoop-hdfs/bin/hdfs: line 24: /usr/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
/usr/lib/hadoop-hdfs/bin/hdfs: line 130: cygpath: command not found

This is where I suspect you are running the Windows version.

fonsoim commented 11 years ago

The OS is Ubuntu 12.04

I did your experiment:

to.dfs(1:10, output = "/tmp/ls-test")
hadoop dfs -ls /tmp/ls-test

It works. HDFS contains the file at "/tmp/ls-test", and I can list it at the shell prompt.

piccolbo commented 11 years ago

Maybe we have two problems here. One is that you have a configuration error. Judging from a web search it doesn't seem to be very common; nonetheless, I suspect you won't be up and running until you fix it. Take a look at this report, http://hortonworks.com/community/forums/topic/unable-to-start-the-datanode/, and see if you can get some insight as to what is wrong with your configuration. The other is from.dfs trying to read the _logs directory. This is puzzling, because there is an explicit filter that discards anything starting with "_". Could you try this in R:

rmr2:::part.list("/tmp/ls-test")
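
(For reference, the kind of filtering involved can be sketched in plain R on top of rhdfs — hypothetical code, not rmr2's actual implementation:

files = hdfs.ls("/tmp/ls-test")$file
files[!grepl("^_", basename(files))]   # drop job side files such as _logs and _SUCCESS

part.list() should likewise return only the part files, never _logs.)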

I am not sure what the connection between the two problems could be, but related or not we need to solve both to make progress. Thanks

Antonio

kardes commented 9 years ago

Hi, is this resolved? I have the same problem. Thanks