RevolutionAnalytics / RHadoop
https://github.com/RevolutionAnalytics/RHadoop/wiki

java.io.FileNotFoundException: File does not exist #137

Closed geofferyzh closed 12 years ago

geofferyzh commented 12 years ago

I installed RHadoop (the rmr2 and rhdfs packages) on my Cloudera CDH3 virtual machine yesterday. When I tried to run the second tutorial example, the job seemed to finish correctly, but a "FileNotFoundException" occurred.

I had the same error message when trying to run the kmeans.R example. What did I do wrong here?

Thanks, Shaohua


library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: functional
library(rhdfs)
Loading required package: rJava

HADOOP_CMD=/usr/bin/hadoop-0.20

Be sure to run hdfs.init()

hdfs.init()

groups = rbinom(32, n = 50, prob = 0.4)
groups = to.dfs(groups)

12/10/05 12:08:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/05 12:08:46 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/05 12:08:46 INFO compress.CodecPool: Got brand-new compressor

from.dfs(mapreduce(input = groups, map = function(k,v) keyval(v, 1), reduce = function(k,vv) keyval(k, length(vv))))
packageJobJar: [/tmp/RtmpYyfscn/rmr-local-env163a6ff6d07e, /tmp/RtmpYyfscn/rmr-global-env163a5affd46d, /tmp/RtmpYyfscn/rmr-streaming-map163a474285bd, /tmp/RtmpYyfscn/rmr-streaming-reduce163a63a9bfda, /var/lib/hadoop-0.20/cache/training/hadoop-unjar6175181393484515689/] [] /tmp/streamjob5614144549414339994.jar tmpDir=null
12/10/05 12:09:00 INFO mapred.FileInputFormat: Total input paths to process : 1
12/10/05 12:09:00 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-0.20/cache/training/mapred/local]
12/10/05 12:09:00 INFO streaming.StreamJob: Running job: job_201210051159_0001
12/10/05 12:09:00 INFO streaming.StreamJob: To kill this job, run:
12/10/05 12:09:00 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201210051159_0001
12/10/05 12:09:00 INFO streaming.StreamJob: Tracking URL: http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201210051159_0001
12/10/05 12:09:01 INFO streaming.StreamJob: map 0% reduce 0%
12/10/05 12:09:09 INFO streaming.StreamJob: map 100% reduce 0%
12/10/05 12:09:18 INFO streaming.StreamJob: map 100% reduce 100%
12/10/05 12:09:20 INFO streaming.StreamJob: Job complete: job_201210051159_0001
12/10/05 12:09:20 INFO streaming.StreamJob: Output: /tmp/RtmpYyfscn/file163a71ea0500
Exception in thread "main" java.io.FileNotFoundException: File does not exist: 3
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
    at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

$key
[1] 0

$val
[1] 50

piccolbo commented 12 years ago

Thanks for your report. From other reports and our own experiments it appears to be an innocuous error, and it's already fixed in the upcoming 2.0.1. You can grab it from GitHub (the rmr-2.0.1 branch) if you know how, or just wait for the next release.
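If you want to try the branch before the official release, one way is to build the package tarball from a checkout of the rmr-2.0.1 branch and install it from source. A minimal sketch, with a hypothetical local path:

# Hypothetical path to a tarball built from a checkout of the rmr-2.0.1 branch
install.packages("/path/to/rmr2_2.0.1.tar.gz", repos = NULL, type = "source")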

Antonio


geofferyzh commented 12 years ago

Thanks for your quick reply.

piccolbo commented 12 years ago

Are you sure the result is correct? Maybe you should run it once more, but print groups before calling to.dfs so that you know what to expect. It appears your sample had 50 zeros in it.
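For example, a quick sanity check of the round trip (a sketch, assuming rmr2 is loaded and hdfs.init() has been run) could look like:

groups <- rbinom(32, n = 50, prob = 0.4)
table(groups)                       # what the sample contains, before it goes to the DFS
result <- from.dfs(to.dfs(groups))
table(result$val)                   # should match the local table if the round trip is faithful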

geofferyzh commented 12 years ago

OK, so I ran the following code to print the output, but got something unreadable...

hadoop fs -cat /tmp/RtmpYyfscn/file163a403e27e2/part-00000

output: Q/org2[training@server0 ~]$ .TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritableͤ����-���M���

What could have gone wrong?

piccolbo commented 12 years ago

And where did it say that the default format is a text format? If you really want to delve into the minutiae of formats like I have to do, pipe that through hexdump, but it still takes some serious pain tolerance. My suggestion: use from.dfs. If the data is too big, from.dfs(rmr.sample(.... If you need that data outside R, the other formats are probably a better choice; I would recommend "sequence.typedbytes" to connect with the rest of the Java world, or "csv" in any other case.
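As an illustration (a sketch only; the argument names follow rmr2 2.x conventions and should be checked against the installed version):

out <- from.dfs(groups)                        # small data: read it straight back into R
# out <- from.dfs(rmr.sample(groups, ...))     # big data: sample first, as suggested above
res <- mapreduce(input  = groups,
                 map    = function(k, v) keyval(v, 1),
                 reduce = function(k, vv) keyval(k, length(vv)),
                 output.format = "csv")        # or "sequence.typedbytes" for the Java world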

Antonio


geofferyzh commented 12 years ago

Thanks. I'm new to both Hadoop and R and have only worked with text files in Hadoop, so I wrongly assumed that "everything" is in text format...

The result I got was incorrect: I got all 50 zeros. "from.dfs(to.dfs(rbinom(32, n = 50, prob = 0.4)))" gives me all zeros.

$key
NULL

$val
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0

Any pointers?

piccolbo commented 12 years ago

Are you using a 32-bit platform? (On Unix and OS X, enter uname -a at the shell prompt; on Windows I don't know.) Yesterday we found a problem with serialization on 32-bit platforms. It's not something Revolution supports, but a user is working on a patch, so this may get fixed in the next release.
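You can also check the word size of the R build itself from inside R, for example:

.Machine$sizeof.pointer   # 8 on a 64-bit build, 4 on a 32-bit build
R.version$arch            # e.g. "x86_64" versus "i686"/"i386"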

Antonio


geofferyzh commented 12 years ago

Thanks. I'm using Cloudera's CDH3 training VM on Windows 7.

While waiting for the next release, is there a way I can install rmr1 so that I can start learning RHadoop?

piccolbo commented 12 years ago

Given that there are some important changes in the API moving to rmr2, and that rmr2 seems in general simpler for people to pick up, that doesn't seem a good investment of your time. What I suggest is that you change your VM, not your rmr version. I am running the Cloudera CDH4 VM in VirtualBox and it is 64-bit. Could you run uname -a in a terminal inside the VM and paste the output in a message? This is what I get:

[cloudera@localhost ~]$ uname -a
Linux localhost.localdomain 2.6.18-308.8.2.el5 #1 SMP Tue Jun 12 09:58:12 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[cloudera@localhost ~]$

From the VirtualBox manual it appears that it's a matter of choosing the right configuration at VM creation time. It may be different with the virtualization software you are using. Thanks

Antonio


geofferyzh commented 12 years ago

Linux server0.training.local 2.6.18-238.9.1.el5 #1 SMP Tue Apr 12 18:10:56 EDT 2011 i686 i686 i386 GNU/Linux

This is the printout I got.

I will try to install CDH4. Thanks.

creggian commented 12 years ago

I have the same issue geofferyzh had above: https://github.com/RevolutionAnalytics/RHadoop/issues/137#issuecomment-9264119

I'm working on CentOS 5.8 32-bit, so I will wait for the patch.

piccolbo commented 12 years ago

The patch is in the 2.0.1 branch, but nobody has the resources to test it. Your best bet is to take the matter into your own hands and test and fix it. It's a conversion issue, so if you do a from.dfs(to.dfs(1:10)) you'll have the answer we need. Your even better bet is to switch to 64-bit, given the very thin support for 32-bit from most of the industry. Just because I put a tentative patch in, I don't want you to think that 32-bit is making a roaring comeback; 32-bit is still on its way out.
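A round-trip check along these lines would answer it (a sketch; it assumes from.dfs returns the usual list with $key and $val):

x <- 1:10
y <- from.dfs(to.dfs(x))
identical(as.integer(y$val), x)   # TRUE if the 32-bit conversion patch behaves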


creggian commented 12 years ago

> from.dfs(to.dfs(1:10))
12/10/20 00:13:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/20 00:13:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/20 00:13:16 INFO compress.CodecPool: Got brand-new compressor
12/10/20 00:13:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/20 00:13:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/20 00:13:20 INFO compress.CodecPool: Got brand-new decompressor
$key
NULL

$val
 [1]  1  2  3  4  5  6  7  8  9 10

>

I think the patch is working, but there is another error: when calling the mapreduce function (any example from the tutorial) I get

12/10/20 00:12:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/20 00:12:01 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/20 00:12:01 INFO compress.CodecPool: Got brand-new compressor
Error in do.call(paste.options, backend.parameters) : 
  second argument must be a list
>

Finally, I agree with you about the architecture. I'm using RHadoop on my personal computer and I'm not deploying anything to production; I just need to study MapReduce, Hadoop, and R together.

Thanks for all

piccolbo commented 12 years ago

The backend.parameters error ("second argument must be a list") was an obvious bug in my patch, sorry about that; I checked in before testing. Not this time: I will check in after testing, which is going on right now.

As for studying MapReduce, Hadoop, and R together: all of these run on 64-bit.

A


piccolbo commented 12 years ago

@Nophiq I may have missed one of your reports (the backend.parameters problem) in the midst of your message. I believe I fixed it based on a separate report, but please check again for me in the 2.0.1 branch. Also, it would really help if we keep it to one problem per issue: an issue is either open or closed, and if there are two problems in one issue I can't mark one closed and the other open.