Closed: geofferyzh closed this issue 12 years ago
Thanks for your report. From other reports and our own experiments it appears to be an innocuous error, and it's already fixed in the upcoming 2.0.1. You can grab it from github (branch rmr-2.0.1) if you know how, or just wait for the next release.
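If it helps, one way to install from that branch from within R, once you have cloned the repository and checked out rmr-2.0.1 (the local path and package subdirectory below are assumptions; adjust them to your checkout):

# install the rmr2 package from a local source checkout of the rmr-2.0.1 branch
# (path is hypothetical; point it at the directory containing DESCRIPTION)
install.packages("~/RHadoop/rmr2/pkg", repos = NULL, type = "source")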
Antonio
On Fri, Oct 5, 2012 at 12:13 PM, geofferyzh notifications@github.com wrote:
I installed RHadoop (rmr2, rhdfs packages) on my Cloudera CDH3 virtual machine yesterday. When I tried to run the second tutorial example, the job seemed to finish correctly, but a "FileNotFoundException" occurred.
I had the same error message when trying to run the kmeans.R example. What did I do wrong here?
Thanks,
Shaohua
library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: functional
library(rhdfs)
Loading required package: rJava
HADOOP_CMD=/usr/bin/hadoop-0.20
Be sure to run hdfs.init()
hdfs.init()
groups = rbinom(32, n = 50, prob = 0.4)
groups = to.dfs(groups)
12/10/05 12:08:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/05 12:08:46 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/05 12:08:46 INFO compress.CodecPool: Got brand-new compressor
from.dfs(mapreduce(input = groups, map = function(k,v) keyval(v, 1), reduce = function(k,vv) keyval(k, length(vv))))
packageJobJar: [/tmp/RtmpYyfscn/rmr-local-env163a6ff6d07e, /tmp/RtmpYyfscn/rmr-global-env163a5affd46d, /tmp/RtmpYyfscn/rmr-streaming-map163a474285bd, /tmp/RtmpYyfscn/rmr-streaming-reduce163a63a9bfda, /var/lib/hadoop-0.20/cache/training/hadoop-unjar6175181393484515689/] [] /tmp/streamjob5614144549414339994.jar tmpDir=null
12/10/05 12:09:00 INFO mapred.FileInputFormat: Total input paths to process : 1
12/10/05 12:09:00 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-0.20/cache/training/mapred/local]
12/10/05 12:09:00 INFO streaming.StreamJob: Running job: job_201210051159_0001
12/10/05 12:09:00 INFO streaming.StreamJob: To kill this job, run:
12/10/05 12:09:00 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201210051159_0001
12/10/05 12:09:00 INFO streaming.StreamJob: Tracking URL: http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201210051159_0001
12/10/05 12:09:01 INFO streaming.StreamJob: map 0% reduce 0%
12/10/05 12:09:09 INFO streaming.StreamJob: map 100% reduce 0%
12/10/05 12:09:18 INFO streaming.StreamJob: map 100% reduce 100%
12/10/05 12:09:20 INFO streaming.StreamJob: Job complete: job_201210051159_0001
12/10/05 12:09:20 INFO streaming.StreamJob: Output: /tmp/RtmpYyfscn/file163a71ea0500
Exception in thread "main" java.io.FileNotFoundException: File does not exist: 3
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
    at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
$key
[1] 0
$val
[1] 50
Thanks for your quick reply.
Are you sure the result is correct? Maybe you should run it once more, but print groups before calling to.dfs so that you know what to expect. It appears your sample had 50 zeros in it.
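For example, a quick check along those lines (just a sketch; the exact counts depend on the random draw):

groups = rbinom(32, n = 50, prob = 0.4)
table(groups)            # how many of each value you should expect back
groups = to.dfs(groups)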
Ok, so I ran the following code to print the output, but got something not readable...
hadoop fs -cat /tmp/RtmpYyfscn/file163a403e27e2/part-00000
output: Q/org2[training@server0 ~]$ .TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritableͤ����-���M���
What could have gone wrong?
And where did it say that the default format is a text format? If you really want to delve into the minutiae of formats like I have to do, pipe that through hexdump, but it still takes some serious pain tolerance. My suggestion: use from.dfs. If the data is too big, use from.dfs(rmr.sample(...)). If you need that data outside R, the other formats are possibly a better choice; I would recommend "sequence.typedbytes" to connect with the rest of the Java world, or "csv" in any other case.
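A rough sketch of those two routes (object names are made up, and rmr.sample's exact arguments may differ between rmr2 versions):

out = mapreduce(input = groups,
                map = function(k, v) keyval(v, 1),
                reduce = function(k, vv) keyval(k, length(vv)))
# read the result back into R, sampling first if it is too big to fit in memory
res = from.dfs(rmr.sample(out, method = "any", n = 1000))

# or write the output in a format other tools can read directly
mapreduce(input = groups,
          map = function(k, v) keyval(v, 1),
          reduce = function(k, vv) keyval(k, length(vv)),
          output = "/tmp/group-counts",
          output.format = "csv")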
Antonio
Thanks. I'm new to both Hadoop and R and have only worked with text files in Hadoop, so I wrongly assumed that "everything" is in text format...
The result I got was incorrect: I got all 50 zeros. "from.dfs(to.dfs(rbinom(32, n = 50, prob = 0.4)))" gives me all zeros.
$key
NULL
$val
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0
Any pointers?
Are you using a 32-bit platform? (In Unix and OS X, enter uname -a at the shell prompt; in Windows I don't know.) Yesterday we found a problem with serialization on 32-bit platforms. It's not something Revolution supports, but a user is working on a patch, so this may get fixed in the next release.
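A quick way to check from inside R itself, which also covers Windows (just a convenience sketch, not an rmr2 feature):

# 8 means a 64-bit build of R, 4 means 32-bit
.Machine$sizeof.pointer
# OS-level details, including the machine architecture
Sys.info()[c("sysname", "release", "machine")]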
Antonio
Thanks. I'm using Cloudera's CDH3 training VM on Windows 7.
While waiting for the next release, is there a way I can install rmr1 so that I can start learning RHadoop?
Given that there are some important changes in the API moving to rmr2, and that rmr2 seems in general simpler for people to pick up, that doesn't seem a good investment of your time. What I suggest is that you change your VM, not your rmr version. I am running the CDH4 Cloudera VM in VirtualBox and it is running 64-bit. Could you do a uname -a at a terminal inside the VM and paste the output in a message? This is what I get:
[cloudera@localhost ~]$ uname -a
Linux localhost.localdomain 2.6.18-308.8.2.el5 #1 SMP Tue Jun 12 09:58:12 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[cloudera@localhost ~]$
From the VirtualBox manual it appears that it's a matter of choosing the right configuration at VM creation time. It may be different with the virtualization software you are using. Thanks
Antonio
Linux server0.training.local 2.6.18-238.9.1.el5 #1 SMP Tue Apr 12 18:10:56 EDT 2011 i686 i686 i386 GNU/Linux
This is the printout I got.
I will try to install CDH4, thanks.
I have the same issue geofferyzh had above https://github.com/RevolutionAnalytics/RHadoop/issues/137#issuecomment-9264119
I'm working on CentOS 5.8 32-bit, so I will wait for the patch.
The patch is in the 2.0.1 branch, but nobody has the resources to test it. Your best bet is to take the matter into your own hands and test and fix it. It's a conversion issue, so if you do a from.dfs(to.dfs(1:10)) you'll have the answer we need. Your even better bet is to switch to 64-bit, because of the very thin support for 32-bit from most of the industry. Just because I put a tentative patch in, I don't want you to think that 32-bit is making a roaring comeback; 32-bit is still on its way out.
> from.dfs(to.dfs(1:10))
12/10/20 00:13:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/20 00:13:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/20 00:13:16 INFO compress.CodecPool: Got brand-new compressor
12/10/20 00:13:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/20 00:13:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/20 00:13:20 INFO compress.CodecPool: Got brand-new decompressor
$key
NULL
$val
[1] 1 2 3 4 5 6 7 8 9 10
>
I think the patch is working, but there is another error: when calling the mapreduce function (any example from the tutorial) I get
12/10/20 00:12:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/20 00:12:01 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/20 00:12:01 INFO compress.CodecPool: Got brand-new compressor
Error in do.call(paste.options, backend.parameters) :
second argument must be a list
>
Finally, I agree with you about the architecture. I'm using RHadoop on my personal computer and I'm not deploying anything to production, so I need to study MapReduce, Hadoop, and R together.
Thanks for everything.
That was an obvious bug in my patch; sorry about that, I checked in before testing. Not this time: I will check in after testing, which is going on right now.
As for studying MapReduce, Hadoop, and R together: all of these run on 64-bit.
A
@Nophiq I may have missed one of your reports (the backend.parameters problem) in the midst of your message. I believe I fixed it based on a separate report, but please check again for me in the 2.0.1 branch. Also, it would really help me if we keep it to one problem per issue; the reason is that an issue is either open or closed, and if there are two problems in one issue I can't mark one closed and the other open.
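For that re-test, something along these lines should exercise the code path; the backend.parameters structure follows the rmr2 documentation, but treat the specific Hadoop option as a placeholder:

library(rmr2)
small.ints = to.dfs(1:1000)
out = mapreduce(input = small.ints,
                map = function(k, v) keyval(v, v^2),
                # extra streaming options are passed through backend.parameters,
                # which is expected to be a named list of lists
                backend.parameters = list(hadoop = list(D = "mapred.reduce.tasks=1")))
from.dfs(out)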