RevolutionAnalytics / RHadoop
https://github.com/RevolutionAnalytics/RHadoop/wiki

mapreduce stops while outputting, fname #141

Closed · creggian closed this issue 12 years ago

creggian commented 12 years ago

I'm following the rmr2 tutorial.

  small.ints = to.dfs(1:1000)
  mapreduce(input = small.ints, map = function(k,v) cbind(v,v^2))

and this is the log I get:

> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: functional
> small.ints = to.dfs(1:1000)
12/10/18 07:42:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/18 07:42:50 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/18 07:42:50 INFO compress.CodecPool: Got brand-new compressor
> mapreduce(input = small.ints, map = function(k,v) cbind(v,v^2))
packageJobJar: [/tmp/RtmpmulW54/rmr-local-envff576eb9847, /tmp/RtmpmulW54/rmr-global-envff552d9f066, /tmp/RtmpmulW54/rmr-streaming-mapff53d4f7035, /tmp/hadoop-cloudera/hadoop-unjar8328009872758481942/] [] /tmp/streamjob3885039090765379535.jar tmpDir=null
12/10/18 07:43:02 INFO mapred.FileInputFormat: Total input paths to process : 1
12/10/18 07:43:03 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-cloudera/mapred/local]
12/10/18 07:43:03 INFO streaming.StreamJob: Running job: job_201210180740_0001
12/10/18 07:43:03 INFO streaming.StreamJob: To kill this job, run:
12/10/18 07:43:03 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201210180740_0001
12/10/18 07:43:03 INFO streaming.StreamJob: Tracking URL: http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201210180740_0001
12/10/18 07:43:04 INFO streaming.StreamJob:  map 0%  reduce 0%
12/10/18 07:43:17 INFO streaming.StreamJob:  map 100%  reduce 0%
12/10/18 07:43:20 INFO streaming.StreamJob:  map 100%  reduce 100%
12/10/18 07:43:20 INFO streaming.StreamJob: Job complete: job_201210180740_0001
12/10/18 07:43:20 INFO streaming.StreamJob: Output: /tmp/RtmpmulW54/fileff56d81a980
function () 
{
    fname
}
<environment: 0x8d52498>
>

Everything seems to run fine, but apparently it isn't able to complete the output. I don't have enough information to know where to look.

Here are the userlogs:

[cloudera@localhost job_201210180740_0001]$ cat attempt_201210180740_0001_m_000000_0/stderr 
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: functional
[cloudera@localhost job_201210180740_0001]$ cat attempt_201210180740_0001_m_000001_0/stderr 
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: functional
[cloudera@localhost job_201210180740_0001]$ cat attempt_201210180740_0001_m_000002_0/stderr 
[cloudera@localhost job_201210180740_0001]$ cat attempt_201210180740_0001_m_000003_0/stderr 
[cloudera@localhost job_201210180740_0001]$ cat attempt_201210180740_0001_m_00000
attempt_201210180740_0001_m_000000_0/ attempt_201210180740_0001_m_000001_0/ attempt_201210180740_0001_m_000002_0/ attempt_201210180740_0001_m_000003_0/
[cloudera@localhost job_201210180740_0001]$ cat attempt_201210180740_0001_m_000001_0/
log.index  stderr     stdout     syslog     
[cloudera@localhost job_201210180740_0001]$ cat attempt_201210180740_0001_m_000001_0/log.index 
LOG_DIR:/usr/lib/hadoop-0.20/bin/../logs/userlogs/job_201210180740_0001/attempt_201210180740_0001_m_000001_0
stdout:0 -1
stderr:0 -1
syslog:0 -1
[cloudera@localhost job_201210180740_0001]$ cat attempt_201210180740_0001_m_000001_0/stdout 
[cloudera@localhost job_201210180740_0001]$ cat attempt_201210180740_0001_m_000001_0/syslog 
2012-10-18 07:43:08,699 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2012-10-18 07:43:08,901 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/hadoop-cloudera/mapred/local/taskTracker/cloudera/jobcache/job_201210180740_0001/jars/job.jar <- /tmp/hadoop-cloudera/mapred/local/taskTracker/cloudera/jobcache/job_201210180740_0001/attempt_201210180740_0001_m_000001_0/work/job.jar
2012-10-18 07:43:08,917 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/hadoop-cloudera/mapred/local/taskTracker/cloudera/jobcache/job_201210180740_0001/jars/.job.jar.crc <- /tmp/hadoop-cloudera/mapred/local/taskTracker/cloudera/jobcache/job_201210180740_0001/attempt_201210180740_0001_m_000001_0/work/.job.jar.crc
2012-10-18 07:43:08,927 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/hadoop-cloudera/mapred/local/taskTracker/cloudera/jobcache/job_201210180740_0001/jars/rmr-local-envff576eb9847 <- /tmp/hadoop-cloudera/mapred/local/taskTracker/cloudera/jobcache/job_201210180740_0001/attempt_201210180740_0001_m_000001_0/work/rmr-local-envff576eb9847
2012-10-18 07:43:08,937 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/hadoop-cloudera/mapred/local/taskTracker/cloudera/jobcache/job_201210180740_0001/jars/rmr-streaming-mapff53d4f7035 <- /tmp/hadoop-cloudera/mapred/local/taskTracker/cloudera/jobcache/job_201210180740_0001/attempt_201210180740_0001_m_000001_0/work/rmr-streaming-mapff53d4f7035
2012-10-18 07:43:08,946 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /tmp/hadoop-cloudera/mapred/local/taskTracker/cloudera/jobcache/job_201210180740_0001/jars/rmr-global-envff552d9f066 <- /tmp/hadoop-cloudera/mapred/local/taskTracker/cloudera/jobcache/job_201210180740_0001/attempt_201210180740_0001_m_000001_0/work/rmr-global-envff552d9f066
2012-10-18 07:43:09,031 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2012-10-18 07:43:09,428 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2012-10-18 07:43:09,466 INFO org.apache.hadoop.mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@73a7ab
2012-10-18 07:43:09,782 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2012-10-18 07:43:09,785 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2012-10-18 07:43:09,810 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2012-10-18 07:43:10,137 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2012-10-18 07:43:10,200 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/usr/bin/Rscript, rmr-streaming-mapff53d4f7035]
2012-10-18 07:43:14,687 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
2012-10-18 07:43:14,704 INFO org.apache.hadoop.streaming.PipeMapRed: mapRedFinished
2012-10-18 07:43:15,114 INFO org.apache.hadoop.mapred.Task: Task:attempt_201210180740_0001_m_000001_0 is done. And is in the process of commiting
2012-10-18 07:43:16,285 INFO org.apache.hadoop.mapred.Task: Task attempt_201210180740_0001_m_000001_0 is allowed to commit now
2012-10-18 07:43:16,333 INFO org.apache.hadoop.mapred.FileOutputCommitter: Saved output of task 'attempt_201210180740_0001_m_000001_0' to hdfs://localhost:9000/tmp/RtmpmulW54/fileff56d81a980
2012-10-18 07:43:16,364 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201210180740_0001_m_000001_0' done.
2012-10-18 07:43:16,372 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1

This is the environment I'm working on:

  • CentOS 5.8
  • $ java -version
    java version "1.6.0_22"
    OpenJDK Runtime Environment (IcedTea6 1.10.4) (rhel-1.24.1.10.4.el5-i386)
    OpenJDK Client VM (build 20.0-b11, mixed mode)
  • $ hadoop version
    Hadoop 0.20.2-cdh3u5
    Subversion file:///data/1/tmp/topdir/BUILD/hadoop-0.20.2-cdh3u5 -r de14a95
    Compiled by root on Wed Aug 22 14:57:44 PDT 2012
    From source with checksum 32e743fc1528087177062231df2d5171
  • R version 2.15.1 (2012-06-22)
  • rmr2 version 2.0.0

creggian commented 12 years ago

I have some more information. I took a look at R/mapreduce.R in the rmr2 package: there are lots of functions whose names begin with dfs.<...>, but from the R console only dfs.empty and dfs.size are available.

So I defined dfs.tempfile in the console and called it:

> dfs.tempfile()
function() {fname}
<environment: 0x8e71a5c>

and that is the same output as the one shown at the end of the log above.
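
For what it's worth, if the closure works the way its printed body suggests, calling it (instead of just printing it) should return the generated output path. A minimal sketch of that assumption, using the dfs.tempfile definition copied from R/mapreduce.R:

  # assumption: the closure only captures fname and returns it when called
  out.file = dfs.tempfile()
  out.file      # printing the object shows the closure: function() {fname}
  out.file()    # calling it should return the generated path, e.g. "/tmp/Rtmp.../file..."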

EDIT: to install rmr2, I ran (from memory):

  $ sudo R CMD build rmr2...tar.gz
  $ sudo R CMD INSTALL <dir>/pkg

piccolbo commented 12 years ago

This seems to have worked just fine to me. If you want to see the numbers, you can wrap the mapreduce call in a from.dfs call.
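
For example, something along these lines should print the numbers (a minimal sketch based on the rmr2 tutorial; values() is assumed here to be the rmr2 accessor for the value part of what from.dfs returns):

  library(rmr2)
  small.ints = to.dfs(1:1000)
  # from.dfs pulls the key-value pairs produced by the job back into the R session
  result = from.dfs(
    mapreduce(input = small.ints,
              map = function(k, v) cbind(v, v^2)))
  # the values should be a matrix with columns v and v^2
  head(values(result))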
