RevolutionAnalytics / RHadoop
https://github.com/RevolutionAnalytics/RHadoop/wiki

rmr job fails with output.format option #133

Closed aruntewatia closed 12 years ago

aruntewatia commented 12 years ago

Hi,

I have configured a 3-node hadoop-1.0.3 cluster, running Ubuntu 12.04 on 3 VirtualBox virtual machines. I followed this link for the installation of rmr_1.3.1.tar.gz: http://www.hadoopconsultant.nl/howto-install-hadoop-rmr-on-ubuntu/

I followed this tutorial, trying to run wordcount: https://github.com/jeffreybreen/tutorial-201209-TDWI-big-data/tree/master/presentation/RHadoop.pdf

The output is visible in a data frame:

```r
df = as.data.frame(from.dfs(out, structured = T))
head(df)
```

The problem is that when I look at the output file in HDFS through the browser, it shows junk characters.

Also, if I add output.format = "text" to the mapreduce function, the rmr job fails with the following stderr logs:

```
Loading required package: Rcpp
Loading required package: methods
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: rhdfs
Loading required package: rJava
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
  call: fun(libname, pkgname)
  error: Environment variable HADOOP_CMD must be set before loading package rhdfs
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:124)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
```

The first example in the rmr tutorial completed successfully:

```r
small.ints = to.dfs(1:1000)
mapreduce(input = small.ints,
          map = function(k, v) keyval(v, v^2),
          ouput.format = "text")
```

I have tried running the rmr job from 2 user accounts, i.e. root and hadoop. library(rmr) and library(rhdfs) load successfully for each user, and the environment variables HADOOP_CMD and HADOOP_STREAMING are set properly.
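(For anyone checking the same thing, the variables can be inspected from the R session that submits the job, in plain base R:)

```r
# Both variables should point at existing files in the Hadoop install
Sys.getenv(c("HADOOP_CMD", "HADOOP_STREAMING"))
```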

Any idea for a solution?

Thanks in advance,
Arun Tewatia

piccolbo commented 12 years ago

On Tue, Sep 25, 2012 at 10:01 PM, aruntewatia notifications@github.com wrote:


> The problem is that when I look at the output file in HDFS through the browser, it shows junk characters.

This is not a problem, it's a binary format. Use a hexadecimal viewer if you are interested in the format itself.
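The intended way to inspect results is to read them back into R, as in the snippet quoted at the top of the thread. A minimal sketch, where `out` is the object returned by the tutorial's mapreduce() call:

```r
library(rmr)
# The part files under the output directory are rmr's binary serialization,
# so read them back with from.dfs() instead of the HDFS file browser
df = as.data.frame(from.dfs(out, structured = TRUE))
head(df)
```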

> Also, if I add output.format = "text" to the mapreduce function, the rmr job fails with the following stderr logs:
>
> Error : .onLoad failed in loadNamespace() for 'rhdfs', details: call: fun(libname, pkgname) error: Environment variable HADOOP_CMD must be set before loading package rhdfs

This is a different problem: you have not configured rhdfs properly on the node. Detach it before calling mapreduce if you don't need it, or make sure that the environment is properly set on each node.
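A sketch of both options; the hadoop path below is hypothetical, inferred from the paths in your logs:

```r
# If the job itself doesn't need rhdfs, detach it before calling mapreduce()
# so the streaming tasks don't try to load it where HADOOP_CMD is unset
detach("package:rhdfs", unload = TRUE)

# Otherwise set HADOOP_CMD before rhdfs loads, and make sure the same
# variable is visible to R on every node (e.g. via each node's ~/.Renviron).
# Hypothetical path, adjust to your install:
Sys.setenv(HADOOP_CMD = "/home/hadoop/hadoop-1.0.3/bin/hadoop")
library(rhdfs)
hdfs.init()
```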


Try output.format = "csv"; with a data frame, I suspect that's what you really meant.
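For example, with the squares job from your first message (a sketch of the suggestion, untested here):

```r
library(rmr)
# A delimited output format makes the HDFS part files plain text, so the
# browser shows readable lines instead of binary serialization
small.ints = to.dfs(1:1000)
mapreduce(input = small.ints,
          map = function(k, v) keyval(v, v^2),
          output.format = "csv")
```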

Antonio


aruntewatia commented 12 years ago

Hi piccolbo,

Thanks for the quick reply. I worked on your suggestions: I removed rhdfs from all nodes and reconfigured it. I also confirmed that the Java environment is set properly with all environment variables, and reconfigured the R environment using the command "R CMD javareconf".

The error related to rhdfs is not occurring now, but the rmr job is still failing if I give output.format = "text" or "csv".

The output of stderr is as follows:

```
Loading required package: Rcpp
Loading required package: methods
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: Rcpp
Loading required package: methods
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Error: ignoring SIGPIPE signal
Execution halted
Warning: no graphics system to unregister
*** glibc detected *** /usr/lib/R/bin/exec/R: double free or corruption (!prev): 0x00000000014108f0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7e626)[0x7fe0428ef626]
/lib/x86_64-linux-gnu/libc.so.6(fclose+0x155)[0x7fe0428df2a5]
...
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 134
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:97)
    at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1431)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1436)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 134
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
    at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1436)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1436)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
```

On the R terminal the output is:

```
wordcount("/tmp/gutenberg.txt", "/tmp/temp1")

packageJobJar: [/tmp/RtmpRjT9PL/rmr-local-env, /tmp/RtmpRjT9PL/rmr-global-env, /tmp/RtmpRjT9PL/rhstr.map32983d02758, /tmp/RtmpRjT9PL/rhstr.reduce32987c450c44, /tmp/RtmpRjT9PL/rhstr.combine32984b52fa86, /home/hadoop/hadoop-1.0.3/hdfs-tmp/hadoop-unjar3818777795746527490/] [] /tmp/streamjob8597560344974524382.jar tmpDir=null
12/09/26 16:13:30 INFO mapred.FileInputFormat: Total input paths to process : 1
12/09/26 16:13:30 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop/hadoop-1.0.3/mapred-tmp]
12/09/26 16:13:30 INFO streaming.StreamJob: Running job: job_201209261603_0002
12/09/26 16:13:30 INFO streaming.StreamJob: To kill this job, run:
12/09/26 16:13:30 INFO streaming.StreamJob: /media/sf_HadoopShare/libexec/../bin/hadoop job -Dmapred.job.tracker=raftaar-VirtualBox106:54311 -kill job_201209261603_0002
12/09/26 16:13:30 INFO streaming.StreamJob: Tracking URL: http://raftaar-VirtualBox106:50030/jobdetails.jsp?jobid=job_201209261603_0002
12/09/26 16:13:31 INFO streaming.StreamJob: map 0% reduce 0%
12/09/26 16:13:47 INFO streaming.StreamJob: map 50% reduce 0%
12/09/26 16:13:51 INFO streaming.StreamJob: map 100% reduce 0%
12/09/26 16:13:53 INFO streaming.StreamJob: map 50% reduce 0%
12/09/26 16:13:57 INFO streaming.StreamJob: map 0% reduce 0%
12/09/26 16:14:06 INFO streaming.StreamJob: map 47% reduce 0%
12/09/26 16:14:08 INFO streaming.StreamJob: map 97% reduce 0%
12/09/26 16:14:09 INFO streaming.StreamJob: map 100% reduce 0%
12/09/26 16:14:14 INFO streaming.StreamJob: map 50% reduce 0%
12/09/26 16:14:21 INFO streaming.StreamJob: map 0% reduce 0%
12/09/26 16:14:33 INFO streaming.StreamJob: map 48% reduce 0%
12/09/26 16:14:36 INFO streaming.StreamJob: map 97% reduce 0%
12/09/26 16:14:39 INFO streaming.StreamJob: map 100% reduce 0%
12/09/26 16:14:42 INFO streaming.StreamJob: map 50% reduce 0%
12/09/26 16:14:45 INFO streaming.StreamJob: map 0% reduce 0%
12/09/26 16:14:54 INFO streaming.StreamJob: map 48% reduce 0%
12/09/26 16:14:57 INFO streaming.StreamJob: map 100% reduce 0%
12/09/26 16:15:04 INFO streaming.StreamJob: map 50% reduce 0%
12/09/26 16:15:13 INFO streaming.StreamJob: map 100% reduce 100%
12/09/26 16:15:13 INFO streaming.StreamJob: To kill this job, run:
12/09/26 16:15:13 INFO streaming.StreamJob: /media/sf_HadoopShare/libexec/../bin/hadoop job -Dmapred.job.tracker=raftaar-VirtualBox106:54311 -kill job_201209261603_0002
12/09/26 16:15:13 INFO streaming.StreamJob: Tracking URL: http://raftaar-VirtualBox106:50030/jobdetails.jsp?jobid=job_201209261603_0002
12/09/26 16:15:13 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201209261603_0002_m_000001
12/09/26 16:15:13 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, in.folder = if (is.list(input)) { :
  hadoop streaming failed with error code 1
```

Any idea?

Thanks,
Arun Tewatia

piccolbo commented 12 years ago

It could be that you're running into this R bug: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14766. I am thinking through our options to verify that's the case and fix it.

A


piccolbo commented 12 years ago

You can try this:

```r
cat(paste(rep("a", 1000), collapse = ""), file = "a.txt")
readLines("a.txt")
```

If it crashes R, then your best bet is to upgrade R.
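Either way, it's worth confirming which version each machine actually runs, since the bug above is version-dependent:

```r
# Plain base R: prints the version string of the current interpreter
R.version.string
```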

aruntewatia commented 12 years ago

Hi piccolbo,

Sorry for the late reply. I tried your suggestion to check for the mentioned bug and got the following output. R didn't crash.

```r
> cat(paste(rep("a", 1000), collapse = ""), file = "a.txt")
> readLines("a.txt")
[1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
Warning message:
In readLines("a.txt") : incomplete final line found on 'a.txt'
```

Anyway, I upgraded R from 2.14 to 2.15.

Thanks,
Arun Tewatia

aruntewatia commented 12 years ago

Hi,

I am trying to define a delimited input format, as suggested in the rmr tutorials; just working my way through the basic examples.

```r
tsv.reader = function(con, nrecs) {
  lines = readLines(con, 1)
  if (length(lines) == 0)
    NULL
  else {
    delim = strsplit(lines, split = ",")[[1]]
    keyval(delim[[1]],
           list(lon = delim[[2]],
                lat = delim[[3]],
                populationcount = delim[[4]]))
  }
}

log.col = function(input, output = NULL) {
  mapreduce(input = input,
            output = tempfile(),
            input.format = tsv.reader,
            map = function(k, v) { keyval(k, log(v$lon)) })
}

log.col("/user/hadoop/tempfile.csv")
Error in input.format$streaming.format : object of type 'closure' is not subsettable
```

My csv file tempfile.csv has the following format:

```
gridid,lon,lat,populationcount
1,68.162500,35.504167,-9999
2,68.170833,35.504167,-9999
3,68.179167,35.504167,-9999
4,68.187500,35.504167,-9999
5,68.195833,35.504167,-9999
```

Am I missing something major here ?

Thanks,
Arun Tewatia

piccolbo commented 12 years ago

Yes, an input format is a triple: a function (also called a format, for historical reasons), a mode string ("text" or "binary"), and a Java class (as a string). It is created with the function make.input.format, and in your case you could call

```r
make.input.format(format = tsv.reader, mode = "text")
```

with the last element left to its default. You are also ignoring the role of nrecs and assuming it is 1 all the time. If you don't want to bother fixing it, just call make.input.format("csv", sep = ",") and it should parse the example you showed. Of course, learning how to write an input format can become very handy later on.
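Putting that together with your log.col, a sketch (the as.numeric is needed because your reader yields character fields):

```r
# Wrap the reader with make.input.format so mapreduce() receives the full
# format triple (function, mode, java class) rather than a bare function
log.col = function(input, output = NULL) {
  mapreduce(input = input,
            output = tempfile(),
            input.format = make.input.format(format = tsv.reader, mode = "text"),
            map = function(k, v) { keyval(k, log(as.numeric(v$lon))) })
}
```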

Antonio


aruntewatia commented 12 years ago

Hi Antonio,

Thanks for your tips; I finally got a basic mathematical operation working through rmr:

```r
log.col = function(input, output = NULL) {
  mapreduce(input = input,
            output = tempfile(),
            structured = TRUE,
            input.format = make.input.format(format = tsv.reader, mode = "text"),
            output.format = make.output.format("csv", sep = ","),
            map = function(k, v) { keyval(k, as.numeric(v$lat) + as.numeric(v$lon)) },
            reduce = function(k, vv) keyval(k, mean(unlist(vv))))
}
```
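For anyone following along, it is called the same way as in the earlier comment:

```r
log.col("/user/hadoop/tempfile.csv")
```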

Cheers,
Arun Tewatia