RevolutionAnalytics / RHadoop

RHadoop
https://github.com/RevolutionAnalytics/RHadoop/wiki

What exactly is a big.data.object? #174

Closed — everdark closed this issue 11 years ago

everdark commented 11 years ago

Sorry, me again. These days I've tried several exercises with the mapreduce programming framework under RHadoop. Sometimes the result was amazing, but sometimes it really confused me.

To be direct: what exactly is the structure of the so-called big.data.object? What is its mode? Since R is object-oriented, I find it difficult to code without even basic knowledge of the object's mode. What exactly is the object we are touching within the map function and the reduce function?

To be concrete, please consider the following (simple) word count facility.

lines <- c("This is a test text file.", "And this is the second line.")
paper <- to.dfs(lines)

wordCount <- function(input, output=NULL, pattern=" ") {

    wordCount.map <- function(., v) {
        keyval(unlist(strsplit(v, split=pattern)), 1)
    }

    wordCount.reduce <- function(k, vv) {
        keyval(k, length(vv))
    }

    mapreduce(
        input=input, output=output,
        map=wordCount.map, reduce=wordCount.reduce, combine=FALSE
    )
}

result <- as.data.frame(
    from.dfs(wordCount(input=paper, pattern=" ")),
    stringsAsFactors=FALSE
)
result <- result[order(result$val, decreasing=TRUE),]
result

Basically it works. But if I change the reducer to be:

keyval(k, sum(vv)) # instead of length

then it fails: the streaming job completes, but the output is meaningless, like this:

   key              val
4  is      1.059962e-314
1  a       5.299809e-315
2  And     5.299809e-315
3  file.   5.299809e-315
5  line.   5.299809e-315
6  second  5.299809e-315
7  test    5.299809e-315
8  text    5.299809e-315
9  the     5.299809e-315
10 this    5.299809e-315
11 This    5.299809e-315

What happens here? It seems that for the object "vv" read into the reducer, length() works fine while sum() simply fails to do what is expected. Things get more complicated when the computation involves more functions: the result is a mess, or the streaming job does not complete at all.
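For reference, the reduce semantics this code relies on can be simulated locally in plain base R, without Hadoop or rmr (a sketch only, using split() as a stand-in for the shuffle phase), confirming that length() and sum() should agree:

```r
# Local simulation of the map/shuffle/reduce phases (no Hadoop, no rmr):
# map emits a (word, 1) pair per word; the shuffle groups the 1s by key;
# the reducer then sees, per key, the vector of all its 1s, so length(vv)
# and sum(vv) should both give the word's count.
lines <- c("This is a test text file.", "And this is the second line.")
words <- unlist(strsplit(lines, split = " "))   # map phase: one key per word
grouped <- split(rep(1, length(words)), words)  # shuffle: group the 1s by key
counts.len <- sapply(grouped, length)           # reducer using length()
counts.sum <- sapply(grouped, sum)              # reducer using sum()
all(counts.len == counts.sum)                   # TRUE on a sane platform
```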

The framework here feels like a black box hiding something that really matters. Or maybe I am just missing one or two important things. Can anybody give me a clue about this?

piccolbo commented 11 years ago

On Wed, Jan 23, 2013 at 1:09 AM, everdark notifications@github.com wrote:

Sorry, me again. These days I've tried several exercises with the mapreduce programming framework under RHadoop. Sometimes the result was amazing, but sometimes it really confused me.

To be direct, what exactly is the structure of the so-called big.data.object?

It's a closure.

What is its mode?

Function
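A minimal sketch of the closure idea (illustrative only; make.big.data.object and the path are hypothetical, not rmr's actual implementation):

```r
# Hypothetical sketch: a "big data object" can simply be a function that,
# when called, yields the location of the data it stands for. The path is
# captured in the closure's environment, not stored in any S3/S4 object.
make.big.data.object <- function(path) {
  function() path
}

bdo <- make.big.data.object("/tmp/words-4a2b")  # made-up HDFS-like path
mode(bdo)  # "function"
bdo()      # "/tmp/words-4a2b"
```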

Since R is object-oriented

We disagree on this one. No objects were used in the making of rmr.

I find it difficult to code without even basic knowledge of the object's mode.

The specific implementation should be of no interest to you. It's supposed to be an abstraction. You are using it just right in your code.

What exactly is the object we are touching within the map function and the reduce function?

None, if you are asking about S3 or S4 objects. Otherwise you need to explain to me what you mean by "touch".

To be concrete, please consider the following (simple) word count facility.

lines <- c("This is a test text file.", "And this is the second line.")
paper <- to.dfs(lines)

wordCount <- function(input, output=NULL, pattern=" ") {

    wordCount.map <- function(., v) {
        keyval(unlist(strsplit(v, split=pattern)), 1)
    }

    wordCount.reduce <- function(k, vv) {
        keyval(k, length(vv))
    }

    mapreduce(
        input=input, output=output,
        map=wordCount.map, reduce=wordCount.reduce, combine=FALSE
    )
}

result <- as.data.frame(
    from.dfs(wordCount(input=paper, pattern=" ")),
    stringsAsFactors=FALSE
)
result <- result[order(result$val, decreasing=TRUE),]
result

Basically it works. But if I change the reducer to be:

keyval(k, sum(vv)) # instead of length

then it fails: the streaming job completes, but the output is meaningless, like this if I print it:

   key              val
4  is      1.059962e-314
1  a       5.299809e-315
2  And     5.299809e-315
3  file.   5.299809e-315
5  line.   5.299809e-315
6  second  5.299809e-315
7  test    5.299809e-315
8  text    5.299809e-315
9  the     5.299809e-315
10 this    5.299809e-315
11 This    5.299809e-315

What happens here?

You are running it on an unsupported platform, probably 32 bit, and the serialization breaks for floating point. Your code works for me.
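To illustrate the kind of breakage meant here (not rmr's actual serialization code): reinterpreting the wrong bytes as an IEEE 754 double yields tiny denormal values of the same flavor as the 5.3e-315 garbage above.

```r
# Illustration only: write two 32-bit integers (1 and 0) and read the
# resulting 8 bytes back as a single double. The bit pattern lands in the
# denormal range, producing a tiny value instead of 1.
bytes <- writeBin(c(1L, 0L), raw())
readBin(bytes, what = "double", n = 1)  # a denormal (~5e-324 on little-endian x86)
```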

It seems that for the object "vv" read into the reducer, length() works fine while sum() simply fails to do what is expected. Things get more complicated when the computation involves more functions: the result is a mess, or the streaming job does not complete at all.

The framework here feels like a black box hiding something that really matters.

The goal is to hide what doesn't matter, like the mode of the big data object. I don't want users to have to know how things work behind the scenes because that makes it harder on them and makes it impossible for me to improve things. Of course if you want to get into developing rmr itself, it's a different story.

Or maybe I am just missing one or two important things.

Check your platform. You are probably seeing a lowly bug but looking in all the wrong directions.

Antonio

Can anybody give me a clue about this?

— Reply to this email directly or view it on GitHub: https://github.com/RevolutionAnalytics/RHadoop/issues/174