RevolutionAnalytics / rmr2

A package that allows R developers to use Hadoop MapReduce

[Bug?] in make.json.output.format #136

Closed saargolde closed 10 years ago

saargolde commented 10 years ago

The following example is somewhat contrived, but it shows an issue I ran into with the make.json.output.format function:

library(rmr2)  # needed for to.dfs, keyval, mapreduce, make.*.format, from.dfs

# write 1000 key/value pairs to the dfs as tab-separated text
ttt1 <- to.dfs(
  keyval(1:1000, 1000:1), 
  format = make.output.format(format = "csv", mode = "text", sep="\t"))
# identity map over the csv input, writing the result with the built-in json output format
ttt2 <- mapreduce(
  input = ttt1, 
  map = to.map(identity), 
  input.format = make.input.format(format = "csv", mode = "text", sep="\t", nrows = 10), 
  output.format = "json")
# read the json output back
ttt3 <- from.dfs(input = ttt2, format = "json")

You get an error:

Error in fromJSON(content, handler, default.size, depth, allowComments,  : 
  invalid JSON input

Looks like there is a problem in make.json.output.format, specifically in line 60. Current code:

      writeLines(paste(out, collapse = "\n"), sep = "", con = con)}

Should be:

      writeLines(paste(out, collapse = "\n"), sep = "\n", con = con)}

Otherwise, consecutive output chunks end up not being separated by a newline, which causes problems downstream... or at least that's my analysis of the problem.
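A minimal sketch of the effect in plain R, with no Hadoop involved (the variable names and toy records are purely illustrative): with sep = "" the last record written by one call and the first record written by the next call end up fused on a single line, while sep = "\n" terminates every record.

con <- textConnection("buf", open = "w")
# each call simulates the format function writing one batch of records
writeLines(paste(c('{"k":1}', '{"k":2}'), collapse = "\n"), sep = "", con = con)
writeLines(paste(c('{"k":3}', '{"k":4}'), collapse = "\n"), sep = "", con = con)
close(con)
buf
# "{\"k\":1}"  "{\"k\":2}{\"k\":3}"  "{\"k\":4}"   <- records 2 and 3 run together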

That change fixed the problem for me (my indentation is a bit different), as can be verified by running the following:

# Same as rmr2's built-in json output format, except for sep = "\n" in the final
# writeLines call. toJSON() is RJSONIO's; load RJSONIO if it is not already attached.
fixed.json.output.format <- list(
  mode = "text", 
  format = function (kv, con) {
    # serialize one key/value pair as tab-separated JSON on a single line
    ser = function(k, v) {
      paste(
        gsub("\n", "", toJSON(k, .escapeEscapes = TRUE, collapse = "")), 
        gsub("\n", "", toJSON(v, .escapeEscapes = TRUE, collapse = "")), 
        sep = "\t")
    }
    out = rmr2:::reduce.keyval(kv, ser, write.size)
    # the fix: terminate the last record with a newline as well
    writeLines(paste(out, collapse = "\n"), sep = "\n", con = con)
  }, 
  streaming.format = NULL, 
  backend.parameters = NULL
)

ttt1 <- to.dfs(
  keyval(1:1000, 1000:1), 
  format = make.output.format(format = "csv", mode = "text", sep="\t"))

ttt4 <- mapreduce(
  input = ttt1, 
  map = to.map(identity), 
  input.format = make.input.format(format = "csv", mode = "text", sep="\t", nrows = 10), 
  output.format = fixed.json.output.format)

ttt3 <- from.dfs(input = ttt4, format = "json")

(If I knew how to build from source, test, and submit pull requests I would do that, but as it stands my only option is to submit my bug reports and suggested fixes in this form.)

piccolbo commented 10 years ago

Can repro. Not absolutely sure about the fix, will look into it. Great analysis, very helpful. To build:

R CMD build path-to-pkg

To test:

R CMD check path-to-pkg

For a pull request, there is a top-ish, left-ish green button on the main project page, near the branch selector. That would be going the whole nine yards, but this report is more than helpful already. If you can't build from source, you'll have to wait for a release; so if you want the fix quickly, you'll need to build from source yourself even if I push the change today.
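As an aside, a possible shortcut, assuming devtools is installed and that the package sources live in the repo's pkg subdirectory (both of these are assumptions, and this still compiles the package locally, so it needs the same build tools):

devtools::install_github("RevolutionAnalytics/rmr2", subdir = "pkg")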

saargolde commented 10 years ago

The error behaves differently when you have a trivial mapper, of the form:

  # split the first csv column off as the key and keep the remaining columns as the value
  map = function(k, v) {
    myKeys <- v[, 1]
    myVals <- v[, -1]
    return(keyval(myKeys, myVals))
  }

Here you no longer get errors when reading with from.dfs, but the results get messed up: some observations are lost because the newline before them is missing. This is closer to the problem that inspired this bug report and the suggested fix; unfortunately I can't share that data directly...
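For the record, a hedged reconstruction of the kind of run that shows this (the exact call is an assumption; it reuses the parameter choices from the examples above):

# same csv input as before, but with an explicit key column
ttt5 <- mapreduce(
  input = ttt1,
  map = function(k, v) keyval(v[, 1], v[, -1]),
  input.format = make.input.format(format = "csv", mode = "text", sep = "\t", nrows = 10),
  output.format = "json")
# reads back without an error, but with the unpatched format some records are
# silently lost because the newline separating them from the previous record is missing
ttt6 <- from.dfs(ttt5, format = "json")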

piccolbo commented 10 years ago

I tried your fix, but I am not convinced the output is correct:

> cat(readLines("/tmp/file6284530432c9/part-00000"))
[ null ]    { "V1": [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ],"V2": [ 1000, 999, 998, 997, 996, 995, 994, 993, 992, 991 ] } [ null ]    { "V1": [ 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 ],"V2": [ 990, 989, 988, 987, 986, 985, 984, 983, 982, 981 ] } [ null ]    { "V1": [ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 ],"V2": [ 980, 979, 978, 977, 976, 975, 974, 973, 972, 971 ] } [ null ]    { "V1": [ 31, 32, 33, 34, 35, 36, 37, 38, 39, 40 ],"V2": [ 970, 969, 968, 967, 966, 965, 964, 963, 962, 961 ] } [ null ]    { "V1": [ 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 ...

There should be a newline before each [null] I think. You need to scroll the above to see what I mean.

piccolbo commented 10 years ago

Actually, no, you are right, the output is correct, if you inspect it properly.
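For example (a sketch, reusing the path from the cat() dump above): readLines() already splits the file on newlines and cat() then joins the elements with spaces, which is why the line breaks were invisible in that dump.

length(readLines("/tmp/file6284530432c9/part-00000"))           # one element per output record
writeLines(readLines("/tmp/file6284530432c9/part-00000")[1:2])  # first two records, one per line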

piccolbo commented 10 years ago

It seems to me there is the same error in the text output format.

saargolde commented 10 years ago

I think my original example is confounding two problems: the behavior of the json output in the absence of keys when the input is csv format, and the missing newline at the end of each output 'chunk'.

piccolbo commented 10 years ago

OK, so we've got the second one under control. Let's split the first one into reading csv and writing json. Which one is the problem?

saargolde commented 10 years ago

After thinking about it more, I'm not sure the json output format's behavior is a real problem.

Initially I found it a bit weird that I started out with keys and values, and that after saving the data as csv and converting it to json with the identity mapper I lost the keys and ended up with a collection of entries, each containing V1 and V2... And not only were the keys lost, the resulting data is a collection of json objects, each holding a couple of arrays whose lengths depend on the size of the data chunk read by the input format.

In retrospect I should not have expected all these transformations to know or guess 'on their own' what was a key and what were values so that consistency was maintained. I should not be surprised that when I don't tell a computer how to interpret the data, it interprets it in a way that seems a bit weird to me.

In summary, I think the only problem was the missing "\n"... everything else was in my head.

piccolbo commented 10 years ago

No problem, so are you going to build from master and give it a try or wait for the next release?

saargolde commented 10 years ago

I'll wait till the next release. I need my cluster admin to do that anyway, and for now I have my workaround code working.

saargolde commented 10 years ago

I believe the change was not needed for the text output format; I actually ran into problems with the 'fixed' version there. The reason the change is not needed in the text output format is that every record already ends with a "\n" because of the 'paste' call, which is not the case in the json format. It'll take me some time to come up with a simple example that shows it...
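In the meantime, a minimal toy sketch of the mechanism I have in mind (an assumption about what the text format does, not the actual rmr2 source): if every record already carries its own trailing "\n" from paste(), then writing with sep = "" cannot fuse records.

con <- textConnection("buf2", open = "w")
writeLines(paste(c("a\t1", "b\t2"), "\n", sep = "", collapse = ""), sep = "", con = con)
writeLines(paste(c("c\t3"), "\n", sep = "", collapse = ""), sep = "", con = con)
close(con)
buf2
# "a\t1" "b\t2" "c\t3"   <- nothing fused, every record on its own line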

piccolbo commented 10 years ago

You are right, and I think the deeper reason is that the paste is superfluous; writeLines knows what to do. So I simplified the function a bit. Thanks for the code review.