from_ndjson - Githubissues

SymbolixAU commented 5 years ago

[x] read StringVector
[x] stream a file into a &Document
[x] tests
[ ] test using a rapidjson::Domrather than streaming the string with << '[' << ndjson << ']', then std::replace( json.begin(), json.end(), '\n', ',')
[ ] ~~update reverse imports (mapdeck, geojsonsf, spatialwidget) to use R_xlen_t values before release to CRAN.~~

dcooley commented 4 years ago

on branch issue58 (which inculdes to_ndjson)

sheffe commented 4 years ago

@dcooley I just installed this per your comment in #29 but something is erroring out quickly.

Pretty minimal reprex:

# remotes::install_github("SymbolixAU/jsonify", "issue58")
jsonify::from_ndjson('{"abc":123}')

fails with error: Error in rcpp_from_ndjson(json, simplify, fill_na) : object 'json' not found

I think this one's pretty straightforward. There are some switches in argument names between ndjson and json at various points. This example seems to be erroring here inside ndjson_to_r.character, at the return line: return(rcpp_from_ndjson( json, simplify, fill_na )). This affects raw strings and filepaths, and connections wrapped in file("pathtofoo.ndjson").

Edit: just forked to make this tweak and ran into more issues. Specifically, calling from_ndjson with json switched to ndjson in that return statement:

works on the simple '{"abc":123}' case
works for readLines(foo, 1) where 'foo' is every valid ndjson file I have on my hard drive
fails for all multi-row cases eg (readLines(foo, 2)) with the error below

To reproduce from my fork,

jsonify::from_ndjson(ndjson = '{"abc":123}')
jsonify::from_ndjson(paste0(c('{"abc":123}', '{"def":456}', collapse = "\n")))

with results:

> from_ndjson(paste0(c('{"abc":123}', '{"def":456}', collapse = "\n")))
Error in rcpp_from_ndjson(ndjson, simplify, fill_na) : 
  Expecting a single string value: [type=character; extent=3].
In addition: Warning messages:
1: In if (is_url(ndjson)) { :
  the condition has length > 1 and only the first element will be used
2: In if (file.exists(ndjson)) { :

 Error in rcpp_from_ndjson(ndjson, simplify, fill_na) : 
  Expecting a single string value: [type=character; extent=3].

dcooley commented 4 years ago

I've also messed up my branches so I'm now going through and fixing a load of branch conflicts.

However, your paste0 statement is creating a vector of strings

paste0(c('{"abc":123}', '{"def":456}', collapse = "\n"))
[1] "{\"abc\":123}" "{\"def\":456}" "\n"

Is this what you intended?

It looks like you may have the closing bracket in the wrong place?

paste0(c('{"abc":123}', '{"def":456}'), collapse = "\n")
[1] "{\"abc\":123}\n{\"def\":456}"

sheffe commented 4 years ago

Yikes, not my finest moment on Github. You're correct, closing paren in the wrong spot, so this statement now works with the fixed argname json to ndjson described above: jsonify::from_ndjson(paste0(c('{"abc":123}', '{"def":456}'), collapse = "\n"))

On larger test files that I've written with to_ndjson in the past few days, I'm still seeing errors in from_ndjson, but I can't reproduce these easily on testcases at the moment. The crux of the issue is that this works fine:

lapply(readLines("somefile.ndjson"),
       function(x) from_ndjson(ndjson = x))

but this fails:

from_ndjson("somefile.ndjson")

with Error in rcpp_read_ndjson_file(normalizePath(ndjson), get_download_mode(), : json parse error

I can't see why each individual line succeeds but parsing the entire file fails. When I try to reproduce the error using a simple case,

test_list <- from_ndjson(paste0(c('{"abc":123}', '{"def":456}'), collapse = "\n"))
writeLines(to_ndjson(test_list), "testlist.ndjson")
from_ndjson("testlist.ndjson")

this also works fine.

So I need to figure out the structural difference between these files to reproduce. Will write back when I do. (This time without dumb errors!)

dcooley commented 4 years ago

Can you check how the ndjson is separated, is it by \n, or perhaps \r\n?

sheffe commented 4 years ago

edit: clarified this and not a problem with 58, working as intended

Ha, I can finally reproduce. This was nasty. Still don't have a good idea why it's happening.

When I wrote the files throwing json parse error calling from_ndjson, I was running SymbolixAU/jsonify@issue29. For that branch, writeLines(to_ndjson(df), "file.ndjson") with the default sep="\n" causes a trailing empty line in "file.ndjson" that causes the parse error, but every individual line was valid ndjson. I re-wrote the exact same data today after installing SymbolixAU/jsonify@issue58_fix -- still using writeLines(..., sep="\n") and the same code -- and found that the second presumed-identical file didn't cause a parse error.

The two files -- 1 written with to_ndjson yesterday on issue29, and 2 today on issue58_fix -- pass an all.equal() test once they're in R, but they have different MD5 hashes and readLines() on the two files comes up with different vector lengths (off by 1). The only difference is the empty trailing line for the file written from jsonify@issue29.

I discovered this on a production file, which is a deeply nested dataframe with list-cols. I don't see why the deeply-nested part would cause this, but until I can get a more minimal reprex together I just included an anonymized variant of the data.

(Github won't let me upload .rds directly, so I guess unzip this first.) testcase.rds.zip

remotes::install_github("SymbolixAU/jsonify", "issue29", force = TRUE)
testcase <- readRDS("testcase.rds")
writeLines(jsonify::to_ndjson(testcase), "issue29.ndjson")

remotes::install_github("SymbolixAU/jsonify", "issue58_fix", force = TRUE)
testcase <- readRDS("testcase.rds")
writeLines(jsonify::to_ndjson(testcase), "issue58.ndjson")

openssl::md5(file("issue29.ndjson"))
openssl::md5(file("issue58.ndjson"))

length(readLines("issue29.ndjson"))
length(readLines("issue58.ndjson"))

library(jsonify)
foo <- from_ndjson("issue58.ndjson")
bar <- from_ndjson("issue29.ndjson")
bar <- from_ndjson(paste0(readLines("issue58.ndjson", nrow(foo)), collapse = "\n"))
all.equal(foo, bar)

Seriously thought I was hallucinating halfway through this.

dcooley commented 4 years ago

from_ndjson on this list objet returns the elements one-level deeper. This probably isn't right.

  lst <- list(
    x = 1:5
    , y = list(
      a = letters[1:5]
      , b = data.frame(i = 10:15, j = 20:25)
    )
  )

from_ndjson( "{\"x\":[1,2,3,4,5]}\n{\"y\":{\"a\":[\"a\",\"b\",\"c\",\"d\",\"e\"],\"b\":[{\"i\":10,\"j\":20},{\"i\":11,\"j\":21},{\"i\":12,\"j\":22},{\"i\":13,\"j\":23},{\"i\":14,\"j\":24},{\"i\":15,\"j\":25}]}}" )

SymbolixAU commented 4 years ago

^^ Actually, this is probably correct, because we are splitting the list up during the to_ndjson() phase, and each 'row' of JSON has to become a valid object. Therefore, to re-combine it we have to turn the whole lot into an array.

In the future, if anyone specifically asks or this is an issue, we may have to come up with rules for handling

{"x":[1,2,3,4,5]}
{"y":{"a":["a","b","c","d","e"],"b":[{"i":10,"j":20},{"i":12,"j":22}]}}

where we would need to remove the trailing }\n from the previous row, and the leading "{" from the next row

SymbolixAU commented 4 years ago

Going to CRAN because I need to update the package to set stringsAsFactors = FALSE by default ( in my tests ) to comply with R 4.0.0, but keeping open to test the one remaining check box.

SymbolixAU / jsonify

from_ndjson #58