eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library
115 stars 13 forks source link

fparse fails when set with 6 decimal points is followed by set with string value #93

Closed rjonesey closed 1 month ago

rjonesey commented 1 month ago

Encountered an issue where this specific json won't parse when there are 6 decimal places followed by a string value. It will successfully parse when there five numbers after the decimal.

Please let me know if there is anything else I can do to help.

Thank you

RcppSimdJson::fparse(
  r"({
  "tags":
  [
    {
      "value":0.123456
    },
    {
      "value":"test"
    }
  ]
})"
)
#> Error in .deserialize_json(json = json, query = query, empty_array = empty_array, : basic_string

RcppSimdJson::fparse(
  r"({
  "tags":
  [
    {
      "value":0.12345
    },
    {
      "value":"test"
    }
  ]
})"
)
#> $tags
#>      value
#> 1 0.123450
#> 2     test

Created on 2024-07-16 with reprex v2.0.2

eddelbuettel commented 1 month ago

Thanks, can reproduce. But now scratching my head as I cannot find (substrings of) the error message in either our glue code in src/ or in the included simdjson in inst/include.

@lemire Any quick guess where the string length constraint may come from?

PS 0: On my system the error is

> Error in .deserialize_json(json = json, query = query, empty_array = empty_array,  : 
  basic_string::erase: __pos (which is 9) > this->size() (which is 8)
> 

PS 1: Oh I see now. This is specific to text columns. If it is all numbers we have no issue:

> RcppSimdJson::fparse(
  r"({
  "tags":
  [
    {
      "value":0.123456
    },
    {
      "value":1.234567
    }
  ]
})"
)
. + > $tags
     value
1 0.123456
2 1.234567

> 

That opens more avenue for mischief: maybe it is Rcpp forming the column vector? Still can't locate the error though.

lemire commented 1 month ago

@eddelbuettel The simdjson library itself avoids std::string as much as possible.

eddelbuettel commented 1 month ago

We use them / need them before materializing to R.

@rjonesey I don't think there will be a simple or quick fix. Sorry.

lemire commented 1 month ago

https://github.com/eddelbuettel/rcppsimdjson/blob/ad94a13ad5c5eab1274758cd4e936576b8b1613c/inst/include/RcppSimdJson/deserialize/scalar.hpp#L95

lemire commented 1 month ago

I don't think there will be a simple or quick fix. Sorry.

You might be overly pessimistic. :-)

lemire commented 1 month ago

I'll prepare a PR.

eddelbuettel commented 1 month ago

Wowser. Nice catch!!

JosiahParry commented 1 month ago

You all rock! Thank you so much. And thank you @rjonesey for filing an issue.

This was discovered while we were testing new functionality in arcgislayers::query_layer_attachments() to access image metadata. I will update the minimum required version of RcppSimdJson when a PR is made.

Thanks again for this wonderful package!

rjonesey commented 1 month ago

Thanks everyone for the quick work.

Apologies, did not mean to close the issue on comment.

lemire commented 1 month ago

https://github.com/eddelbuettel/rcppsimdjson/pull/94

eddelbuettel commented 1 month ago

Tagged as micro-release 0.1.12.1 which you can install from the repo or (within the hour) from the r-universe page.

rjonesey commented 1 month ago

The micro-release is now causing a new error with backslashes.

  RcppSimdJson::fparse(
    r"({
  "description":"\u0000"
})"
  )
#> Error in .deserialize_json(json = json, query = query, empty_array = empty_array, : Embedded NUL in string.

  RcppSimdJson::fparse(
    r"({
  "description":"\u"
})"
  )
#> Error in .deserialize_json(json = json, query = query, empty_array = empty_array, : STRING_ERROR: Problem while parsing a string

  RcppSimdJson::fparse(
    r"({
  "description":"u0000"
})"
  )
#> $description
#> [1] "u0000"

Created on 2024-07-18 with reprex v2.0.2

lemire commented 1 month ago

Can R support null chars in its strings?

eddelbuettel commented 1 month ago

I think so -- utf-8, being newer, is available but you may need to enable it via a call to encoding. In the Rcpp package we have a few simple unit tests (and had them for a decade+):

// [[Rcpp::export]]
String test_String_ctor_encoding2() {
    String y("å");
    y.set_encoding(CE_UTF8);
    return y;
}
eddelbuettel commented 1 month ago

Turns out that these days it also works without the encoding call. Welcome to 2024.

> Rcpp::cppFunction('String doesThisWork() { auto s = String{"å"}; s.set_encoding(CE_UTF8); return s; }')
> doesThisWork()
[1] "å"
> Rcpp::cppFunction('String doesThisWork() { auto s = String{"å"}; return s; }')
> doesThisWork()
[1] "å"
> Rcpp::cppFunction('std::string doesThisWork() { auto s = std::string{"å"}; return s; }')
> doesThisWork()
[1] "å"
> 
lemire commented 1 month ago

@eddelbuettel How do you feel about what this guy wrote in 2021?

Capture d’écran, le 2024-07-18 à 16 36 55

It seems that he thought that null characters were not allowed. I have no idea whether he was right.

eddelbuettel commented 1 month ago

Was that 2021 post from our repo here?

I am a little fuzzy on null characters -- but they are allowed in some strings. It may require a raw string though (so we would have to convert to and from). This comes up sometimes when serializing / deserializing. Digging ...

eddelbuettel commented 1 month ago

Here is something going to same direction (i.e. raw): https://stackoverflow.com/a/55006350/23224962

But yes, R being a C-based product sticks with C vectors so nul-terminated strings make it hard/impossible to assign it to standard character variables.

lemire commented 1 month ago

Was that 2021 post from our repo here?

Yes. @rjonesey's comment was once an issue that you closed...

https://github.com/eddelbuettel/rcppsimdjson/issues/68

lemire commented 1 month ago

Or maybe the reporter closed it:

Capture d’écran, le 2024-07-18 à 17 01 15
lemire commented 1 month ago

It is just not a bug, it is a specification issue. @rjonesey is trying to load a JSON which has a string with a nul in it. Yet R does not accept strings with nuls.

So what do you do?

Note that the JSON specification is very clear, it does allow nuls. It seems that R is very clear that it does not.

eddelbuettel commented 1 month ago

The key issue is still the R parser. You can't do it at the command-line:

> a <- "\u0000"
Error: nul character not allowed (<input>:1:12)
>

As demonstrated in #68, "below" R in an extension library such as simdjson we can of course do whatever we want. But there is an issue in how you'd prepare any such results for return back to R in particular in an R application such as a JSON parser expected to return strings as you cannot form such strings.

So I think if we wanted to support this we'd have to "invent" a new representation. Which will of course not be portable as it will be our one-off.

So I fear the standard answer of "there are limits in what we can parse as there are limits in what we can return" still holds.

lemire commented 1 month ago

@rjonesey Given that R’s strings disallow nuls, can you elaborate on what your application is doing?

rjonesey commented 1 month ago

Sure, I'll do my best here. Please let me know if the details I provide are insufficient. The fparse function is attempting to parse a response that contains a lot of metadata from images in a cloud service. The images are from hundreds of user uploads from their mobile device using an application called Survey123, an Esri product. These images are stored in an ArcOnline feature service, a sort of cloud object custom to ArcOnline. I am attempting to load the URLs of the images from the ArcOnline feature layer to then download the images themselves. I do not know why the metadata in the images contains so many nul characters. I think it has to do with the camera/phone settnigs of the individual users, as I can select specific records, and obtain the photos, while others fail. In the current function, if one photo fails the entire function fails.

eddelbuettel commented 1 month ago

For completeness here is what the the widely-used (yet slower) jsonlite says:

> library(jsonlite)
> fromJSON(r"( { "description":"\u0000" } )")
$description
[1] ""

> fromJSON(r"( { "description":"\u" } )")
Error: lexical error: invalid (non-hex) character occurs after '\u' inside string.
                     { "description":"\u" } 
                     (right here) ------^
> fromJSON(r"( { "description":"u0000" } )")
$description
[1] "u0000"

> 
rjonesey commented 1 month ago

@JosiahParry may be able to speak more to the details and functionality. He developed the function and package being used. Thank you all.

rjonesey commented 1 month ago

For completeness here is what the the widely-used (yet slower) jsonlite says:

> library(jsonlite)
> fromJSON(r"( { "description":"\u0000" } )")
$description
[1] ""

> fromJSON(r"( { "description":"\u" } )")
Error: lexical error: invalid (non-hex) character occurs after '\u' inside string.
                     { "description":"\u" } 
                     (right here) ------^
> fromJSON(r"( { "description":"u0000" } )")
$description
[1] "u0000"

> 

Very odd. The speed up with rcppsimdjson is nice. My example is likely flawed in the second part with the \u I was going for simplicity of example, but I do not know if there are any standalone \u strings in the metadata. Reading it as an empty string as it does in the first example would be acceptable for my specific use case. But I do not know if that is acceptable here.

eddelbuettel commented 1 month ago

This may end up being a local deployment problem for you. I searched a little -- and some answer out there show using an argument 'skipNul' (or alike) where is exists (readLines, similar in data.table::fread) or advocate for a quick filter over the input data.

lemire commented 1 month ago

@rjonesey

The micro-release is now causing a new error with backslashes. But I do not know if that is acceptable here.

The micro-release is not related to this new issue. In my view, it is not a bug in rcppsimdjson. The problem is that you are providing JSON data with nuls in its strings. So \u0000 is valid in JSON but it cannot be represented inside an R string. Thus it is an error. This is uncommon in my experience and likely an indication that your system is buggy. And if you really are meant to have nuls in your systems, and to ingest them in R, then I submit to you that you have to design issue. As far rcppsimdjson is concerned, I think we want an error because there is something off.

If you want the parser to act as if the \u0000 were not present, then you can include in your code a fast filtering function that removes instances of \u0000. This can be done at very high speed.

eddelbuettel commented 1 month ago

100% agree. It is preferable to error out on unpresentable input than to continue and pretend the data were different.

JosiahParry commented 1 month ago

Thanks all for the very thoughtful conversation! As mentioned previously, we are using RcppSimdJson as our primary way of processing json from many different services. https://github.com/R-ArcGIS/

In this case, we have a product called Survey123 which is used to collect data from the "field" (anywhere in the world collected via phones and other devices). We are parsing metadata about imagery that is collected from Survey123 using RcppSimdJson.

Many folks are capturing drone imagery and the metadata is in the exif format. So much of wonkiness is actually coming from the drone rather than anything else along in the process.

In this case, our workaround is to make returning metadata optional which will address @rjonesey's needs.

If this becomes a more persistent issue, I will see if I can find a lightweight R package—or make one—that can address embedded null characters in the JSON string.

Again, and as always, thank you for the great work and being so helpful!