json_parse: NUMBER_ERROR: Problem while parsing a number

mbasmanova commented 11 months ago

Bug description

TEST_F(JsonFunctionsTest, jsonParse) {
  const auto jsonParse = [&](std::optional<std::string> value) {
    return evaluateOnce<StringView>("json_parse(c0)", value);
  };

  auto x = jsonParse("[233897314173811950000]");
}

VeloxUserError
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Problem while parsing a number
Retriable: False
Context: json_parse(c0)

presto> select json_parse('[233897314173811950000]');
          _col0
-------------------------
 [233897314173811950000]
(1 row)

Meta: 20231215_004903_00428_65rj9

CC: @kevinwilfong @amitkdutta @kagamiori @kgpai @bikramSingh91 @duanmeng @laithsakka

System information

n/a

Relevant logs

No response

mbasmanova commented 11 months ago

CC: @Yuhta

mbasmanova commented 11 months ago

lemire commented 11 months ago

In standard C++, we have 64-bit, 32-bit, 16-bit and 8-bit integers and well as 64-bit, 32-bit, and 16-bit floating-point integers. None of them can represent the value provided.

We can try with a 64-bit integer (uint64_t). The number233897314173811950000 which is 0xcadfa3000000001b0, is much larger than what an unsigned 64-bit integer can store, which is 0xffffffffffffffff, so no luck there. There are 128-bit integers as extension of some systems (e.g., GCC), but they are not standard (e.g., won't work with Visual Studio).

You can also parse it as a double (64-bit floating point integer):

#include "simdjson.h"
#include <iostream>

using namespace simdjson;

int main(void) {
  padded_string json = R"( [ 233897314173811950000 ])"_padded;
  ondemand::parser parser;
  ondemand::document doc;
  if(auto error = parser.iterate(json).get(doc); error) { std::cout << error << std::endl; return EXIT_FAILURE; }
  double val;
  if(auto error = doc.at(0).get_double().get(val)) { std::cout << error << std::endl; return EXIT_FAILURE; }
  printf("%.0f\n", val);
  return EXIT_SUCCESS;
}

This will print 233897314173811949568 or 26591046*2**43.

From the string "233897314173811950000", you parse the integer 233897314173811949568. Is that what you wanted? The simdjson library will do it for you if that's what you want, as demonstrated above...

You have to decide what you want to do with it. If you want to store it using your own type, then you can get access to the raw type and then provide your own parser (e.g., to MyDecimalType or to MyBigIntegerType). The simdjson library provides this functionality as well.

What the simdjson library does not provide is a custom big-integer or decimal type, because that's not standard. We do, however, make it easy for you to do your own parsing if you so desire. Please see the relevant section of our documentation.

Capture d’écran, le 2023-12-15 à 08 29 31

mbasmanova commented 11 months ago

@lemire Hi Daniel, thank you for detailed response. I understand what you are saying. This issue came up from a larger query, specifically, from this piece:

CAST(
    "json_parse"(logit) AS array(real)
)

Presto can execute this query successfully, Velox fails during json_parse because of the error above.

presto:di> select cast(json_parse('[233897314173811950000]') as array(real));
     _col0
----------------
 [2.3389731E20]
(1 row)

In the actual query 'logit' is a large array with mostly sane values. I trimmed it down to get a simple repro, hence, there is only one value in the repro and it looks kind of strange. ML users tend to be ok with some amount of garbage in the data and we would like to figure out a way to not fail queries like this. For context, we are trying to migrate existing workloads from Presto Java into Prestissimo and therefore cannot really afford failing existing queries.

mbasmanova commented 11 months ago

@lemire Daniel, a side question, current error is not very informative: "Problem while parsing a number". Would it be possible to change it so it clarifies the problem and maybe includes the "wrong" number in the message? This would make troubleshooting a lot easier.

Yuhta commented 11 months ago

Would it be possible to change it so it clarifies the problem and maybe includes the "wrong" number in the message?

We are caching the error message so this will not be useful for us. A better generic error message like "Number cannot be represented exactly" would be better

Yuhta commented 11 months ago

FYI folly JSON is also giving an overflow exception

Yuhta commented 11 months ago

@lemire Looks like ondemand is working in this case and won't giving error when parsing the top level document. Unfortunately in our use case, we have to use the DOM parser for stricter validation, is there an option in DOM parser to ignore such errors first (same as double fallback in folly, the result is discarded anyway in our code), and we will parse it second time using the ondemand parser when actually extracting the values?

lemire commented 11 months ago

ML users tend to be ok with some amount of garbage in the data and we would like to figure out a way to not fail queries like this.

You should probably parse it as a double (get_double()), right? I am not sure where the error occurs in your system, but I am going to guess that you are first trying to parse it as an integer, because that what it looks like... Normally, a system would serialize such a float value as something like '2.3389731417381195E+20'. Anyhow, it is written as an integer, but won't parse as such. It is hard to see how you can arrive at such a value, but a distinct possibility is that you had a decimal that got serialized. Or maybe it is meant to be a big integer?

https://github.com/facebookincubator/velox/blob/40557f810d011914f27f616b75bf59630b6e0d74/velox/functions/prestosql/SIMDJsonFunctions.h#L96

What you could do, if it looks like an integer, but it cannot be parsed as such (possibly because it overflows), is fallback and parse it as a double (thus losing precision but at least getting some value).

The simdjson library won't do that automatically for you... because that's unsafe in general... Consider:

> Double(UInt64(9223372036854775809)) == 9223372036854775808

So you have an identifier that you write as 9223372036854775809. You parse it as a double, and serialize it back and get something that is equal to 9223372036854775808. If 9223372036854775809 was an identifier, you just accidentally changed its value. Think about ingesting JSON, and then outputting something that is accidentally different from the input !!!

If you know that is fine in your system, then you can parse it as a double... but the library cannot assume that it is.

This is should be an exceptional scenario: if this keeps happening, there it warrants further examination as there might be something more to it. But if it is exceptional, then the fix can be relatively expensive computationally without harm.

Would it be possible to change it so it clarifies the problem and maybe includes the "wrong" number in the message?

You can print the string content where the error occurred, see our documentation. You could also capture the raw_json() string_view and include it in the error message. It may be useful to indicate what you were trying to do at the time (error while parsing string X as a 64-bit signed integer). In fact, just reporting that the error occurred as you were trying to parse the number as an 64-bit signed integer (if that is what is happening here) would be illuminating.

If the string is just made of digits and you tried to parse an integer, then the only possible error is an overflow. There are are few other rules, but they are quite limited. See section 6 of the RFC. Other errors include a leading zero (0123 is not allowed) or a leading + (+1232 is not allowed).

lemire commented 11 months ago

@Yuhta It is possible to build simdjson so that it will skip number parsing... SIMDJSON_SKIPNUMBERPARSING but that may not be what you want. We could create some macro that meets your needs... It is not very difficult engineering.

The failure comes around here...

https://github.com/simdjson/simdjson/blob/ebd09cb2a3ab421f04b404338c01b593ade5a08c/include/simdjson/generic/numberparsing.h#L591

We could add a VELOX-specific macro. Of course, it depends on what exactly you want to do.

(If you are using a single-header version of the library, you will find the same code... it just gets aggregated.)

lemire commented 11 months ago

We have handful of undocumented macros that fix specific problems in specific systems, adding one here would not be a problem.

Yuhta commented 11 months ago

@lemire Thanks for the code point. I think we will also need to optionally have a branch here to do WRITE_DOUBLE if the range falls out of the both int64 and uint64: https://github.com/simdjson/simdjson/blob/ebd09cb2a3ab421f04b404338c01b593ade5a08c/include/simdjson/generic/numberparsing.h#L616

Does that sound ok?

lemire commented 11 months ago

@Yuhta Yep. That'd be fine.

Yuhta commented 11 months ago

@lemire Will you create the change? Or I can submit a PR?

lemire commented 11 months ago

@Yuhta I'd prefer a PR because I am not sure exactly what you want to happen. If you make the changes, you can validate that it does what you want. I can help, but if you look at the code, you will agree, I think, that it should not be difficult.

Yuhta commented 11 months ago

@lemire Thanks I will create a PR then

lemire commented 11 months ago

Ping me if you want me to help!

Yuhta commented 9 months ago

@lemire I managed to solve this issue by using my own tree walker with ondemand parser, so the change in DOM parser is no longer needed

facebookincubator / velox