Tencent / rapidjson

A fast JSON parser/generator for C++ with both SAX/DOM style API
http://rapidjson.org/

"18446744073709551617" is parsed as double and loses precision #661

Open thedrow opened 8 years ago

thedrow commented 8 years ago

Apparently Python can store numbers this large, and I ran into this issue while running tests. When the parsed value is converted to Python's Decimal, I get Decimal(1.8446744073709552e+19) == Decimal('18446744073709551616'), which is off by one from the input. This is either because of floating-point rounding (likely) or because of a bug in our number-parsing algorithm. There are several things we can do:

  1. We can expose long long, signed long long and unsigned long long handlers.
  2. Both GCC and Clang provide __int128 and unsigned __int128, which we can use for really big numbers.
  3. We can use one of the arbitrary-precision integer libraries out there.
  4. We can pass a boolean to Reader's Double() that indicates whether the number was written as an integer.
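
For reference, the rounding described above can be reproduced without RapidJSON. The sketch below uses only the C++ standard library, and the helper names are made up for illustration. It assumes unsigned long long is 64 bits wide and that strtod rounds correctly, which holds on mainstream platforms. 18446744073709551617 is 2^64 + 1, so it overflows a 64-bit unsigned integer, and the nearest double is exactly 2^64.

```cpp
#include <cerrno>
#include <cmath>
#include <cstdlib>

// Hypothetical helper: true if the decimal text fits in unsigned long long.
// std::strtoull reports overflow by setting errno to ERANGE.
bool fits_in_unsigned_long_long(const char* text) {
    errno = 0;
    char* end = nullptr;
    (void)std::strtoull(text, &end, 10);
    return errno != ERANGE && end != text && *end == '\0';
}

// Hypothetical helper: parse the same text as a double.
double parse_as_double(const char* text) {
    return std::strtod(text, nullptr);
}

// True when the parsed double is exactly 2^64, i.e. the trailing "+ 1" of
// 18446744073709551617 has been rounded away.
bool rounded_to_two_pow_64(const char* text) {
    return parse_as_double(text) == std::ldexp(1.0, 64);
}
```

Because 2^64 + 1 has no exact double representation, a more accurate double parser cannot recover the last digit; only an integer or string representation can.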

andrusha97 commented 8 years ago

You can also use kParseNumbersAsStringsFlag and handle big numbers in whatever way suits your use case, though this approach seems to have certain limits too.
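
As one illustration of handling the number yourself, here is a minimal sketch that assumes the number reaches the application as plain decimal text and that the compiler is GCC or Clang (unsigned __int128 is an extension, not standard C++). The parse_u128 helper is hypothetical and not part of RapidJSON.

```cpp
#include <optional>
#include <string_view>

// unsigned __int128 is a GCC/Clang extension, not standard C++.
__extension__ typedef unsigned __int128 u128;

// Hypothetical helper: convert unsigned decimal digits to a 128-bit integer.
// Returns std::nullopt for empty input, non-digit characters, or values
// larger than 2^128 - 1.
std::optional<u128> parse_u128(std::string_view digits) {
    if (digits.empty()) return std::nullopt;
    const u128 max = ~static_cast<u128>(0);
    u128 value = 0;
    for (char c : digits) {
        if (c < '0' || c > '9') return std::nullopt;
        const u128 d = static_cast<u128>(c - '0');
        // Reject before value * 10 + d can wrap around.
        if (value > (max - d) / 10) return std::nullopt;
        value = value * 10 + d;
    }
    return value;
}
```

With this, "18446744073709551617" converts to exactly 2^64 + 1, at the cost of one extra pass over the digits.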

pah commented 8 years ago

By default, RapidJSON parses floating-point numbers with an algorithm that is very fast but inaccurate in some corner cases; see Parsing to Double for more information.

Does it make a difference to pass the kParseFullPrecisionFlag to the Parse call?

thedrow commented 8 years ago

All of these are good suggestions, but do you see a reason not to support (unsigned) long long or __(u)int128 on platforms that support them? The number itself isn't a double per se. It's still an integer. It's simply too big to fit into an unsigned long integer.

miloyip commented 8 years ago

I can list a few:

  1. GenericValue is a variant type. It has been optimized to 16 bytes on both 32-bit and 64-bit architectures. Storing a 128-bit integer would require 32 bytes because of 128-bit alignment.
  2. Many JSON tools are unable to deal with 64-bit integers (e.g. they just use double for storing all numbers). Supporting 64-bit integers is already a surplus, needed by many 64-bit applications. RapidJSON's explicit handling of integers is also related to performance considerations.
  3. kParseNumbersAsStringsFlag was proposed and developed by community users for dealing with arbitrary precision.
  4. Python is designed to support arbitrary-precision numbers. This decision makes things easy for users but has a price in performance. Most non-scripting languages (e.g. C/C++) do not support this at the language level. Some users may argue that 256-bit integers, 128-bit floating point, or currency types are needed in their cases. This can be handled by point 3 if the user really needs it.
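
The size argument in point 1 can be checked with a small sketch. The two structs below are hypothetical stand-ins, not RapidJSON's actual GenericValue layout, and the sizes assume a typical 64-bit GCC or Clang target where unsigned __int128 is 16-byte aligned.

```cpp
#include <cstddef>
#include <cstdint>

// unsigned __int128 is a GCC/Clang extension, not standard C++.
__extension__ typedef unsigned __int128 u128;

// Hypothetical 16-byte variant: an 8-byte payload plus 8 bytes of type flags.
struct SmallValue {
    union { std::uint64_t u64; std::int64_t i64; double d; } data;
    std::uint64_t flags;
};

// The same variant with a 128-bit payload. The payload takes 16 bytes, and
// the flags push the struct past 16, so 16-byte alignment pads it to 32.
struct WideValue {
    union { u128 u; std::uint64_t u64; double d; } data;
    std::uint64_t flags;
};

constexpr std::size_t kSmallValueSize = sizeof(SmallValue);
constexpr std::size_t kWideValueSize = sizeof(WideValue);
```

On such a target kSmallValueSize is 16 and kWideValueSize is 32, so every value in a document would double in size even if no 128-bit number ever appears.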

thedrow commented 8 years ago

What if we dynamically allocate (unsigned/signed) long long/__(u)int128? That would not require 32 bytes. In fact, it can be stored as a void *. When we encounter a large enough integer, we'll let the user handle it and store it manually.

Converting all numbers from strings will increase my overhead by a lot (I haven't measured yet, but it's certain there will be significant overhead). If I could avoid that by providing support for bigger integers myself that would help binding to other languages.
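
The heap-allocation idea above could look roughly like the following. This is a sketch built on std::variant and std::unique_ptr rather than on RapidJSON's types; Number, make_number and is_heap_allocated are hypothetical names, and unsigned __int128 again assumes GCC or Clang.

```cpp
#include <cstdint>
#include <memory>
#include <variant>

// unsigned __int128 is a GCC/Clang extension, not standard C++.
__extension__ typedef unsigned __int128 u128;

// Hypothetical number type: values that fit in 64 bits are stored inline,
// larger ones live on the heap behind an 8-byte pointer.
using Number = std::variant<std::uint64_t, double, std::unique_ptr<u128>>;

// Pick the inline representation when possible, the heap one otherwise.
Number make_number(u128 value) {
    if (value <= UINT64_MAX) return static_cast<std::uint64_t>(value);
    return std::make_unique<u128>(value);
}

// True if the value had to be stored out of line.
bool is_heap_allocated(const Number& n) {
    return std::holds_alternative<std::unique_ptr<u128>>(n);
}
```

On common 64-bit standard library implementations this variant stays at 16 bytes, since every alternative is 8 bytes wide. The trade-off is one allocation per oversized integer.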

jschultz410 commented 5 years ago

I have a slightly different, but related concern: I need to know if an integer in the JSON text can't be represented without loss of precision.

I really don't want JSON conversions silently introducing errors into my data. I'd much rather be able to detect it somehow and reject the data. For example, I'd much rather the parser reject the integer 18446744073709552000 than interpret it as a double while losing something like 12 bits of precision.

I'm already using kParseFullPrecisionFlag. Is there a way to tell the parser to generate an error if a parsed integer value can't be represented exactly? If there isn't, then should there be?

Or is there a way to ask a Number Value if it lost precision while being parsed?
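
Outside RapidJSON's API, one way an application could detect this is a round-trip check on the original number text. The sketch below is an assumption-laden illustration: is_exact_as_double is a hypothetical helper, it expects a non-negative integer literal without leading zeros, and it relies on the C library printing every digit of a double with "%.0f" (glibc does).

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: true if a non-negative decimal integer literal
// (digits only, no leading zeros) survives a round trip through double.
bool is_exact_as_double(const std::string& digits) {
    const double d = std::strtod(digits.c_str(), nullptr);
    // "%.0f" prints the double's full integer value; the largest finite
    // double needs 309 digits, so 400 bytes is enough.
    char buf[400];
    const int n = std::snprintf(buf, sizeof buf, "%.0f", d);
    if (n < 0 || static_cast<std::size_t>(n) >= sizeof buf) return false;
    // Any rounding (or overflow to "inf") changes the printed digits.
    return digits == buf;
}
```

With this check, 18446744073709551616 (2^64) is accepted, while 18446744073709552000 and 18446744073709551617 are flagged because both round to 2^64.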