dmeranda / demjson

Python module for JSON data encoding, including jsonlint. See the project Wiki here on Github. Also read the README at the bottom of this page, or the project homepage at
http://deron.meranda.us/python/demjson/

jsonlint changes float values #24

Closed technimad closed 4 years ago

technimad commented 7 years ago

I'm seeing the following unexpected behaviour:

I have a data structure which contains floats, like

"settings":{"max":108.9}

After running jsonlint -s this gets changed to

"settings" : { "max" : 108.90000000000001 }

which is a different value and causes a subsequent run of jsonlint to report the warning: Warning: Floats larger or more precise than an IEEE "double" may not be portable

The parameter --keep-format doesn't influence this behaviour.

I would expect the linter not to change the value of floats and generate output which would not contain warnings when run through the linter again.

dmeranda commented 7 years ago

I'm not seeing that myself with the current version (2.2.4):

$ echo '{"max":108.9}' | jsonlint -f -s
{ "max" : 108.9 }

Can you report the jsonlint version and other platform information? The output from running:

jsonlint --version -v

Also you might try adding the --stats option to jsonlint and see if it is reporting any IEEE floating point overflows, underflows, etc. Thanks

technimad commented 7 years ago

Demo of the issue on my system

$ echo '{"max":108.9}' | jsonlint -f -s
{ "max" : 108.90000000000001 }

Jsonlint version

$ jsonlint --version -v
jsonlint (demjson) version 2.2.4 (2015-12-22)
demjson from '/xxxREDACTEDxxx/lib/demjson-master/demjson.pyc'
Python version: 2.6.6 (r266:84292, Nov 21 2013, 10:50:32)  [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)]
This python implementation supports:
  * Max unicode: U+10FFFF
  * Unicode version: 5.1.0
  * Floating-point significant digits: 16
  * Floating-point max 10^exponent: 308
  * Floating-point has signed-zeros: Yes
  * Decimal (bigfloat) support: Yes

My system is a RHEL 6.6 machine

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.6 (Santiago)
$ uname -a
Linux vdi7653 2.6.32-504.46.1.el6.x86_64 #1 SMP Sun Feb 28 13:45:01 EST 2016 x86_64 x86_64 x86_64 GNU/Linux

Detailed stats info from jsonlint, then feeding the output back into jsonlint

$ echo '{"max":108.9}' | jsonlint -f -s --stats
<stdin>: ----- Begin JSON statistics
   | Number of integers:
   |     8-bit:         0   (-128 to 127)
   |    16-bit:         0   (-32768 to 32767)
   |    32-bit:         0   (-2147483648 to 2147483647)
   |  > 53-bit:         0   (-9007199254740991 to 9007199254740991 - overflows JavaScript)
   |    64-bit:         0   (-9223372036854775808 to 9223372036854775807)
   |  > 64 bit:         0   (not portable, may require a "Big Num" package)
   |    total ints:     0
   |    Num -0:         0   (negative-zero integers are not portable)
   | Number of floats:
   |    doubles:        1
   |  > doubles:        0   (will overflow IEEE doubles)
   |    total flts:     1
   |    Num -0.0:       0   (negative-zero floats are usually portable)
   | Number of:
   |    nulls:          0
   |    booleans:       0
   |    arrays:         0
   |    objects:        1
   | Strings:
   |    number:             1 strings
   |    max length:         3 characters
   |    total chars:        3 across all strings
   |    min codepoint: U+0061  (LATIN SMALL LETTER A)
   |    max codepoint: U+0078  (LATIN SMALL LETTER X)
   | Other JavaScript items:
   |    NaN:             0
   |    Infinite:        0
   |    undefined:       0
   |    Comments:        0
   |    Identifiers:     0
   | Max items in any array:     0
   | Max keys in any object:     1
   | Max nesting depth:          1
   | Unnecessary whitespace:     1 of 14 characters (7.14%)
<stdin>: ----- End of JSON statistics
{ "max" : 108.90000000000001 }
$ echo '{ "max" : 108.90000000000001 }' | jsonlint -f -s --stats
<stdin>:1:10: Warning: Floats larger or more precise than an IEEE "double" may not be portable
   |  At line 1, column 10, offset 10
<stdin>: ----- Begin JSON statistics
   | Number of integers:
   |     8-bit:         0   (-128 to 127)
   |    16-bit:         0   (-32768 to 32767)
   |    32-bit:         0   (-2147483648 to 2147483647)
   |  > 53-bit:         0   (-9007199254740991 to 9007199254740991 - overflows JavaScript)
   |    64-bit:         0   (-9223372036854775808 to 9223372036854775807)
   |  > 64 bit:         0   (not portable, may require a "Big Num" package)
   |    total ints:     0
   |    Num -0:         0   (negative-zero integers are not portable)
   | Number of floats:
   |    doubles:        0
   |  > doubles:        1   (will overflow IEEE doubles)
   |    total flts:     1
   |    Num -0.0:       0   (negative-zero floats are usually portable)
   | Number of:
   |    nulls:          0
   |    booleans:       0
   |    arrays:         0
   |    objects:        1
   | Strings:
   |    number:             1 strings
   |    max length:         3 characters
   |    total chars:        3 across all strings
   |    min codepoint: U+0061  (LATIN SMALL LETTER A)
   |    max codepoint: U+0078  (LATIN SMALL LETTER X)
   | Other JavaScript items:
   |    NaN:             0
   |    Infinite:        0
   |    undefined:       0
   |    Comments:        0
   |    Identifiers:     0
   | Max items in any array:     0
   | Max keys in any object:     1
   | Max nesting depth:          1
   | Unnecessary whitespace:     5 of 31 characters (16.13%)
<stdin>: ----- End of JSON statistics
{ "max" : 108.90000000000001 }

dmeranda commented 7 years ago

Thanks for the output. The only thing that stands out at the moment is that it's using Python 2.6, which is quite ancient. Though demjson should still work with 2.6, I wonder if it has some slight difference in its floating point operations. I've just tested with Python 2.7 and don't see the problem. I'll need to set up a specific test environment that more closely matches yours to track this down. There's even a chance this could be a lower level "libc" issue too, as RHEL 6.6 has much older system libraries than I have.

In the meantime, is it possible for you to install a newer Python environment (2.7, or even 3.*) for running demjson/jsonlint?

technimad commented 7 years ago

Sorry, on this particular environment I cannot install anything.


dmeranda commented 7 years ago

This issue is seemingly caused by Python 2.6. Notice the difference (all run on the same Linux OS, so it's not a libc thing):

# Python 2.6.9
>>> repr( 108.9 )
'108.90000000000001'

# Python 2.7.3
>>> repr( 108.9 )
'108.9'

# Python 3.5.1
>>> repr( 108.9 )
'108.9'

Although demjson implements a complete custom parser for the input of JSON numbers, for output (generating JSON) it simply relies on Python's builtin repr() function to format floats or Decimals. This is because Python's repr() is always a legal JSON number (excluding NaNs and infinities).

Note that IEEE 754 double-precision floating point numbers (which the Linux C implementation of Python uses) carry approximately 16 significant decimal digits. However, floats are represented in binary with a 53-bit significand, which does not correspond to a whole number of decimal digits. So to output a number a compromise must be made: either round to avoid partial digits, or preserve all the binary bits even if a partial (ambiguous) digit must be output.

In any version of Python the number 108.9 is actually stored as 108.900000000000005684341..., but the digits from the "5" onward are not significant. In fact you'll find that 108.9 == 108.9000000000000056. Even if you round to one less significant digit you'll see the difficulty with decimal representations:

108.90000000000000 == 108.90000000000001  # True
108.90000000000001 == 108.90000000000002  # False
108.90000000000002 == 108.90000000000003  # False
108.90000000000003 == 108.90000000000004  # True
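To reproduce the comparisons above, here is a minimal sketch. It assumes Python 3 and uses the standard decimal module, which is an addition for illustration and not something demjson itself does here. Decimal(float) converts the stored binary value exactly, so it shows which double the literal 108.9 really becomes.

```python
from decimal import Decimal

x = 108.9

# Decimal(float) converts the stored binary value exactly, with no rounding.
exact = Decimal(x)
print(exact)

# The stored double sits slightly above the decimal literal 108.9.
print(exact > Decimal("108.9"))

# float.hex() exposes the binary significand and exponent directly.
print(x.hex())

# Decimal literals that are closer together than the spacing between
# adjacent doubles collapse onto the same stored value.
print(108.9 == 108.90000000000001)               # same double
print(108.90000000000001 == 108.90000000000002)  # different doubles
```

Adjacent doubles near 108.9 are 2**-46 (about 1.4e-14) apart, which is why a step of 1e-14 in the last decimal digit sometimes lands on the same double and sometimes on the next one.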

This is why demjson produces a lint warning when it sees the number 108.90000000000001, as it is not truly portable—notwithstanding that some JSON implementations may not even use IEEE 754 at all.

Generally, using 15 significant digits will ensure that a decimal representation survives a round-trip conversion from string to numeric format and back, while using 17 significant digits ensures that the binary representation survives a round trip. (The 15-digit guarantee assumes the value is not an IEEE subnormal, which carries fewer significant bits.)
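The two round-trip directions can be checked with a short sketch. The helper names below are hypothetical and chosen only for illustration, and the decimal check assumes the input string is already in the plain form that %g produces (no trailing zeros, no leading plus sign).

```python
def decimal_round_trips(text, digits=15):
    """True if a decimal string survives string -> float -> string."""
    return ("%.*g" % (digits, float(text))) == text


def binary_round_trips(value, digits=17):
    """True if a float survives float -> string -> float."""
    return float("%.*g" % (digits, value)) == value


# A 15-digit rendering gives back the decimal text that was typed in.
print(decimal_round_trips("108.9"))

# A 17-digit rendering gives back exactly the same double.
print(binary_round_trips(108.9))

# The 17-digit string is the form that the second jsonlint run warned about.
print("%.17g" % 108.9)
```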

The reference C implementations of Python 2.6 and Python 2.7 render a different number of significant decimal digits. I don't know what other implementations (PyPy, Jython, etc) do.
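One way to tell which behaviour a given interpreter has is sys.float_repr_style, which CPython 2.7 and 3.1+ provide. The sketch below assumes that an interpreter without the attribute (such as 2.6) uses the older 17-digit formatting; whether other implementations expose the attribute is not something this thread establishes.

```python
import sys

# "short" means repr() picks the shortest string that round-trips;
# "legacy" means the older fixed 17-significant-digit formatting.
# Interpreters that predate the attribute are treated as "legacy" here.
style = getattr(sys, "float_repr_style", "legacy")

print(sys.version_info[:3], style)
print(repr(108.9))
```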

dmeranda commented 7 years ago

Since Python 2.6 reached end-of-life in 2013 and I see no easy solution to this that isn't risky in terms of potentially introducing further bugs, I am unlikely to provide a patch for this issue. Though certainly feel free to fork and/or provide a pull request, or convince me this needs fixing.

Still, there are some things you may be able to do if you have to use Python 2.6.

Choice 1: If you can edit the demjson.py source (or make a local copy), you may be able to change the code which outputs floating point values. In version 2.2.4, at line 4038 of demjson.py, you'll see:

else:
    # A normal float.
    state.append( repr(n) )

you may be able to substitute the repr() call with something more like this (not tested!):

else:
    s = "%.16g" % n
    if "e" not in s and "." not in s:    # [edited]
        s = s + ".0"
    state.append( s )

Note that repr() is similar to the %g format specifier, but unlike %g it ensures there is always a fractional part.
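For trying the idea outside of demjson, here is the same substitution as a standalone function, followed by a hypothetical variant that is not part of demjson. The variant tries 15, 16 and then 17 significant digits and keeps the first string that converts back to the identical float, since 16 digits alone do not always round-trip. Both functions assume a finite float; NaN and infinities would need to be handled before calling them.

```python
def format_float_16g(n):
    """The suggested replacement for repr(n), as a standalone function."""
    s = "%.16g" % n
    if "e" not in s and "." not in s:
        s = s + ".0"  # keep a fractional part, as repr() does
    return s


def format_float_shortest(n):
    """Hypothetical variant: fewest of 15..17 digits that round-trips."""
    for digits in (15, 16, 17):
        s = "%.*g" % (digits, n)
        if float(s) == n:
            break  # this rendering converts back to the same double
    if "e" not in s and "." not in s:
        s = s + ".0"
    return s


print(format_float_16g(108.9))
print(format_float_16g(2.0))

# 16 digits can lose the last bit: this value needs 17 to round-trip.
x = 0.1 + 0.2
print(float(format_float_16g(x)) == x)
print(format_float_shortest(x))
```

The fixed 16-digit form is simpler and avoids the lint warning, at the cost of occasionally changing the last bit of a value; the fallback variant preserves the value but can still emit 17 digits, which jsonlint would flag again.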

Choice 2: You can tell jsonlint to ignore the excess significant digits and not output a warning:

jsonlint --allow non-portable

though this will also suppress other kinds of warnings about non-portable data too.

dmeranda commented 7 years ago

Just an FYI for completeness: Python's built-in json module (which is derived from simplejson) exhibits the same issue:

# Python 2.6
>>> import json
>>> json.dumps(108.9)
'108.90000000000001'

# Python 2.7 through 3.5
>>> import json
>>> json.dumps(108.9)
'108.9'

technimad commented 7 years ago

Many thanks for your in-depth investigation and possible solutions! Reading your explanation, I'm convinced this doesn't need fixing, as it only happens when using an EOL version of Python.

I'll look into your suggestions to solve this for our particular use case.