doloopwhile / pyjq

A Python binding for ./jq
MIT License
195 stars 30 forks source link

Integers Becoming Floats Beyond 31 Bits #24

Open mfeit-internet2 opened 5 years ago

mfeit-internet2 commented 5 years ago

A user of one of my pyjq-enabled programs observed that large integers we being returned from his filters as floats, causing validation that happened downstream to fail. I narrowed the problem to this example (seen on 2.1.0 and 2.2.0 with Python 2.7.5 and Cython 0.29.1):

import pyjq

for script in [
        ".x = 2147483647",
        ".x = 2147483648"
]:  
    filt = pyjq.compile(script)
    print script, "->", filt.all(None) 
.x = 2147483647 -> [OrderedDict([(u'x', 2147483647)])]
.x = 2147483648 -> [OrderedDict([(u'x', 2147483648.0)])]

Jq does not exhibit this behavior:

% jq -n '.x = 2147483647'
{
  "x": 2147483647
}

% jq -n '.x = 2147483648'
{
  "x": 2147483648
}

...at least not until the numbers get considerably larger:

% jq -n '.x = 9999999999999998'
{
  "x": 9999999999999998
}

% jq -n '.x = 9999999999999999'
{
  "x": 1e+16
}

Because the misbehaving numbers start life as text in a script, I don't think they're being turned into floats on the way in. If I run a filter of .x = 2147483648 | .y = (.x | tostring), the string is correct, so at least to that point jq still understands it as an integer.

The best I can tell is that this is happening on the way out in this code:

cdef object jv_to_pyobj(jv jval):
    kind = jv_get_kind(jval)
    ...
    elif kind == JV_KIND_NUMBER:
        v = jv_number_value(jval)
        if jv_is_integer(jval):
            return int(v)
        return v
    ...

Any thoughts on whether this should be done differently, or should I look for a bug in the code Cython generates?

Thanks.

doloopwhile commented 5 years ago

It seems impossible to fix.

While Python's int have unlimited precision, jq's Number have limited (32bit?) precision.

mfeit-internet2 commented 5 years ago

The best way to find a solution to a problem is to be told there isn't one. :-)

See https://github.com/perfsonar/pscheduler/issues/717#issuecomment-472962004 for an update.

VarutA commented 5 years ago

I got the same problem. Is this issue going to be solve?

doloopwhile commented 5 years ago

It seems that this bug's cause is that small INT_MAX and INT_MIN are set by cython.

mfeit-internet2 commented 5 years ago

I have a working fix for this problem that's not in production yet but has been tested well enough that it's in the pipeline. Because this is a hybrid jq/PyJQ problem, both packages have to be changed:

The perfSONAR project, where these changes originated, required an immediate fix, so we build and supply our own versions of both. The patches can be found in the project sources:

Alternately, you could use something similar to the stopgap fix I built into perfSONAR's PyJQ wrapper class while the problem was solved: https://github.com/perfsonar/pscheduler/blob/4.1.6/python-pscheduler/pscheduler/pscheduler/jqfilter.py

doloopwhile commented 4 years ago

What I found:

Jq and JavaScript holds any number (both of integer and float point number) as single number type Number. Number is implemented with double in Jq.

So, It looks reasonable that pyjq never convert Number to int ( and always convert Number to float). Current pyjq converts Number to int if the value is in range [INT_MIN, INT_MAX].

doloopwhile commented 4 years ago

Please give us your opinion:

Will "To convert Number to float always" work? Do you have a code which is broken by the strategy ?

mfeit-internet2 commented 4 years ago

I do have code in the field that will break if integers start coming out of PyJQ as floats but can continue running my patched PyJQ if necessary.

ECMA-404 is careful not to dictate internal representation, but it does treat JSON as a sequence of characters. From that standpoint, my take would be that PyJQ should make an effort to preserve type going into and coming out of jq in the interest of making sure data won't come out as a different sequence when converted from JSON to text. For example, if I give PyJQ input of 1234 and process it with a jq filter of ., converting to float by default would result in output of 1234.0, which wouldn't be the same sequence.

At the moment, jq is unable to handle the arbitrarily-large numbers that Python does, although there is an effort underway for that which has not yet been merged with the code. It might be worth maintaining the current state of things until that effort is complete.

televi commented 4 years ago

I'm sorry if this is a bit long - I tried to summarize roughly a day of investigation as concisely as I could without losing any relevant information.

We just stumbled on this issue as well. I'm not sure if this is relevant, but per RFC-7159, implementations should agree on the representation of anything between [-(2^53)+1, (2^53)-1] when the software supports IEEE 754-2008 (IEEE Standard for Floating-Point Arithmetic) binary64.

Assuming I'm reading that correctly, shouldn't both jq and pyjq correctly handle 2**53-1 as some type of int and represent it the same way? It looks like jq does this correctly:

$ jq --version
jq-1.6
$ echo '{"foo": 9007199254740991}' | jq '.'
{
  "foo": 9007199254740991
}

but pyjq does not:

$ pip list | grep pyjq
pyjq                      2.4.0 
$ python3 -c 'from pyjq import compile; x = compile("."); print(x.first({"foo": 9007199254740991}))'
{'foo': 9007199254740991.0}

I saw a comment re: Cython and INT_MAX, but I didn't understand why that would be the issue here. INT_MAX in Cython is from limits.h and would be in the range of 2^15. LONG_MAX would be 2^31 and LLONG_MAX would be 2^63. So if anything, the issue would seem to be related to LONG_MAX rather than INT_MAX.

I'm not a Cython expert (barely a novice) so the following may be completely wrong, but the above led me to look in in _pyjq.c to try and understand what may be going on.

When looked in _pyjq.c, I found something that didn't make sense. The two functions I looked at that confused me are static CYTHON_INLINE PyObject* __Pyx_PyInt_From_int(int value) and static CYTHON_INLINE PyObject* __Pyx_PyInt_From_long(long value). Both are doing similar things to convert a C int type into a Python int type.

In __Pyx_PyInt_From_int at around line 6591 in the case where an int is a signed int, it checks to see if an int is smaller than a long before finally checking to see if an int will fit in a long long before returning PyLong_FromLongLong((PY_LONG_LONG) value).

In __Pyx_PyInt_From_long, however, at around line 6644 it checks to see if sizeof(long) <= sizeof(long). I think that will always be true independent of the architecture so the else clause will never be be used. Essentially, I think that code will always result in __Pyx_PyInt_From_long returning PyInt_FromLong((long) value) and never trying to return PyLong_FromLongLong((PY_LONG_LONG) value).

I found a similar construct in static CYTHON_INLINE long __Pyx_PyInt_As_long at line 6985 where it is converting a Python int to a C int.

The above would seem to fit the pattern of results - a C long is converted from a long long and always forced to fix back into a long due to sizeof(long) <= sizeof(long) always being true (and I think maybe vice versa).

televi commented 4 years ago

OK - I think I have this figured out. It's not the above (but at least I learned some cool stuff about how Cython works).

When pyjq builds, its using the jq from the dependencies folder and that is version 1.5. You can see that in setup.py around line 32. That version has a problem in that the "jv_is_integer()" function does this:

int jv_is_integer(jv j){
  if(jv_get_kind(j) != JV_KIND_NUMBER){
    return 0;
  }
  double x = jv_number_value(j);
  if(x != x || x > INT_MAX || x < INT_MIN){
    return 0;
  }

  return x == (int)x;
}

The 1.6 version of jq doesn't do that - it does this:

int jv_is_integer(jv j){
    if(!JVP_HAS_KIND(j, JV_KIND_NUMBER)){
        return 0;
    }
    double x = jv_number_value(j);
    double ipart;
    double fpart = modf(x, &ipart);
    return fabs(fpart) < DBL_EPSILON;
}

I think the solution here is to update the tar ball for jq in dependencies to 1.6.

dr-duplo commented 2 years ago

Any progress on that one? I saw efforts to bring pyjq to jq version 1.6 which would solve the issue. Any update?