apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Allow safe cast from int64 to float64 in compute kernel for numerics that can be accurately represented above 2^53 #35563

Open danepitkin opened 1 year ago

danepitkin commented 1 year ago

Describe the enhancement requested

See the discussion in https://github.com/apache/arrow/issues/34901.

To summarize, safe casting of int64 to float64 for integers greater than 2^53 is currently not supported in Arrow. However, according to the double-precision floating-point format [1] defined in the IEEE 754-2008 standard, certain integers above 2^53 can still be represented exactly: from 2^53 to 2^54, the representable integers occur every 2^1; from 2^54 to 2^55, every 2^2; from 2^55 to 2^56, every 2^3; and so on.

For example, the integer 18014398509481984 (2^54) can be safely cast between int64 and float64 according to the IEEE 754-2008 standard, but not in Arrow. (The integer 18014398509481983, i.e. 2^54 - 1, CANNOT be cast safely, however.)

>>> import pyarrow as pa
>>> arr = pa.array([18014398509481984], type=pa.int64())
>>> arr.cast(pa.float64())
Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Integer value 18014398509481984 not in range: -9007199254740992 to 9007199254740992
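
For reference, the "safe" condition reduces to a round-trip check: an int64 value is exactly representable as a float64 iff converting it to float and back yields the original integer. A sketch in plain Python (fits_float64_exactly is a hypothetical helper for illustration, not an Arrow API):

>>> def fits_float64_exactly(value):
...     return int(float(value)) == value
...
>>> fits_float64_exactly(2**54)       # exactly representable; spacing is 2 in this range
True
>>> fits_float64_exactly(2**54 - 1)   # falls between two representable floats
False
>>> fits_float64_exactly(2**53 + 1)   # the first positive integer float64 cannot represent
False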

[1] https://en.wikipedia.org/wiki/Double-precision_floating-point_format

Component(s)

C++

zfoobar commented 1 year ago

hi @danepitkin - I mentioned some of this in the comment thread for issue #35584.

Quick question: what was the design decision that drove checking IEEE 754 compliance at the Python layer, when the Arrow C++ runtime can already be relied upon to determine whether or not a given int64 can be safely converted (e.g. with a static casting check)?

danepitkin commented 1 year ago

I believe the original implementation of the python-to-arrow functionality, many years ago, was intended as a serialization/deserialization mechanism like pickle. Nowadays I would call it primarily a legacy implementation that should eventually be rewritten to take advantage of the C++ compute kernels where possible.

zfoobar commented 1 year ago

Hmm. I don't see much value in putting a band-aid in helpers.py that allows you to safely convert the appropriate numbers above 2^53. If there is a sound design reason why Arrow doesn't support unsafe float conversions, it should be noted before anyone takes a stab at this. If there isn't, then we should cut to the chase and let the C++ side convert it and return it (or, slightly lazier, just let Python convert it; I believe it's just a C double under the hood).

My two cents: people using floats this large should probably be aware of how floating point works, and proceed at their own risk with respect to precision loss and/or rounding, rather than having the hammer thrown at them even for safe values (as is currently the case).
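
As a side note, for the explicit cast path pyarrow already exposes this opt-out via the safe flag, which skips the range check and accepts whatever precision loss results; the question in this issue is only about the default safe=True behavior:

>>> import pyarrow as pa
>>> arr = pa.array([18014398509481984], type=pa.int64())
>>> arr.cast(pa.float64(), safe=False)[0].as_py()
1.8014398509481984e+16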

jorisvandenbossche commented 1 year ago

Let's keep this discussion focused on the actual logic in the int->float C++ casting code.

For the points made about the logic on the python side, I answered some of those in https://github.com/apache/arrow/issues/35584#issuecomment-1562653549

danepitkin commented 1 year ago

+1 to @jorisvandenbossche's comment. They are much more knowledgeable about PyArrow than I am.