Open danepitkin opened 1 year ago
hi @danepitkin - I mentioned some of this in comment thread for Issue 35584.
Quick question: what was the design decision that drove computing IEEE 754 compliance at the python layer, when the arrow C++ runtime can already be relied upon to determine whether or not a given int64 can be safely converted (i.e. with a static casting check, etc.)
I believe the original implementation of the python-to-arrow functionality many years ago was supposed to be a serialization/deserialization like pickle. Nowadays, I would say it's primarily a legacy implementation that eventually should be rewritten to take advantage of the compute kernels in C++ if possible.
Hmm. I don't see much value in putting in a bandaid in helpers.py that allows you to safely convert the appropriate numbers above 2^53. If there is a sufficient design reason why Arrow doesn't support unsafe float conversions, it should be noted before anyone takes a stab at this. If there isn't a sufficient reason, then we should just cut to the chase and let the C++ side convert it and return it (or slightly lazier, just let python convert it, I think it's just a C double under the hood)
My two cents: people using floats this large should probably be aware of how floating point works, and proceed at their own risk with respect to the precision loss and/or rounding, rather than having the hammer thrown at them, even for safe values (as is currently the case).
Let's keep this discussion focused on the actual logic in the int->float C++ casting code.
For the points made about the logic on the python side, I answered some of those in https://github.com/apache/arrow/issues/35584#issuecomment-1562653549
+1 to @jorisvandenbossche 's comment. They are much more knowledgeable on PyArrow than I am.
Describe the enhancement requested
See discussion found in https://github.com/apache/arrow/issues/34901.
To summarize, safe casting of int64 to float64 for numerics greater than 2^53 is currently not supported in Arrow. However, according to the Double-precision floating point format[1] defined in the IEEE 754-2008 standard, there are certain integers that can be safely cast to floating point. From 2^54 to 2^55, the representable numbers are every 2^1. From 2^54 to 2^55, the representable numbers are every 2^2. From 2^55 to 2^56, the representable numbers are every 2^3, etc.
For example, integer 18014398509481984 can be safely cast between int64 and float64 according to the IEEE 754-2008 standard, but not in arrow. (However, integer 18014398509481983 CANNOT be cast safely).
[1] https://en.wikipedia.org/wiki/Double-precision_floating-point_format
Component(s)
C++