SethMMorton / fastnumbers

Super-fast and clean conversions to numbers for Python.
https://pypi.org/project/fastnumbers/
MIT License
105 stars 13 forks source link

Big numbers handling: silencing arbitrary digits beyond least significant bit #69

Closed nopria closed 11 months ago

nopria commented 1 year ago

Describe the bug

fast_real(3.453e78)

returns

3453000000000000090829207640410620202184021047720495530415184879687805108748288

instead of

3453000000000000000000000000000000000000000000000000000000000000000000000000000

Environment (please complete the following information):

Observations

I know that the current behaviour is due to unavoidable numeric representation error, however the result is somewhat "disturbing". IMHO the ideal behaviour should be to round the result to the digit before the least significant bit. It's not a surprise that

math.ulp(3.453e78)

gives

4.113761393303015e+62

which tells us that the last 62 digits of the integer output of fast_real(3.453e78) are, so to say, arbitrary and therefore misleading.

SethMMorton commented 1 year ago

fastnumbers aims to return the same results that python returns, and this is the same result as python:

>>> int(3.453e78)
3453000000000000090829207640410620202184021047720495530415184879687805108748288
nopria commented 1 year ago

It would be nice adding an optional behaviour for fastnumbers to allow overcoming such Python limitations.

Otherwise, and in the meantime, I'll have to use a workaround like the following one, which I guess is not very fast:

>>> from decimal import Decimal
>>> int(Decimal(str(3.453e78)))
3453000000000000000000000000000000000000000000000000000000000000000000000000000
SethMMorton commented 1 year ago

But consider this:

>>> decimal.Decimal(3.453e78)
Decimal('3453000000000000090829207640410620202184021047720495530415184879687805108748288')

When you do e.g. a = 3.453e78, that number is already 3453000000000000090829207640410620202184021047720495530415184879687805108748288.0. So, when you do int(a) , Python is accurately and correctly returning 3453000000000000090829207640410620202184021047720495530415184879687805108748288.

The reason that Decimal(str(3.453e78)) gives what you expect is that Python has a heuristic for writing out string representations of floats that only gives what it anticipates is the user's desired human-readable output, and so the Decimal class only sees the 3 digits you gave.

>>> str(3.453e78)
'3.453e+78'
>>> format(3.453e78, ".0f")
'3453000000000000090829207640410620202184021047720495530415184879687805108748288'

So, when you do fast_real(3.453e78), fastnumbers is actually seeing 3453000000000000090829207640410620202184021047720495530415184879687805108748288.0. For it to return 3453000000000000000000000000000000000000000000000000000000000000000000000000000 instead would mean that it has to guess that you only want to keep the first 4 digits. But, from fastnumbers's perspective, fast_real(3.453e78) is no different from fast_real(3453000000000000090829207640410620202184021047720495530415184879687805108748288.0), and in the latter case it seems like all those digits are important.

I simply don't see how this request can be satisfied as given.


Now, if the request were to make it so that fast_real("3.453e78") returned 3453000000000000000000000000000000000000000000000000000000000000000000000000000, that is possible and could be done.

nopria commented 1 year ago

Yes, making fast_real("3.453e78") return

3453000000000000000000000000000000000000000000000000000000000000000000000000000

it's the first request and I think would be a nice improvement.

About the rest, I understand your explanation and I'm aware that fastnumbers sees the numeric literal 3.453e78 as

3453000000000000090829207640410620202184021047720495530415184879687805108748288

but at the same time I guess (correct me if I'm wrong) that fastnumbers is aware that the least significant bit of the numeric literal 3.453e78 is 4.113761393303015e+62 (given by math.ulp(3.453e78)), so it could be aware that all the last 62 digits of

34530000000000000 90829207640410620202184021047720495530415184879687805108748288

which are precisely those after the added space, are arbitrary and therefore could be set to zero.

If this is the case, I think that an option to let fastnumbers silence arbitrary digits (those beyond the least significant bit) would be a valuable feature.

SethMMorton commented 1 year ago

I'll take a look, but I'm not really sure how one "silences" digits, especially in the python long type. I assume you have an idea?

nopria commented 1 year ago

After some tentatives I came up with the following pure numerical way (in Python), which handles rounding and negative numbers and delivers the same results as decimal.Decimal(str(n)) for the tests I made on big floats:

import math
n=3.453e78
n_sign = (n>0)-(n<0)
digits = 1+int(math.log10(math.ulp(abs(n))))
int_n = int(abs(n))
roundup = 1 if int_n // 10**digits % 10 == 9 else 0 # 1 if least decimal significant digit is 9, else 0
fixed_n = (roundup + int_n // (10 ** digits)) * (10 ** digits) * n_sign

gives

>>> n = 3.453e78
>>> int(n)
3453000000000000090829207640410620202184021047720495530415184879687805108748288
>>> fixed_n
3453000000000000000000000000000000000000000000000000000000000000000000000000000
>>> n = -6.853808e38
>>> int(n)
-685380799999999969504234500079333933056
>>> fixed_n
-685380800000000000000000000000000000000
SethMMorton commented 11 months ago

I've been working on this off and on for the past few months. I have come to the conclusion that this is possible, but it won't be very fast (at least not compared to all the other operations this library does).

The problem is that these "arbitrary digits" only occur for very large numbers, much larger than will fit into a 64-bit integer. That means that under-the-hood, they must be represented in Python's arbitrary-sized integer representation, and as a result raw C++ arithmetic cannot be performed on them - all math must be through Python's C API. Doing this is pretty much as slow as Python.

I tried a hybrid approach where I used C++ types and arithmetic to calculate the number of digits that were "arbitrary", then use Python's round to set them to zero... this was both very slow sometimes does not return the same result as Decimal due to rounding of the internal representation of double that is unrecoverable. So, "post-processing" the result after conversion to integer to remove the digits is not the correct method.

I am going to investigate a method to convert from a string representation to get the results you request. I am imagining an algorithm something like this:

  1. Determine if a string containing a floating point number is "int-like" - if not then do not do any further processing
  2. Split a string where the exponent is defined
  3. Read the part before the exponent as an integer (e.g. ignore the decimal if it exists)
  4. Multiply the resulting Python integer by the exponent number

This should result in an integer that gives identically the same result as Decimal is giving, and should be pretty fast - at least faster that what I have tried so far.

In order to support floats (instead of strings), I will have to first convert the float to a string using something analogous to str(float), then use the above algorithm. This of course is not ideal because it requires a memory allocation and lots of extra work, so will not be overly fast, but it will be more consistent with the spirit of the request.

I'll do some bench-marking when implemented, and think about default behaviors, etc.

SethMMorton commented 11 months ago

try_real and try_forceint now have the denoise options to enable this. See the README for more details on the behavior and caveats, and the profiling data to see how it compares speedwise.

nopria commented 11 months ago

Nice job Seth. To make easier for readers to know about the denoise option, in the documentation at "See README for more details" it would be convenient to have a link to the relevant README section (https://github.com/SethMMorton/fastnumbers/blob/main/README.rst#about-the-denoise-option). Many people might be interested in this new option, which in Python is unique, I guess.