Use exponential format for decimals with relatively large scale

declanvk commented 10 months ago

This PR is related to some of the comments in #108, but there is not a specific issue created separately for this. Please let me know if you want me to create an issue first for some discussion.

The first commit of the PR is mostly just adding tests to capture the current behavior.

The second commit is the more important, it adds a different formatting code path to the fmt::Display impl, so that numbers that have a large (absolute) scale relative to their integer representation length will be displayed in exponential format. The second commit message has more detail.

I chose the (abs_int.len() as i64 - self.scale - 1) as the criteria for the cutoff because I didn't want it to trigger only based on the scale. I imagined a scenario where the scale is large (either negative or positive), but it is inflated because the int_val has a large number of digits.

I'm specifically targeting cases where someone could input a small string like 1E100000 and then crash a program by making it do a huge allocation.

Thanks for making and maintaining the bigdecimal crate! We find it very useful

akubera commented 10 months ago

This is much needed, thanks. I think started that at some point, but I can't find the commits 😬 .

Like the other traits, I'd like to break this out into an impl_fmt.rs file. So I'll do that in trunk and rebase your branch onto that to keep accurate attribution.

I'll break your function-scope format_full_scale and format_exponential out into module-scope, so they may be individually, without dependency on EXPONENTIAL_FORMAT_THRESHOLD.

akubera commented 10 months ago

Regarding EXPONENTIAL_FORMAT_THRESHOLD, I think I'll make that a compile-time configurable; like RUST_BIGDECIMAL_DEFAULT_PRECISION.

Most people wont care, but someone might find it useful.

It looks like Python's default decimal formatter only keeps 5 leading zeros before going to exponential form:

>>> Decimal("1e-6")
Decimal('0.000001')
>>> Decimal("1e-7")
Decimal('1E-7')

>>> Decimal("11e-7")
Decimal('0.0000011')
>>> Decimal("11e-8")
Decimal('1.1E-7')

And trailing zeros are precision-dependent, which makes sense as those could indicate accuracy:

>>> Decimal("1e0")
Decimal('1')
>>> Decimal("1e1")
Decimal('1E+1')
>>> Decimal("1e3")
Decimal('1E+3')

>>> Decimal("10e1")
Decimal('1.0E+2')

>>> Decimal("10000e3")
Decimal('1.0000E+7')

__str__ follows same rules:

>>> str(Decimal("1e-6"))
'0.000001'
>>> str(Decimal("1e-7"))
'1E-7'
>>> str(Decimal("100e7"))
'1.00E+9'

akubera commented 10 months ago

I found my feature/impl-fmt branch. I rebased that onto trunk and I'll get your changes in there somewhere.

akubera commented 10 months ago

I applied your changes to the split file instead of src/lib.rs.

exponential-format

Only significant change was moving the format_exponential out of the fmt method, and leaving the other code in the function as-is (i.e. jump to format_exponential if length condition holds, otherwise use the old code)

I may change this back to your two-function approach, and move the whole thing into yet another function that parameterizes the threshold-value, so we can write nicer tests.

akubera commented 10 months ago

One thing to note: I set EXPONENTIAL_FORMAT_THRESHOLD = 2 (small value) and the exponential form is chosen when given a value like 7000:

println!("{}", BigDecimal::from(7000)); -> "7.000E3"

Is that expected behavior?

declanvk commented 10 months ago

Regarding EXPONENTIAL_FORMAT_THRESHOLD, I think I'll make that a compile-time configurable; like RUST_BIGDECIMAL_DEFAULT_PRECISION.

Sounds good to me! That is a cool technique you're using in the build.rs

Is that expected behavior?

This makes sense, but I did not anticipate it ahead of time. Sinceabs_int = "7000" and scale = 0, then we get "7000".len() - 0 > 2 which evaluates to true and goes to the exponential path.

I think this points out a weakness in the condition, I previously assumed that the trailing zeroes would not have this affect.

What is your preference on this? Should I make it count the abs_int.len() - num_trailing_zeroes?

akubera commented 10 months ago

The special case scale == 0 always means the BigDecimal is an integer, and the BigInt may be printed verbatim. So that's one.

akubera commented 10 months ago

Comparing to other libraries:

// Java
import java.math.*;
class BigDecimalTest {
    public static void main(String[] args) {
        String[] strs = new String[]{
            "0.123",
            "0.000001",
            "0.0000001",
            "100",
            "100e3",
            "1000000000",
            "1000000000.00000",
            "1000000000e1",
        };
        for (String s : strs) {
           BigDecimal n = new BigDecimal(s);
           System.out.println(s + " -> " + n);
        }
    }
}

outputs

0.123 -> 0.123
0.000001 -> 0.000001
0.0000001 -> 1E-7
100 -> 100
100e3 -> 1.00E+5
1000000000 -> 1000000000
1000000000.00000 -> 1000000000.00000
1000000000e1 -> 1.000000000E+10

# Python
from decimal import Decimal

strs = [
    "0.123",
    "0.000001",
    "0.0000001",
    "100",
    "100e3",
    "1000000000",
    "1000000000.00000",
    "1000000000e1",
]

for s in strs:
    n = Decimal(s)
    print(f"{s} -> {n}")

0.123 -> 0.123
0.000001 -> 0.000001
0.0000001 -> 1E-7
100 -> 100
100e3 -> 1.00E+5
1000000000 -> 1000000000
1000000000.00000 -> 1000000000.00000
1000000000e1 -> 1.000000000E+10

(good, it's the same)

Ruby:

require 'bigdecimal'

strs = [
    "0.123",
    "0.000001",
    "0.0000001",
    "100",
    "100e3",
    "1000000000",
    "1000000000.00000",
    "1000000000e1",
]

for s in strs do
  n = BigDecimal s
  print(s, " -> ", n, "\n")
end

0.123 -> 0.123e0
0.000001 -> 0.1e-5
0.0000001 -> 0.1e-6
100 -> 0.1e3
100e3 -> 0.1e6
1000000000 -> 0.1e10
1000000000.00000 -> 0.1e10
1000000000e1 -> 0.1e11

...... well I regret trying that one. They ALL start with 0. !?

akubera commented 10 months ago

0.1  ->  0.1
0.10  ->  0.10
0.100  ->  0.100
0.1000  ->  0.1000
...
0.100000000  ->  0.100000000
...
0.100000000000000000000000000  ->  0.100000000000000000000000000

So trailing zeros don't matter.

I think the rule is


// compare scale to zero
match self.scale.cmp(&0) {
   // simply print integer if scale is zero
   Equal => print(abs_int),

   // decimal point is "to the right" of the end of the integer  
   // always print exponential form
   // (example 100e3 => 100xxx. )
   Less => print_exponential_form(...),

   // greater means decimal point is to the left of the end of the integer 
   // we should do exponential form if the decimal point is past the left of
   // the start (significant) digit 
   //       123e-2 => 1.23         (scale - len = 2 - 3 = -1)
   //       123e-5 => 0.00123      (scale - len = 5 - 3 = 2)
   //       123e-9 => 0.000000123  (scale - len = 9 - 3 = 6 > threshold => 1.23e-7)
   _ => if self.scale - abs_int.len() as i64  > THRESHOLD  {
          print_exponential_form(...)
        } else {
          print_full_form(...)
        }
}

akubera commented 10 months ago

Ok: We have decimal with digits

dddddddddd

Only thing that determines exponential is placement of the decimal point. Value or number of digits do not matter:

          dddddddddd0…00.    <- exponential
          dddddddddd.        <- full (no decimal point to be printed)
          ddddd.dddd         <- full
         .dddddddddd         <- full
     .0000dddddddddd         <- full
.0000…0000dddddddddd     <- exponential if n > threshold
 |-- n --|

So I'm going with

// check if scale is outside of "boundary"
if scale < 0 || (scale as usize > abs_int.len() + threshold) {
    exponential
} else {
    full
}

akubera commented 10 months ago

These tests are passing for threshold = 2 👏

    impl_case!(case_0d123: "0.123" => "0.123");
    impl_case!(case_0d0123: "0.0123" => "0.0123");
    impl_case!(case_0d00123: "0.00123" => "0.00123");
    impl_case!(case_0d000123: "0.000123" => "1.23E-4");

    impl_case!(case_123d: "123." => "123");
    impl_case!(case_123de1: "123.e1" => "1.23E3");

test impl_fmt::test_display::case_0d000123 ... ok
test impl_fmt::test_display::case_123d ... ok
test impl_fmt::test_display::case_0d00123 ... ok
test impl_fmt::test_display::case_123de1 ... ok
test impl_fmt::test_display::case_0d123 ... ok
test impl_fmt::test_display::case_0d0123 ... ok

the last one there is almost consistent with Python. Do we force the +?

>>> Decimal("123.e1")
Decimal('1.23E+3')

declanvk commented 10 months ago

Nice work on adjusting the rule! I like the latest iteration, focusing in on the placement of the digits really brings it into focus

the last one there is almost consistent with Python. Do we force the +?

Yeah I think go for it! Looking at the python and java examples, it seems like they always include the sign on the exponent

declanvk commented 10 months ago

I left some nonblocking comments (on my own PR hahaha), let me know what you think! I can go ahead and implement anything, I feel bad making you do all this work!

Thanks again for looking at this

akubera commented 10 months ago

I didn't mean to force push onto your branch, but if you like the changes I'm glad I did. Thanks for your comments, they look good. I'll rebase again to correct the last old-rust compatibility issue.

I removed the updated threshold check, I'm working on redoing that commit to also update the tests.

I'll look at all this again tonight.

declanvk commented 10 months ago

I didn't mean to force push onto your branch, but if you like the changes I'm glad I did.

I'm glad you did! The changes you made look good

akubera commented 10 months ago

I'm about ready to call this done. I wrote a little code-writing python script to generate unit-test mods so the Rust bigdecimal formatting of {:XYZ} aligns with the same format in Python.

akubera commented 8 months ago

This was merged and released last night in 0.4.3.

I forgot to use the GitHub ui, oops. To make it look merged I'm going to merge it now into trunk and then force push updates away.

declanvk commented 8 months ago

Thanks for all your work on this @akubera !

akubera commented 8 months ago

'twas a long time coming. Let me know right away if there's any problems!

akubera / bigdecimal-rs

Use exponential format for decimals with relatively large scale #121