fastfloat / fast_float

Fast and exact implementation of the C++ from_chars functions for number types: 4x to 10x faster than strtod, part of GCC 12, Chromium and WebKit/Safari
Apache License 2.0

Consider Optimizing Decimal Operations #93

Closed Alexhuszagh closed 2 years ago

Alexhuszagh commented 3 years ago

Feature Request

Although the code for parsing decimals is quite nice already, a fair amount of work I've done on lexical (my Rust float parser) shows that when the slow path is invoked, we can get significantly faster performance in a few cases, and a minimal version of this approach can be easy to implement, readable, and correct.

This is due to the use of big-integer arithmetic, and a few algorithms optimized for big-integer math, for parsing decimal strings, which minimizes the number of operations relative to the decimal class in fast_float. There is more code involved; however, this code relies on fewer magic numbers and instead uses simple big-integer algorithms. The only additional static storage required is ~32 64-bit integers, which is considerably smaller than the decimal implementation.

This performance gap is quite noticeable with floats like "8.988465674311580536566680e307", where the performance goes from ~500 ns/iter (lexical) to ~14.465 us/iter (fast_float). For denormal floats or those with negative exponents like "8.442911973260991817129021e-309", lexical is slightly faster, at ~58.073 ns/iter versus ~69.264 ns/iter (fast_float). This also scales very well with input size, up to 768 digits (767, the max for a double-precision float, plus 1 extra) and beyond.

The actual implementation is quite simple: it uses a fairly minimal subset of big-integer arithmetic, which can be implemented in <700 lines of code, and the parsing algorithms built on it are quite easy to understand and implement. If this is of interest, I'd be happy to submit a PR.

Positive Exponent Implementation

Here is a Rust version of the code required to implement this; obviously, this would be translated to C++:

/// Generate the significant digits with a positive exponent relative to mantissa.
pub fn positive_digit_comp<F: RawFloat>(
    mut bigmant: Bigint,
    exponent: i32,
) -> ExtendedFloat80 {
    // Simple, we just need to multiply by the power of the radix.
    // Now, we can calculate the mantissa and the exponent from this.
    // The binary exponent is the binary exponent for the mantissa
    // shifted to the hidden bit.
    bigmant.pow(10, exponent as u32);

    // Get the exact representation of the float from the big integer.
    // `hi64` checks **all** the remaining bits after the mantissa,
    // so it will check if **any** truncated digits exist.
    let (mant, is_truncated) = bigmant.hi64();
    let exp = bigmant.bit_length() as i32 - 64 + F::EXPONENT_BIAS;
    let mut fp = ExtendedFloat80 {
        mant,
        exp,
    };

    // Shift the digits into position and determine if we need to round-up.
    shared::round::<F, _>(&mut fp, |f, s| {
        shared::round_nearest_tie_even(f, s, |is_odd, is_halfway, is_above| {
            is_above || (is_halfway && is_truncated) || (is_odd && is_halfway)
        });
    });
    fp
}

Here, the exponent is relative to the significant digits, and ExtendedFloat80 is just AdjustedMantissa, an 80-bit extended-precision float with a biased exponent. Our bigmant is just the significant digits parsed as a big integer. hi64 is a function that gets the high 64 bits from the big integer (for the significant digits) and checks if any bits below them are truncated (for rounding), and EXPONENT_BIAS is 1075 for a 64-bit float.
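As a concrete illustration, here is a minimal sketch of what `hi64` does, assuming little-endian `u64` limbs (a hypothetical standalone helper, not the exact lexical code):

```rust
/// Extract the high 64 bits of a non-zero, little-endian big integer,
/// normalized so the most significant bit is set, and report whether
/// any lower bits were truncated in the process.
fn hi64(limbs: &[u64]) -> (u64, bool) {
    let n = limbs.len();
    let hi = limbs[n - 1];
    debug_assert!(hi != 0);
    let lz = hi.leading_zeros();
    if lz == 0 {
        // High limb is already normalized: truncated if any lower limb is set.
        (hi, limbs[..n - 1].iter().any(|&x| x != 0))
    } else if n == 1 {
        // Only one limb: shift into place, nothing below to truncate.
        (hi << lz, false)
    } else {
        // Borrow the top bits of the next limb to fill the low bits.
        let lo = limbs[n - 2];
        let mant = (hi << lz) | (lo >> (64 - lz));
        // Bits shifted out of `lo`, plus any deeper limbs, are truncated.
        let truncated = (lo << lz) != 0 || limbs[..n - 2].iter().any(|&x| x != 0);
        (mant, truncated)
    }
}
```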

This implementation is quite simple: first, we get the max power of the radix (in this case, 10) that can be stored in a limb, that is, the largest N such that 10^N <= 2^BITS - 1. For 32-bit limbs this is N = 9; for 64-bit limbs, N = 19. Next, we get the max native integer for this power (10^9 for 32-bit limbs, 10^19 for 64-bit limbs). Then we parse up to N digits at a time into a native limb and fold that limb into the big integer, effectively using the following logic:

let mut counter: usize = 0;
let mut count: usize = 0;
let mut value: Limb = 0;
let step = ...;
let mut result = Bigint::new();

// Parse digits in chunks, folding each full chunk into the big integer.
for &digit in digits {
    // Add our temporary values.
    value *= 10;
    value += digit - 0x30;

    counter += 1;
    count += 1;
    if counter == step {
        result.mul_small(max_native);
        result.add_small(value);
        counter = 0;
        value = 0;
    }

    // Check if we've exhausted our max digits.
    if count == max_digits {
        // Add temporary...
        ...
    }
}
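Filled in with the residual handling, the loop above can be sketched as a runnable function; here a `u128` stands in for the multi-limb `Bigint` so the example stays short (the real code would call `mul_small`/`add_small` instead):

```rust
/// Max decimal digits that fit in a 64-bit limb: 10^19 <= 2^64 - 1.
const STEP: u32 = 19;

/// Accumulate ASCII decimal digits into a big integer, parsing up to
/// STEP digits at a time into a native u64 before each fold.
fn accumulate_digits(digits: &[u8]) -> u128 {
    let mut result: u128 = 0; // stand-in for the multi-limb Bigint
    let mut value: u64 = 0;   // temporary native accumulator
    let mut counter: u32 = 0;
    for &d in digits {
        value = value * 10 + (d - b'0') as u64;
        counter += 1;
        if counter == STEP {
            // result = result * 10^STEP + value, i.e. mul_small + add_small.
            result = result * 10u128.pow(STEP) + value as u128;
            counter = 0;
            value = 0;
        }
    }
    // Fold in the residual digits that did not fill a full step.
    if counter != 0 {
        result = result * 10u128.pow(counter) + value as u128;
    }
    result
}
```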

In total, we need operations for the following:

  1. Scalar add and mul, with the ability to detect overflow or the carry.
  2. Addition and multiplication of a scalar to a big integer.
  3. Grade-school multiplication of two big integers (asymptotically faster algorithms aren't applicable, since the inputs are too small).
  4. An SHL operation for the big integer, shifting both bits and limbs.
  5. An efficient big-integer power algorithm.
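The first two operations in this list can be sketched in a few lines, assuming 64-bit limbs stored little-endian (hypothetical helper names, not the lexical API):

```rust
type Limb = u64;

/// Scalar addition, returning the sum and whether it overflowed (item 1).
fn scalar_add(x: Limb, y: Limb) -> (Limb, bool) {
    x.overflowing_add(y)
}

/// Scalar multiplication via 128-bit widening, returning (lo, hi) (item 1).
fn scalar_mul(x: Limb, y: Limb) -> (Limb, Limb) {
    let p = (x as u128) * (y as u128);
    (p as Limb, (p >> 64) as Limb)
}

/// Add a scalar to a big integer (item 2), propagating the carry
/// through the limbs and growing the integer if it spills over.
fn small_add(limbs: &mut Vec<Limb>, y: Limb) {
    let mut carry = y;
    for limb in limbs.iter_mut() {
        let (sum, overflow) = scalar_add(*limb, carry);
        *limb = sum;
        carry = overflow as Limb;
        if carry == 0 {
            return;
        }
    }
    if carry != 0 {
        limbs.push(carry);
    }
}
```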

The entirety of the big-integer algorithms is <700 lines of code, although my version is extensively documented. The only one here that's remotely interesting is the power algorithm, which uses a single pre-computed large power-of-5 and a small number of pre-computed small powers. This is effectively just:

pub const SMALL_INT_POW5: [u64; 28] = [
    1,
    5,
    25,
    125,
    625,
    3125,
    15625,
    78125,
    390625,
    1953125,
    9765625,
    48828125,
    244140625,
    1220703125,
    6103515625,
    30517578125,
    152587890625,
    762939453125,
    3814697265625,
    19073486328125,
    95367431640625,
    476837158203125,
    2384185791015625,
    11920928955078125,
    59604644775390625,
    298023223876953125,
    1490116119384765625,
    7450580596923828125,
];

/// Pre-computed large power-of-5 for 32-bit limbs.
#[cfg(not(all(target_pointer_width = "64", not(target_arch = "sparc"))))]
pub const LARGE_POW5: [u32; 10] = [
    4279965485, 329373468, 4020270615, 2137533757, 4287402176, 1057042919, 1071430142, 2440757623,
    381945767, 46164893,
];

/// Pre-computed large power-of-5 for 64-bit limbs.
#[cfg(all(target_pointer_width = "64", not(target_arch = "sparc")))]
pub const LARGE_POW5: [u64; 5] = [
    1414648277510068013,
    9180637584431281687,
    4539964771860779200,
    10482974169319127550,
    198276706040285095,
];

/// Step for the large power-of-5 (320 bits holds 5^135 for either limb size).
pub const LARGE_POW5_STEP: u32 = 135;

pub fn pow(x: &mut Vec<Limb>, base: u32, mut exp: u32) {
    let large = &LARGE_POW5;
    let step = LARGE_POW5_STEP;
    while exp >= step {
        large_mul(x, large);
        exp -= step;
    }

    // Now use our pre-computed small powers iteratively.
    let small_step = if LIMB_BITS == 32 {
        13
    } else {
        27
    };
    let max_native = (base as Limb).pow(small_step);
    while exp >= small_step {
        small_mul(x, max_native);
        exp -= small_step;
    }
    if exp != 0 {
        let small_power = SMALL_INT_POW5[exp as usize];
        small_mul(x, small_power as Limb);
    }
}

This is effectively self-explanatory: use large powers of 5 and the large, grade-school multiplication algorithm to minimize the number of multiplications, and then use a small pre-computed table for the remainder. The real power implementation splits this into a power-of-5 multiplication and a left-shift (for the power-of-2).
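That split works because 10^n = 5^n * 2^n: multiply by 5^n via the tables, then apply a cheap left shift of n bits for the power of 2. A `u128` stand-in makes the identity concrete (the real code does the same on the multi-limb integer):

```rust
/// Multiply x by 10^n as a power-of-5 multiply plus a left shift,
/// mirroring the split the real big-integer `pow` performs.
fn pow10_u128(x: u128, n: u32) -> u128 {
    (x * 5u128.pow(n)) << n
}
```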

Negative Exponent Implementation

The negative exponent implementation is likewise simple, but a little different: we need our big mantissa and exponent from before, but we also need an extended-float prior to rounding. In order to make this work with the existing Lemire algorithm, I simply add i16::MIN (-2^15) to the exponent rather than set it to -1, so the real extended-float can be passed over. This requires only trivial modifications to existing code, and it has no performance impact on the faster algorithms.

First, we create a big integer representing b+h, so we can determine if we round to b+u or to b. This is very simple:

/// Calculate `b` from a representation of `b` as a float.
#[inline]
pub fn b<F: RawFloat>(float: F) -> ExtendedFloat80 {
    ExtendedFloat80 {
        mant: float.mantissa().as_u64(),
        exp: float.exponent(),
    }
}

/// Calculate `b+h` from a representation of `b` as a float.
#[inline]
pub fn bh<F: RawFloat>(float: F) -> ExtendedFloat80 {
    let fp = b(float);
    ExtendedFloat80 {
        mant: (fp.mant << 1) + 1,
        exp: fp.exp - 1,
    }
}
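To see why `bh` yields b plus half an ulp: if b = m * 2^e, then (2m + 1) * 2^(e-1) = m * 2^e + 2^(e-1), i.e. b plus half the spacing to the next float. A quick numeric sanity check (with a hypothetical evaluation helper, not part of the code above):

```rust
/// Evaluate mant * 2^exp as an f64, purely for the check below.
fn value(mant: u64, exp: i32) -> f64 {
    mant as f64 * 2f64.powi(exp)
}
```

For example, with m = 6 and e = 2, b = 24 and b+h = (2*6 + 1) * 2^(2-1) = 26, exactly halfway to the next representable value 28.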

Next, we calculate both the numerator and denominator of a ratio representing the float:

/// Generate the significant digits with a negative exponent relative to mantissa.
pub fn negative_digit_comp<F: RawFloat>(
    bigmant: Bigint,
    mut fp: ExtendedFloat80,
    exponent: i32,
) -> ExtendedFloat80 {
    // Ensure our preconditions are valid:
    //  1. The significant digits are not shifted into place.
    debug_assert!(fp.mant & (1 << 63) != 0);

    // Get the significant digits and radix exponent for the real digits.
    let mut real_digits = bigmant;
    let real_exp = exponent;
    debug_assert!(real_exp < 0);

    // Round down our extended-precision float and calculate `b`.
    let mut b = fp;
    shared::round::<F, _>(&mut b, shared::round_down);
    let b = extended_to_float::<F>(b);

    // Get the significant digits and the binary exponent for `b+h`.
    let theor = bh(b);
    let mut theor_digits = Bigint::from_u64(theor.mant);
    let theor_exp = theor.exp;

    // We need to scale the real digits and `b+h` digits to be the same
    // order. We currently have `real_exp`, in `radix`, that needs to be
    // shifted to `theor_digits` (since it is negative), and `theor_exp`
    // to either `theor_digits` or `real_digits` as a power of 2 (since it
    // may be positive or negative). Try to remove as many powers of 2
    // as possible. All values are relative to `theor_digits`, that is,
    // reflect the power you need to multiply `theor_digits` by.
    let binary_exp = theor_exp - real_exp;
    let halfradix_exp = -real_exp;

    if halfradix_exp != 0 {
        theor_digits.pow(5, halfradix_exp as u32);
    }
    if binary_exp > 0 {
        theor_digits.pow(2, binary_exp as u32);
    } else if binary_exp < 0 {
        real_digits.pow(2, (-binary_exp) as u32);
    }
    ...
}
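The scaling logic above can be mirrored with `u128` stand-ins for the big integers, which makes the comparison easy to check by hand: to compare the parsed value d * 10^real_exp against the halfway point t * 2^theor_exp, move 5^(-real_exp) onto the theoretical digits and the net power of 2 onto whichever side needs it (a sketch, not the crate's API):

```rust
use std::cmp::Ordering;

/// Compare d * 10^real_exp against t * 2^theor_exp (real_exp < 0) by
/// scaling both sides to integers, mirroring `negative_digit_comp`.
fn compare_scaled(d: u128, real_exp: i32, t: u128, theor_exp: i32) -> Ordering {
    debug_assert!(real_exp < 0);
    let halfradix_exp = -real_exp;           // powers of 5 moved onto `theor`
    let binary_exp = theor_exp - real_exp;   // net power of 2
    let mut real = d;
    let mut theor = t * 5u128.pow(halfradix_exp as u32);
    if binary_exp > 0 {
        theor <<= binary_exp as u32;
    } else {
        real <<= (-binary_exp) as u32;
    }
    real.cmp(&theor)
}
```

For instance, "5e-1" (0.5) compared against a halfway point of 1 * 2^-1 scales both sides to 5, and they compare equal.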

Finally, we compare these two approximations, and determine if we round-up or down:

pub fn negative_digit_comp<F: RawFloat>(
    bigmant: Bigint,
    mut fp: ExtendedFloat80,
    exponent: i32,
) -> ExtendedFloat80 {
    ...

    // Compare our theoretical and real digits and round nearest, tie even.
    let ord = real_digits.data.cmp(&theor_digits.data);
    shared::round::<F, _>(&mut fp, |f, s| {
        shared::round_nearest_tie_even(f, s, |is_odd, _, _| {
            // Can ignore `is_halfway` and `is_above`, since those were
            // calculated using less significant digits.
            match ord {
                cmp::Ordering::Greater => true,
                cmp::Ordering::Less => false,
                cmp::Ordering::Equal if is_odd => true,
                cmp::Ordering::Equal => false,
            }
        });
    });
    fp
}

Specifics

First, we use 64-bit limbs on platforms where native 128-bit multiplication is supported, or where 64-bit multiplication can extract both the high and low bits efficiently. In practice, this means practically every 64-bit architecture except SPARCv8 and SPARCv9 uses 64-bit limbs. This is detected by compiling the following code with gcc main.c -c -S -O3 -masm=intel:

#include <stdint.h>

struct i128 {
    uint64_t hi;
    uint64_t lo;
};

struct i128 square(uint64_t x, uint64_t y) {
    __int128 prod = (__int128)x * (__int128)y;
    struct i128 z;
    z.hi = (uint64_t)(prod >> 64);
    z.lo = (uint64_t)prod;
    return z;
}

If the compiled code contains a call to __multi3, then the 128-bit multiplication is emulated by GCC. Otherwise, there is native platform support. Architectures with 128-bit support include:

And architectures where 64-bit limbs are more efficient than 32-bit limbs are:

Caveats

This might be slower on 32-bit architectures or those without native 64-bit multiplication support. However, most modern architectures have native 64-bit or 128-bit multiplication.

Proof of Concept Code

A working implementation for the big integer primitives can be found here. The slow path algorithms can be found here, for positive_digit_comp and negative_digit_comp. Note that this exact implementation hasn't been fully tested, but it is a minimal fork of existing lexical code, so comparable code has been battle-tested through nearly 3 years of production use by millions of users.

Benchmarks

The benchmarks were run on the following values, using the fast-float-rust implementation integrated into Rust standard:

// Example large, near-halfway value.
const LARGE: &str = "8.988465674311580536566680e307";
// Example long, large, near-halfway value.
const LARGE_LONG: &str = "8.9884656743115805365666807213050294962762414131308158973971342756154045415486693752413698006024096935349884403114202125541629105369684531108613657287705365884742938136589844238179474556051429647415148697857438797685859063890851407391008830874765563025951597582513936655578157348020066364210154316532161708032e307";
// Example denormal, near-halfway value.
const DENORMAL: &str = "8.442911973260991817129021e-309";
// Example of a long, denormal, near-halfway value.
const DENORMAL_LONG: &str = "2.4703282292062327208828439643411068618252990130716238221279284125033775363510437593264991818081799618989828234772285886546332835517796989819938739800539093906315035659515570226392290858392449105184435931802849936536152500319370457678249219365623669863658480757001585769269903706311928279558551332927834338409351978015531246597263579574622766465272827220056374006485499977096599470454020828166226237857393450736339007967761930577506740176324673600968951340535537458516661134223766678604162159680461914467291840300530057530849048765391711386591646239524912623653881879636239373280423891018672348497668235089863388587925628302755995657524455507255189313690836254779186948667994968324049705821028513185451396213837722826145437693412532098591327667236328124999e-324";

These are meant to represent floats that use positive_digit_comp or negative_digit_comp.

The microbenchmark results are as follows:

core/large              time:   [13.443 us 13.464 us 13.487 us]
core/large_long         time:   [4.7838 us 4.8724 us 4.9711 us]
core/denormal           time:   [65.280 ns 65.744 ns 66.180 ns]
core/denormal_long      time:   [36.856 us 36.905 us 36.960 us]

lexical/large           time:   [371.46 ns 374.19 ns 378.37 ns]
lexical/large_long      time:   [1.1639 us 1.1745 us 1.1889 us]
lexical/denormal        time:   [55.191 ns 55.497 ns 55.901 ns]
lexical/denormal_long   time:   [799.36 ns 818.20 ns 836.18 ns]

As you can see, the big-integer algorithm outperforms the decimal implementation in every case, and performs exceptionally well in a few notable edge cases. This was tested on x86_64, so benchmarks on other architectures may be required (specifically, 32-bit architectures and ARM64, which may make slightly less efficient use of 64-bit limbs).

License

I own all the copyright to the aforementioned code, and am happy to provide it, in a PR, under any license terms, including public domain. So, no licensing issues exist.

lemire commented 3 years ago

@Alexhuszagh Of course a PR is invited!!! That would be a fantastic contribution.

Alexhuszagh commented 3 years ago

OK, I've just finished the minimal implementation (including correctness checks), so I'll start porting this to C++ now, and I'll also submit a PR to the Rust fork once that's done. The benchmarks for the finished implementation are as follows (log scale on the y-axis):

[benchmark charts: denormal and large inputs]

There are a few features that you may wish to cherry-pick, so I'll split the implementation into multiple commits with an explanation of each.

lemire commented 3 years ago

This looks like fantastic work!!!

You can be certain that I will review with care.