Alexhuszagh / rust-lexical

Fast numeric to- and from-string conversion routines.

[FEATURE] Add support for `f128::f128` and `half::{ f16, bf16 }` #46

Closed Evrey closed 1 month ago

Evrey commented 4 years ago

Problem

Currently, lexical-core can only parse f32 and f64, but support for more number formats than Rust offers would be nice, especially for designers of programming languages.

Solution

Offer feature-gated default impls for f128 using the f128 crate, and for f16 and bf16 using the half crate.

Prerequisites

Alternatives

Don't see any beyond »let's not«.

Additional Context

Rust has u128, as having it for e.g. crypto is convenient, despite no mainstream CPU having 128-bit integer arithmetic and registers. f16 is very often used, e.g. in GPU code, bf16 specifically in neural network code, and f128 also finds some use here and there.

There's also 8-bit floats, though not IEEE-standardised, and there's IEEE 754 binary256. However, I know of no handy softfloat crates for these.

As lexical-core aims to be a proglang-agnostic number parser, i.e. not tied to Rust formats and types, I see no reason to completely restrict oneself to just the built-in Rust machine types.

myrrlyn commented 3 years ago

I don't know nearly enough about either the parse algorithms or the bit layout of these types to begin to take this on myself. I'll reach out to Alex and see if he has the time or capability to weigh in.

Alexhuszagh commented 3 years ago

@myrrlyn This would be doable for sure. The only issue is that the number of significant digits would have to increase dramatically for the arbitrary-precision arithmetic. Currently, in base 10, we need 768 digits to accurately represent a 64-bit float for rounding.

The formula, which you can find in comments in my source code, is as follows:

///
/// `−emin + p2 + ⌊(emin + 1) log(2, b) − log(1 − 2^(−p2), b)⌋`
///
/// For f32, this follows as:
///     emin = -126
///     p2 = 24
///
/// For f64, this follows as:
///     emin = -1022
///     p2 = 53

For f128, the exponent size is changed from 11 bits to 15 bits, so we now have the following numbers:

/// For f128, this follows as:
///    emin = −16382
///    p2 = 113

This would then require 11563 digits for a 128-bit float, rather than 769 for a 64-bit float, which is why I didn't implement it originally.

Alexhuszagh commented 3 years ago

The comments in the code documenting all of this can be found here.

The formula so you can test it on your own is here:

import math

def digits(emin, p2, b):
    return -emin + p2 + math.floor((emin + 1) * math.log(2, b) - math.log(1 - 2**(-p2), b))

Where in the above case, emin = -16382, p2 = 113, and b = 10.
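For convenience, here's the same formula as a self-contained Rust sketch (illustrative only; the asserted values are the raw formula outputs, and the storage comments later in this thread reserve one extra digit on top of them):

fn digits(emin: i32, p2: i32, b: f64) -> i64 {
    let (emin, p2) = (emin as f64, p2 as f64);
    (-emin + p2 + ((emin + 1.0) * 2f64.log(b) - (1.0 - 2f64.powf(-p2)).log(b)).floor()) as i64
}

fn main() {
    assert_eq!(digits(-126, 24, 10.0), 112);      // f32
    assert_eq!(digits(-1022, 53, 10.0), 767);     // f64
    assert_eq!(digits(-16382, 113, 10.0), 11563); // f128
}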

Alexhuszagh commented 3 years ago

Support for f16 and bf16 should be trivial, however. For f128, however, I'm very worried about the following, so it would never be the default (maybe an option, like radix?).

https://github.com/Alexhuszagh/rust-lexical/blob/905cdbaa7ac9f6631da8ed34a41afbcf74520f13/lexical-core/src/atof/algorithm/bhcomp.rs#L55-L103

This is because, with 767 mantissa digits + 324 exponent digits, it's fairly trivial to stack-allocate everything: that's ~1091 digits, or ~3600 bits. Meanwhile, with an f128, we'd need 11563 mantissa digits + 4932 exponent digits, or ~55,000 bits. We're not stack-allocating 7KB just for a single operation; that's totally doable using a Vec, but would require an allocator.

EDIT:

This might be mitigated a bit if we change to use an associated type in the Float trait for the size. We currently just overestimate because it's really not a big deal, since at max we're dealing with like 800 bytes. This might have a few benefits outside of just the specific f128 case.

Currently, our two storage options (for different correctness algorithms, which have speed tradeoffs) are as follows:

https://github.com/Alexhuszagh/rust-lexical/blob/master/lexical-core/src/atof/algorithm/bignum.rs#L9-L21 https://github.com/Alexhuszagh/rust-lexical/blob/master/lexical-core/src/atof/algorithm/bignum.rs#L118-L124

If we make it an associated type for the Float trait, we could easily lower these values for f32, keep them the same for f64, and use a Vec for f128. This should avoid any performance penalties for the simpler cases while allowing an f128 case, without forcing everything onto a dynamically-allocated Vec.
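A rough sketch of that idea (names here are illustrative, not lexical-core's actual API): each float type selects its own limb storage, so the small floats keep fixed stack buffers while a future f128 could opt into heap allocation.

trait StorageFor {
    /// Little-endian limb storage for the big-integer representation.
    type BigintStorage: AsRef<[u32]> + AsMut<[u32]>;
}

struct F32Marker;
struct F64Marker;

impl StorageFor for F32Marker {
    // f32 needs at most ~158 decimal digits: a small fixed buffer suffices.
    type BigintStorage = [u32; 32];
}

impl StorageFor for F64Marker {
    // f64 needs up to ~1092 digits (~3600 bits), still fine on the stack.
    type BigintStorage = [u32; 128];
}

// A hypothetical f128 impl would pick Vec<u32> instead, trading one heap
// allocation for not consuming ~7KB of stack:
// impl StorageFor for F128Marker { type BigintStorage = Vec<u32>; }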

Evrey commented 3 years ago

Given how often f16, and how rarely f128, is used out in the wild, I at least would be totally fine with just going f16/bf16 for now. That stack memory usage for f128 does sound really expensive. That said, I do like the associated-type solution.

Alexhuszagh commented 3 years ago

@Evrey What implementations do you want for f16, bf16, and f128? I see the following:

I don't see a Rust version of f128 that doesn't rely on GCC intrinsics, which I'd rather avoid. It might definitely be possible to write our own library based off of the LLVM intrinsics, but that's rather low-level and I'd rather avoid that for now (and that would be beyond the scope of lexical, and be an external dependency).

In the meantime, I will be feature-gating the code with features that don't (yet) exist, adding better documentation in the comments for the required storage for each type, and adding associated types for the storage required.

Alexhuszagh commented 3 years ago

@RazrFalcon The preliminary logic for all of this is implemented; however, I need concrete types for anything more significant.

The following additions have been made, with appropriate types, and should make everything work out-of-the-box.

In Float:

I then re-implemented Bigint and Bigfloat in terms of these associated types (right now, the *_LIMBS constants are just decorative until later versions, when we can use them in ArrayVec directly).

I also added extensive comments to ensure the choices for these sizes were clear:

/// The minimum, denormal exponent can be calculated as follows: given
/// the number of exponent bits `exp_bits`, and the number of bits
/// in the mantissa `mantissa_bits`, we have an exponent bias
/// `exp_bias` equal to `2^(exp_bits-1) - 1 + mantissa_bits`. We
/// therefore have a denormal exponent `denormal_exp` equal to
/// `1 - exp_bias` and the minimum, denormal float `min_float` is
/// therefore `2^denormal_exp`.
///
/// For f16, this follows as:
///     exp_bits = 5
///     mantissa_bits = 10
///     exp_bias = 25
///     denormal_exp = -24
///     min_float = 5.96 * 10^−8
///
/// For bfloat16, this follows as:
///     exp_bits = 8
///     mantissa_bits = 7
///     exp_bias = 134
///     denormal_exp = -133
///     min_float = 9.18 * 10^−41
///
/// For f32, this follows as:
///     exp_bits = 8
///     mantissa_bits = 23
///     exp_bias = 150
///     denormal_exp = -149
///     min_float = 1.40 * 10^−45
///
/// For f64, this follows as:
///     exp_bits = 11
///     mantissa_bits = 52
///     exp_bias = 1075
///     denormal_exp = -1074
///     min_float = 5.00 * 10^−324
///
/// For f128, this follows as:
///     exp_bits = 15
///     mantissa_bits = 112
///     exp_bias = 16495
///     denormal_exp = -16494
///     min_float = 6.48 * 10^−4966
pub(super) fn max_digits<F>(radix: u32)
    -> Option<usize>
    where F: Float
{
    match F::BITS {
        32 => max_digits_f32(radix),
        64 => max_digits_f64(radix),
        _  => unreachable!(),
    }
}

/// Storage for a big integer type.
///
/// This is used for the bhcomp::large_atof and bhcomp::small_atof
/// algorithms. Specifically, it stores all the significant digits
/// scaled to the proper exponent, as an integral type,
/// and then directly compares these digits.
///
/// This requires us to store the number of significant bits, plus the
/// number of exponent bits (required) since we scale everything
/// to the same exponent.
/// This therefore only needs the following number of digits to
/// determine the correct representation (the algorithm can be found in
/// `max_digits` in `bhcomp.rs`):
///  * `bfloat16` - 138
///  * `f16`      - 29
///  * `f32`      - 158
///  * `f64`      - 1092
///  * `f128`     - 16530
#[derive(Clone, PartialEq, Eq)]
#[cfg_attr(test, derive(Debug))]
pub(crate) struct Bigint<F: Float> {
    /// Internal storage for the Bigint, in little-endian order.
    pub(crate) data: F::BigintStorage,
}
/// Storage for a big floating-point type.
///
/// This is used for the bigcomp::atof algorithm, which creates a
/// representation of `b+h` and the float scaled into the range `[1, 10)`.
/// This therefore only needs the following number of digits to
/// determine the correct representation (the algorithm can be found in
/// `max_digits` in `bhcomp.rs`):
///  * `bfloat16` - 97
///  * `f16`      - 22
///  * `f32`      - 113
///  * `f64`      - 768
///  * `f128`     - 11564
#[derive(Clone, PartialEq, Eq)]
#[cfg_attr(test, derive(Debug))]
pub struct Bigfloat<F: Float> {
    /// Internal storage for the Bigfloat, in little-endian order.
    ///
    /// Enough storage for up to 10^345, which is 2^1146, or more than
    /// the max for f64.
    pub(crate) data: F::BigfloatStorage,
    /// It also makes sense to store an exponent, since this simplifies
    /// normalizing and powers of 2.
    pub(crate) exp: i32,
}
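
As a quick sanity check, the bias formula quoted at the top of that comment block can be verified with a few lines (an illustrative snippet, not part of lexical-core); it reproduces every denormal exponent listed above:

fn denormal_exp(exp_bits: i32, mantissa_bits: i32) -> i32 {
    // exp_bias = 2^(exp_bits - 1) - 1 + mantissa_bits, per the comment.
    let exp_bias = (1 << (exp_bits - 1)) - 1 + mantissa_bits;
    1 - exp_bias
}

fn main() {
    assert_eq!(denormal_exp(5, 10), -24);       // f16
    assert_eq!(denormal_exp(8, 7), -133);       // bfloat16
    assert_eq!(denormal_exp(8, 23), -149);      // f32
    assert_eq!(denormal_exp(11, 52), -1074);    // f64
    assert_eq!(denormal_exp(15, 112), -16494);  // f128
}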

This should simplify maintainability while also adding support for these types later down the road. If all checks pass, I'll merge this branch into master and keep this branch until I get concrete implementation details on the floats we want to use.

Evrey commented 3 years ago

@Alexhuszagh I'm fine with not having f128 for now. It'd be nice for symmetry, but alas, it is rarely used/needed, unlike f16/bf16. half is really the only impl for that I know of, so it's the one I suggested. But if you happen to know a better one (according to whichever criteria), then have at it. Maybe multiple impls could be added if demand is there. I just need it for compiler-design stuff, so I mostly treat f16 as a storage format.

Alexhuszagh commented 3 years ago

@Evrey I've just looked at the half crate, and there is no support for any mathematical operations, which prevents the fast path from being implemented. In order to implement this then, we would need to:

  1. Create an f16, bf16, and f128 type using LLVM intrinsics (if possible).
  2. Ensure there are conversion operations to and from half in this crate if the half feature is enabled.
  3. Implement parsing in terms of these types.

I'm not exactly the world's best maintainer, so I would likely ask a few, well-trusted maintainers to ensure the crate is well-maintained (like with lexical) in the case my mental health deteriorates.

Evrey commented 3 years ago

Uff, okay, that's a lot of work. D: Sounds like effort is instead best put into preparing for more types, then, and pushing whatever f16 RFC may exist in that RFC ocean?

E: Btw., while GPUs do f16 math in hardware, CPUs rarely implement actual f16 math, and instead just offer hardware type conversion to/from f32. Perhaps these »fast paths« are doable with explicit f32 casts?

Alexhuszagh commented 3 years ago

@RazrFalcon I believe LLVM supports all of the ones we want via its intrinsics. Not everything, so it would have to be feature-gated and implemented on supported hardware.

Here's documentation from LLVM on f16 and bf16 and f128. This would likely have to be unstable, since rustc inherently checks for any LLVM intrinsics. The other, fallback solution would simply be software emulation, which could be decently fast and not too difficult. The only reason this would be doable without too much work is that, I believe, there's a pretty big body of permissively licensed work with robust implementations of this.

I would likely be using llvmint-like wrappers for this (since llvmint doesn't seem to be maintained).

Alexhuszagh commented 3 years ago

Ok I've taken a closer look at this, and it seems we have the following cases:

References for f16, bf16, and _Float16. References for f128. References for ARM half-precision types. References for x86 F16C extensions. References for x86 AVX-512 extensions for BF16.

So, after a careful look, this seems a lot more trivial than initially imagined:

  1. Support for f16, bf16 using float promotion and truncation. This is extremely trivial, since all mathematical operations are pretty much lossless, without any need for guard digits. The fast path would then be pretty simple, and the slower paths would likewise be trivial.
  2. Support for _Float16 using either software-emulated floats or the LLVM intrinsics. This would lead to correct results (when supported by LLVM) for 16-bit float conversions.
  3. Support for f128 using LLVM intrinsics or software-emulated floats. This should have perfect accuracy in all cases.

In terms of accuracy... f16 and bf16 would only have rounding issues with the following part of the fast-path algorithm, due to the lack of guard digits in this operation. Since it would be promoted to an f32, processed, and then truncated, it could lead to errors of a few ULP (not sure, I'd assume at max 1 but I'd have to test):

float.pow(radix, exponent)
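
For illustration, here's a minimal sketch of the promotion-and-truncation idea from case 1, using the half crate (a hypothetical helper, not lexical's actual code path):

use half::f16;

/// Parse a half-precision value by parsing at f32 precision, then
/// truncating. The final f32 -> f16 conversion is handled by the half
/// crate; the few-ULP concern above applies to intermediate steps
/// like the pow call in the fast path.
fn parse_f16_via_f32(s: &str) -> Option<f16> {
    let wide: f32 = s.parse().ok()?;
    Some(f16::from_f32(wide))
}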

I'm creating an initial, proof-of-concept crate now for support for these types. It's extremely preliminary, and currently private, but I'm working on it now.

Alexhuszagh commented 3 years ago

Ok, I've thought about it a bit more carefully, and as long as we implement the logic to and from f32 properly, no loss of accuracy is possible, since the fast path operates if, and only if, the mantissa is capable of completely storing the digits (including digits being shifted from the exponent into the mantissa).

This means, for the fast path, with:

This means the fast path will only be valid for exponents [-4, 7] for f16, and [-3, 5] for bf16 where the mantissa is not truncated during parsing. This is trivial to test accuracy for: we should not have any issues here.

The code to generate these limits, in case you're wondering, can be found in the documentation:

https://github.com/Alexhuszagh/rust-lexical/blob/7010efd5e144faff0c9afbc1b22b47d9d9567966/lexical-core/src/table/decimal.rs#L86-L157
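
For intuition, the limits can also be reproduced with a short sketch (an assumed derivation that matches the quoted limits, not the linked code): 10^e is exactly representable while 5^e fits in the mantissa, and the upper bound gains floor(p * log10(2)) extra digits that can be folded into the mantissa.

// p = mantissa bits, including the implicit bit (11 for f16, 8 for bf16).
fn fast_path_limits(p: u32) -> (i32, i32) {
    // Largest e such that 5^e <= 2^p, i.e. 10^e is exactly representable.
    let (mut exact, mut pow5) = (0, 1u64);
    while pow5 * 5 <= (1u64 << p) {
        pow5 *= 5;
        exact += 1;
    }
    // Extra decimal digits that always fit in p bits.
    let extra = (p as f64 * 2f64.log10()).floor() as i32;
    (-exact, exact + extra)
}

fn main() {
    assert_eq!(fast_path_limits(11), (-4, 7)); // f16
    assert_eq!(fast_path_limits(8), (-3, 5));  // bf16
}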

Alexhuszagh commented 3 years ago

On a separate note, I believe I've removed all the hard-coded logic internally to support f128, removing all the assumptions about an f64 mantissa used to simplify some trait impls. I needed to refactor this anyway, but it means we have most of the initial barriers removed for support for f16, bf16, and f128 (all will be feature-gated: f16 and bf16 under the feature = "f16" gate, and f128 under the feature = "f128" gate).

Evrey commented 3 years ago

This is exciting news!

Alexhuszagh commented 3 years ago

OK, I've been looking at this carefully and am currently implementing bf16 and f16 support. However, I don't see f128 support coming any time soon, just because of the complexity of the algorithms required: it breaks most of the existing logic (for parsing, it would require large pre-computed Bellerophon tables; for the slow path, it would require a massive big integer, which would require a system allocator). Likewise, I've removed a lot of the more complicated algorithms like Karatsuba multiplication because they scale poorly for f64 and smaller values, but they're asymptotically better for f128, where they would likely make a difference.

The current number of bits required to parse a f128 correctly would be:

let max_digits = 11564;
let min_denormal_exp = 4966;

So, we'd need enough bits to store 10^16530: at ~3.32 bits per decimal digit, that's ~55,000 bits, or ~7KB. This absolutely requires a system allocator, since we're in danger of overflowing the stack just by that on musl libc. The pre-computed Bellerophon powers wouldn't be too bad for decimal strings (~76 u128); however, the slow path would absolutely be a pain currently.

The float-writing algorithms would be a lot easier, but would require support for 256-bit integer math (not too difficult to do, since we have native 128-bit math).

I'll be implementing f16 and bf16 logic, crafting a release, and then consider implementing f128 logic. This will likely first only support float writing, then later, float parsing, but the latter is unlikely to occur in the near future.

Alexhuszagh commented 3 years ago

Ok full support for f16 and bf16 has been implemented, with minimal types bf16 and f16 being publicly exported in lexical-core and lexical. These have to_bits() and from_bits() which take a u16 type, for easy conversion to-and-from other half-precision floating point libraries, without having to rely on any external dependencies.

This can be enabled using the f16 feature as of lexical v6.0.0 and lexical-core v0.8.0.
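
A brief interop sketch with the half crate via the raw-bits API described above (assuming the f16 feature is enabled and the types are exported at the crate root as stated; to_half is a hypothetical helper):

fn to_half(x: lexical_core::f16) -> half::f16 {
    // Both types expose to_bits()/from_bits() over a u16, so conversion
    // is a lossless bit-level round trip.
    half::f16::from_bits(x.to_bits())
}

fn main() {
    let parsed: lexical_core::f16 = lexical_core::parse(b"1.5").unwrap();
    println!("{}", to_half(parsed));
}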

Evrey commented 3 years ago

What a lovely bulk-update!

Alexhuszagh commented 1 month ago

Linking this to #93.

Alexhuszagh commented 1 month ago

I'm going to drop considering support for f128, because there are a lot of optimizations that are effectively impossible, so using a more naive library would be ideal. It would only really make sense when using an antiquated algorithm like Grisu for floating-point printing, or slow algorithms that have already been removed from general support, which I have no plan on supporting right now, given the preconditions and code changes required to implement this.

Also, 128-bit floats aren't commonly used like the bf16 and f16 variants, and I know of practically no support for them outside of one implementation that requires linking to glibc.

Evrey commented 1 month ago

Good reasoning, and still a big win on half floats. Thanks for all the effort!