aldanor / fast-float-rust

Super-fast float parser in Rust (now part of Rust core)
https://docs.rs/fast-float
Apache License 2.0

Consider integration in core and std ? #15

Closed Urgau closed 3 years ago

Urgau commented 3 years ago

Have you considered replacing the dec2flt algorithm in the standard library with your crate?

I remember having seen several members of the Rust Libs Team encouraging people who have better algorithms (faster, more efficient, ...) to integrate their crates into the std library.

Source of dec2flt

aldanor commented 3 years ago

@Urgau I remember this has been already discussed somewhere in rustc GH issues and as I was searching for it, I noticed a related question has been raised again just yesterday - see https://github.com/rust-lang/rust/issues/85198 for further details.

One concern to keep in mind: binary size. Both dec2flt and the algorithm used here use lookup tables, but dec2flt will be a bit leaner, I believe.

Urgau commented 3 years ago

Yeah, I saw this issue; it's the primary reason why I searched for a crate that does really fast float parsing and asked you that question.

One concern to keep in mind: binary size. Both dec2flt and the algorithm used here use lookup tables, but dec2flt will be a bit leaner, I believe.

I know that for certain targets, like embedded devices, binary size is a big deal, but for the vast majority of users who run Rust on computers, it isn't. I think most people would accept a +10/20 KiB increase (just a guess, not an actual number) in exchange for a blazing performance improvement in float parsing; for size-constrained environments, a cfg with -Zbuild-std could disable the fast algorithm and fall back to a slower one in exchange for a smaller binary.

If binary size is the only major concern, I think you should consider making a pull request to the rust repository.

lemire commented 3 years ago

I think that what @Urgau wrote makes sense. The binary size can be important, but that's something one can tweak depending on one's needs.

Alexhuszagh commented 3 years ago

@aldanor I'm working on gradual integrations of faster float-parsing algorithms into dec2flt, although obviously, faster algorithms would be preferable if applicable (the benchmarks for this look really promising).

As detailed in this internals post, I'm looking to fix things in the following order:

My goal is gradual change, since it's much easier to justify changes that make minimal difference to binary size while bringing major overall performance improvements. Especially since @hanna-kruppe, the author of the float-parsing algorithms, has retired from Rust development, meaning the people reviewing these changes are less familiar with the codebase and with float-parsing algorithms in general.

Any feedback however would be greatly appreciated.

EDIT: In conversations a few years ago with @hanna-kruppe, it was very clear that binary size was important, and there was at the time a tepid response to moving away from Big32x40 (which can only hold ~385 digits, insufficient for a lot of floating-point cases).

Alexhuszagh commented 3 years ago

I should add, I've got a working repository for implementing these changes into Rust's core library here, and the different branches provide improvements to different components of dec2flt.

I'll be working on slow-path algorithms later, and my goal is a quick but comprehensive fix of dec2flt in a way that should minimize the number of changes made initially, which hopefully should ease the burden of integrating these changes. I'm wholly amenable to making small, incremental changes and then having all improvements replaced by integration of this library, and if I can help in any way, let me know.

lemire commented 3 years ago

@Alexhuszagh

One point to consider is that the current library can effectively act as a fast path. That is, there is no need to throw away anything but maybe an existing fast path. So it can be non-invasive (and also, easily made optional).

In the Go runtime, the equivalent of the code in this repo is running. All they did was remove an existing fast path and replace it with a new one.

It is also relatively thin. Here is the gist of the code...


pub fn compute_float<F: Float>(q: i64, mut w: u64) -> AdjustedMantissa {
    // Sentinel results: zero, infinity, and "fall back to a slower path".
    let am_zero = AdjustedMantissa::zero_pow2(0);
    let am_inf = AdjustedMantissa::zero_pow2(F::INFINITE_POWER);
    let am_error = AdjustedMantissa::zero_pow2(-1);

    // Short-circuit exponents outside the representable range.
    if w == 0 || q < F::SMALLEST_POWER_OF_TEN as i64 {
        return am_zero;
    } else if q > F::LARGEST_POWER_OF_TEN as i64 {
        return am_inf;
    }
    // Normalize the significand so the 128-bit product below is full-width.
    let lz = w.leading_zeros();
    w <<= lz;
    // Approximate w * 10^q as a 128-bit product against a power-of-five table.
    let (lo, hi) = compute_product_approx(q, w, F::MANTISSA_EXPLICIT_BITS + 3);
    if lo == 0xFFFF_FFFF_FFFF_FFFF {
        // The truncated product may be off in the last bit; continuing is only
        // safe when the power of five is exact, i.e. q in [-27, 55].
        let inside_safe_exponent = (q >= -27) && (q <= 55);
        if !inside_safe_exponent {
            return am_error;
        }
    }
    let upperbit = (hi >> 63) as i32;
    let mut mantissa = hi >> (upperbit + 64 - F::MANTISSA_EXPLICIT_BITS as i32 - 3);
    let mut power2 = power(q as i32) + upperbit - lz as i32 - F::MINIMUM_EXPONENT;
    if power2 <= 0 {
        // Subnormal result (or underflow to zero): shift, round, return early.
        if -power2 + 1 >= 64 {
            return am_zero;
        }
        mantissa >>= -power2 + 1;
        // Shift out the round bit, rounding up if it was set.
        mantissa += mantissa & 1;
        mantissa >>= 1;
        // Rounding may have promoted the value to the smallest normal float.
        power2 = (mantissa >= (1_u64 << F::MANTISSA_EXPLICIT_BITS)) as i32;
        return AdjustedMantissa { mantissa, power2 };
    }
    // Round-ties-to-even: clear the low bit when the value is exactly halfway.
    if lo <= 1
        && q >= F::MIN_EXPONENT_ROUND_TO_EVEN as i64
        && q <= F::MAX_EXPONENT_ROUND_TO_EVEN as i64
        && mantissa & 3 == 1
        && (mantissa << (upperbit + 64 - F::MANTISSA_EXPLICIT_BITS as i32 - 3)) == hi
    {
        mantissa &= !1_u64;
    }
    // Shift out the round bit, rounding up if it was set.
    mantissa += mantissa & 1;
    mantissa >>= 1;
    // A rounding carry may overflow the mantissa; bump the exponent if so.
    if mantissa >= (2_u64 << F::MANTISSA_EXPLICIT_BITS) {
        mantissa = 1_u64 << F::MANTISSA_EXPLICIT_BITS;
        power2 += 1;
    }
    // Drop the implicit leading bit and check for overflow to infinity.
    mantissa &= !(1_u64 << F::MANTISSA_EXPLICIT_BITS);
    if power2 >= F::INFINITE_POWER {
        return am_inf;
    }
    AdjustedMantissa { mantissa, power2 }
}
Alexhuszagh commented 3 years ago

@lemire I've mostly been using the term "moderate" path for that, to disambiguate from the term "fast path" used in dec2flt. Since the above uses an extended-precision representation of the significant digits of the float, it would replace the Bellerophon algorithm, which is used when fast_path cannot be. Rust's dec2flt has a function named fast_path which only works in cases where both the significant digits and the exponent can be exactly represented as native floats (see below). In my opinion, the whole thing should be thrown out and replaced with the above code sample, but I'm working on providing minimal changes with major performance tweaks in the meantime.

https://github.com/rust-lang/rust/blob/28e2b29b8952485679367cc05699fb5154f4e5c3/library/core/src/num/dec2flt/algorithm.rs#L109-L144
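To make the distinction concrete, here's a minimal sketch of the Clinger-style fast path being described (an illustration of the idea, not Rust's actual dec2flt code; a real implementation would use a lookup table of exact powers of ten rather than powi, and the 53-bit/22-power bounds here are the f64-specific assumptions):

```rust
// Clinger's fast path, sketched: if the decimal mantissa fits exactly in an
// f64 and the power of ten is itself exactly representable, a single
// correctly-rounded multiply or divide yields the correctly-rounded result.
fn clinger_fast_path(mantissa: u64, exp10: i32) -> Option<f64> {
    // 10^0 through 10^22 are exactly representable as f64.
    const MAX_EXACT_POW10: i32 = 22;
    // f64 has 53 bits of significand; larger mantissas may not convert exactly.
    if mantissa > (1u64 << 53) || exp10.abs() > MAX_EXACT_POW10 {
        return None; // defer to a slower, fully general path
    }
    let m = mantissa as f64; // exact conversion
    let pow10 = 10f64.powi(exp10.abs()); // exact for |exp10| <= 22
    Some(if exp10 >= 0 { m * pow10 } else { m / pow10 })
}

fn main() {
    assert_eq!(clinger_fast_path(1234, -2), Some(12.34));
    assert_eq!(clinger_fast_path(1, 30), None); // exponent too large
    println!("fast path ok");
}
```

The point of the "moderate" path above is precisely to handle the `None` cases here without resorting to big integers.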

Alexhuszagh commented 3 years ago

Oh, I should probably add: Rust's dec2flt doesn't parse the first 19-20 significant digits of the float except in the fast path, which is a major performance issue and trivial to fix (and then your algorithm could be implemented into Rust's core library). The fast path just checks whether there are fewer than 16 digits in the float's integral and fractional components, and then whether those significant digits can be exactly represented as a native float.

If not, it defaults to creating a big integer from the significant digits, and then calls its unoptimized Bellerophon algorithm, which can lead to extreme performance hits for no good reason.

In addition, the subsequent algorithm_r ideally prefers a value of z0 that is within 1 ULP of the correct value. This is trivial to implement, using your above algorithm, but would of course have to be modified for that purpose.

lemire commented 3 years ago

@Alexhuszagh

Right.

You have Clinger's fast path which is very cheap and handles things like (small) integers and so forth.

Otherwise, what you want to do is to grab, say, the first 19 digits. Most of the time, numbers won't have more than 17 digits. If you have no more than 19 digits, then the code above is good enough 99.99999% of the time, and even if you have more than 19 digits, the truncation is good enough almost all of the time. See section 11... https://arxiv.org/pdf/2101.11408.pdf

So it should be uncommon that you ever need to materialize a big integer.
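The digit-grabbing step described above can be sketched as follows (illustrative only; parse_digits and its signature are assumptions for exposition, not the crate's actual code). Since 10^19 - 1 < 2^64, any run of 19 decimal digits always fits in a u64:

```rust
// Accumulate up to 19 decimal digits into a u64, recording how many digits
// were dropped (they only adjust the exponent) and whether truncation occurred.
fn parse_digits(s: &[u8]) -> (u64, i32, bool) {
    let mut mantissa: u64 = 0;
    let mut count: i32 = 0;
    let mut truncated = false;
    for &b in s.iter().take_while(|b| b.is_ascii_digit()) {
        if count < 19 {
            mantissa = mantissa * 10 + (b - b'0') as u64;
        } else {
            truncated = true; // extra digits contribute only to the exponent
        }
        count += 1;
    }
    // each dropped digit scales the value by one more power of ten
    let exp_adjust = (count - 19).max(0);
    (mantissa, exp_adjust, truncated)
}

fn main() {
    let (m, adj, trunc) = parse_digits(b"12345678901234567890123");
    assert_eq!(m, 1234567890123456789); // first 19 digits
    assert_eq!(adj, 4); // 4 digits dropped
    assert!(trunc);
}
```

When `truncated` is set, the moderate path can try both `mantissa` and `mantissa + 1`; if both round to the same float, the truncation was harmless, which is the "good enough almost all of the time" case above.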

Alexhuszagh commented 3 years ago

Yep. My modifications keep Clinger's algorithm (since I don't believe anyone currently on the Rust core team has worked on dec2flt before), but move significant-digit parsing outside of the fast path and add error calculation for truncated values; only after that do we use the big-integer algorithms (which have major issues in nearly every facet, but that's a different story).

Ideally, we'd remove it, and replace it with your code, but I'm trying minor, incremental improvements at first. Ideally, my goals right now are:

  1. Fix disguised fast path cases.
  2. Improve the Bellerophon algorithm, to handle subnormal floats and avoid needlessly using big-integers.
  3. Fix the glacially slow algorithm_m and algorithm_r with easy-to-implement slow-path algorithms.
  4. Increase the size of the big-integer storage, so it can handle more than 385 digits (since we need up to 767, which means a lot of floats cannot be parsed by Rust).
  5. Remove the last remaining Clinger algorithm, Bellerophon, and replace it with the Eisel-Lemire algorithm based on the code above.
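As a hedged illustration of item 1: a "disguised" fast-path case is one where the exponent exceeds the exact-power-of-ten range, but the excess can be folded into the mantissa while it still fits exactly. The names and bounds below are f64-specific assumptions for exposition, not dec2flt's actual code:

```rust
// "Disguised" fast path, sketched: 10^q is only exact as f64 for q <= 22,
// but e.g. 123e24 can be rewritten as 12300e22, which the plain fast path
// then handles exactly.
fn disguised_fast_path(mantissa: u64, exp10: i32) -> Option<f64> {
    const MAX_EXACT_POW10: i32 = 22;
    if exp10 > MAX_EXACT_POW10 {
        // Fold the excess power of ten into the mantissa, if the result
        // still fits exactly in f64's 53-bit significand.
        let shift = exp10 - MAX_EXACT_POW10;
        let shifted = mantissa.checked_mul(10u64.checked_pow(shift as u32)?)?;
        if shifted > (1u64 << 53) {
            return None; // genuinely needs a slower path
        }
        return Some(shifted as f64 * 1e22);
    }
    None
}

fn main() {
    assert_eq!(disguised_fast_path(123, 24), Some(123e24));
    assert_eq!(disguised_fast_path(123, 40), None);
}
```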
Urgau commented 3 years ago

Ideally, we'd remove it, and replace it with your code, but I'm trying minor, incremental improvements at first. Ideally, my goals right now are:

1. Fix disguised fast path cases.

2. Improve the Bellerophon algorithm, to handle subnormal floats and avoid needlessly using big-integers.

3. Fix the glacially slow `algorithm_m` and `algorithm_r` with easy-to-implement slow-path algorithms.

4. Increase the size of the big-integer storage, so it can handle more than 385 digits
   (since we need up to 767, which means a lot of floats cannot be parsed by Rust).

5. Remove the last remaining Clinger algorithm, Bellerophon, and replace it with the Eisel-Lemire algorithm
   based on the code above.

From an outside point of view: wouldn't it be simpler and faster to just replace the actual Rust code with the code of fast-float?

As I see things, there are many faults in the Rust std implementation, and you are proposing a 5-step improvement (but not a total fix?). I think that instead of trying to patch the std Rust code toward an ideal, we should take the fastest road and completely replace the suboptimal and (apparently) broken implementation with a working one: in this case fast-float, or another crate if it fits the needs of the std library better.

Alexhuszagh commented 3 years ago

@Urgau Sure, if the maintainers are up for that. There are a few issues to keep in mind, though:

  1. The original implementer is no longer working on Rust, so I believe no one is deeply familiar with the codebase.
  2. There are a few internal assumptions that need to be addressed prior to making these changes.
  3. The changes required here are relatively trivial, and therefore easy to justify, prior to making larger changes.

If you can justify the changes, that would be ideal. I'm going to test the binary sizes, but this crate should actually be pretty lean (likely leaner than core). Also, sure, it's not a total fix, but it's an improvement of something like 10,000x for some floats, which is not insignificant.

Alexhuszagh commented 3 years ago

Ok, I've run the numbers, and the binary sizes are practically identical; fast-float is in fact slightly smaller.

fast-float

opt-level   size   size (stripped)
0           3.3M   324K
1           3.3M   300K
2           1.3M   248K
3           1.3M   252K
s           1.3M   244K
z           1.3M   248K

core

opt-level   size   size (stripped)
0           3.6M   360K
1           3.5M   316K
2           1.3M   236K
3           1.3M   248K
s           1.3M   244K
z           1.3M   248K

If you can get the core team to approve of it, it would be a great addition to the core library. I'm still going to try to get incremental improvements implemented, for the reasons above (IMO, it's easier to justify), but ideally, this should be the end goal.

You would likely, however, have to use the bignum implementation (https://github.com/rust-lang/rust/blob/master/library/core/src/num/bignum.rs) for the Decimal struct.

As an FYI: I would recommend forking my dec2flt fork of Rust's num module if you're going to do this, since it allows you to implement these changes within the context of the Rust core library without having to build Rust from scratch every time. It's a minimal fork of the dec2flt library, in the context of the Rust core library, and should make it quick and easy to make changes, diffs, and then incorporate them back into the core library.

Urgau commented 3 years ago

@Urgau Sure, if the maintainers are up for that.

Posted a question on the Zulip stream for T-Libs.

Ok, I've run the numbers, and the binary sizes are practically identical; fast-float is in fact slightly smaller.

That's great news.

As an FYI: I would recommend forking my dec2flt fork of Rust's num module if you're going to do this, since it allows you to implement these changes within the context of the Rust core library without having to build Rust from scratch every time. It's a minimal fork of the dec2flt library, in the context of the Rust core library, and should make it quick and easy to make changes, diffs, and then incorporate them back into the core library.

Thanks for the tips, but I'm not sure I'm qualified to attempt the integration in std. In fact, I barely understand how floats are stored internally. I will read the academic paper, but I'm not sure it will be sufficient.

Urgau commented 3 years ago

@Alexhuszagh I have received a response from Josh Triplett (who is co-lead of the Rust Libs Team). He said here:

[..] this seems like a great idea to me [...]

So if you want to try doing the pull request replacing dec2flt with the fast-float code, go for it.

There is, however, a legal concern, also raised by Josh.

fast-float-rust says it's directly based on the C++ library. Is the Rust implementation derived from the C++ code or the algorithm in the paper? Asking because the C++ code is just Apache-2.0, but the Rust code says MIT/Apache-2.0.

If the Rust code were a derived work of the C++, it wouldn't be able to add the MIT dual license. (And the C++ code has had third-party contributions, so it isn't just the original author who would need to grant permission for that.)

@aldanor @lemire Would you mind clarifying the situation for Josh Triplett, here or directly on Zulip, concerning the licenses?

Also, I have been acting as though you were okay with your implementation becoming part of the Rust standard library, without ever asking you. So, do you agree to your implementation being part of the Rust standard library?

lemire commented 3 years ago

I have zero ownership of anything. I am just an interested fellow.

lemire commented 3 years ago

If anyone wants to chat about the algorithm, I am available (via Zoom). I did not write a single line of Rust for this project, though. I am just around to help.

Alexhuszagh commented 3 years ago

Thanks for the tips, but I'm not sure I'm qualified to attempt the integration in std. In fact, I barely understand how floats are stored internally. I will read the academic paper, but I'm not sure it will be sufficient.

I was more so responding to the author of the crate in this section, sorry if that wasn't clear.

aldanor commented 3 years ago

I'm also "just an interested fellow", in a way - as Daniel is the true owner of the algorithm, and I was porting the C++ implementation to Rust (with some minor implementational changes and updates here and there) while also having the paper by my side as a reference.

If there are any license concerns - we can sure resolve them; I'm pretty sure all of us are interested first and foremost in this to succeed. We can either make the original library Apache2/MIT (which would require all contributors to sign off on that if I understand it correctly), or make this crate Apache2-only. Not a big deal.

If Josh and the rest of the core team are happy to go ahead with this - should we give it a go? I'd be happy to have a separate zoom/zulip/whatever discussion if needed. Never had experience of contributing anything major to rustc myself, but as long as there's some guidance re: what needs to go where, I think I can try.

Alexhuszagh commented 3 years ago

@aldanor Rust is dual-licensed under both MIT and Apache-2.0, so I believe we can keep the original license intact. I'm more than happy to have a Zoom call if needed, or chat on Zulip. I've previously had a failed attempt at integrating changes into the core library (personal reasons forced me to take a step back), but I'm tangentially familiar with the process. I would be more than happy to contribute any help I can.

Alexhuszagh commented 3 years ago

Looking carefully at the implementation, there's likely a few things that would additionally need to be done to integrate this into Rust core.

  1. Validate digits occur after the exponent symbol.
  2. Ensure there are significant digits at all (https://github.com/aldanor/fast-float-rust/blob/4680af7666df5b341c6bc46ab7a112ce1adea615/src/number.rs#L122-L166). Rust does not allow floats of the format e10, .e10, \, +, or .. You currently only handle + and \.
  3. Case-sensitive matches on special values.

A simple playground of all these examples is here.
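A rough sketch of the digit-presence rules in points 1 and 2 (illustrative only: ASCII input assumed, special values like inf/NaN ignored for brevity, and this is not the crate's actual parser):

```rust
// Validate that a float string has at least one digit in the mantissa
// (integral + fractional parts combined) and, if an exponent marker is
// present, at least one digit after it.
fn validate(s: &str) -> bool {
    let bytes = s.as_bytes();
    let mut i = 0;
    if matches!(bytes.first(), Some(&b'+') | Some(&b'-')) {
        i += 1; // optional sign
    }
    let digits = |b: &[u8]| b.iter().take_while(|c| c.is_ascii_digit()).count();
    let int_digits = digits(&bytes[i..]);
    i += int_digits;
    let mut frac_digits = 0;
    if bytes.get(i) == Some(&b'.') {
        i += 1;
        frac_digits = digits(&bytes[i..]);
        i += frac_digits;
    }
    if int_digits + frac_digits == 0 {
        return false; // rejects ".", "e10", ".e10", "+"
    }
    if matches!(bytes.get(i), Some(&b'e') | Some(&b'E')) {
        i += 1;
        if matches!(bytes.get(i), Some(&b'+') | Some(&b'-')) {
            i += 1;
        }
        let exp_digits = digits(&bytes[i..]);
        if exp_digits == 0 {
            return false; // rejects "1.23e"
        }
        i += exp_digits;
    }
    i == bytes.len() // no trailing garbage
}

fn main() {
    assert!(validate("1.23e5"));
    assert!(!validate("e10"));
    assert!(!validate("1.23e"));
    assert!(!validate("."));
}
```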

Urgau commented 3 years ago

Looking carefully at the implementation, there's likely a few things that would additionally need to be done to integrate this into Rust core.

  1. Validate digits occur after the exponent symbol.
  2. Ensure there are significant digits at all. Rust does not allow floats of the format e10, .e10, \, +, or .. You currently only handle + and \.

These don't seem to be allowed by the IEEE 754 standard either.

  3. Case-sensitive matches on special values.

This was a bug in the implementation, which was fixed on nightly some weeks ago. See here and also here.

Alexhuszagh commented 3 years ago

Ah cool, then only 1) and 2) are necessary.

aldanor commented 3 years ago

Hmm..

fn main() {
    for s in &[
        "e10", "1.23", "1.23e", "1.", ".1", "1.23e5", ".", ".e1", "0e", "0e1",
    ] {
        println!("{}", s);
        println!("  std: {:?}", s.parse::<f64>());
        println!("  ff:  {:?}", fast_float::parse::<f64, _>(s));
    }
}

yields (as of the current version)

e10
  std: Err(ParseFloatError { kind: Invalid })
  ff:  Err(Error)
1.23
  std: Ok(1.23)
  ff:  Ok(1.23)
1.23e
  std: Err(ParseFloatError { kind: Invalid })
  ff:  Err(Error)
1.
  std: Ok(1.0)
  ff:  Ok(1.0)
.1
  std: Ok(0.1)
  ff:  Ok(0.1)
1.23e5
  std: Ok(123000.0)
  ff:  Ok(123000.0)
.
  std: Err(ParseFloatError { kind: Invalid })
  ff:  Err(Error)
.e1
  std: Err(ParseFloatError { kind: Invalid })
  ff:  Err(Error)
0e
  std: Err(ParseFloatError { kind: Invalid })
  ff:  Err(Error)
0e1
  std: Ok(0.0)
  ff:  Ok(0.0)
Alexhuszagh commented 3 years ago

Never mind, I missed this: *s = start. My fault. So we look good to go.

Urgau commented 3 years ago

@aldanor I have discussed with Josh, and the only way forward with the license is to have fast_float be dual-licensed under MIT/Apache.

I have opened a pull request in the fast_float repository to change the license.

If Josh and the rest of the core team are happy to go ahead with this - should we give it a go? I'd be happy to have a separate zoom/zulip/whatever discussion if needed. Never had experience of contributing anything major to rustc myself, but as long as there's some guidance re: what needs to go where, I think I can try.

Yep, you should give it a try. You should also read the rustc-dev-guide; there is some very useful information about compiling rustc (which shouldn't be necessary) and the standard library.

The files that are relevant for you are under the library/core/src/num/dec2flt directory. From what I can see, all of the files except mod.rs (which needs to be modified) can be deleted/replaced with your fast_float port.

lemire commented 3 years ago

+1

Alexhuszagh commented 3 years ago

Looking at the commits for fast-float, only @biojppm and @kitaisreal made commits prior to this port being made, so they would likely be the only ones who would need to agree to re-license their code so that an (older) version of fast-float could be dual-licensed, and therefore this crate dual-licensed.

We could ask permission from both of them. kitaisreal's contributions are trivial, so we could likely remove those changes if licensing becomes a major issue.

EDIT: Their name is "kitaisreal".

lemire commented 3 years ago

I have emailed @biojppm

Alexhuszagh commented 3 years ago

I've got all the data tests from Rust's core library on my own fork, and I'll be running them tonight on little-endian (32-bit) and big-endian (32-bit) architectures, since those are the only cases that haven't yet been comprehensively tested, IIRC. None of the logic should affect these cases, but it's nice to be extra sure.

I'll be running this branch with all the unittests within for the following two architectures:

This is going to take a while, since these tests basically trigger the slow algorithm every time, but it shouldn't be too bad.

Alexhuszagh commented 3 years ago

We have correctness issues on PowerPC 32-bit, BE. I'm going to run the tests again, but here's a few examples:

0.5000309541037827 ULP error on 1303e-20 (f64)
0.5000309541037827 ULP error on 2606e-20 (f64)
0.5000309541037827 ULP error on 5212e-20 (f64)

I'm going to check the cases to see if it could possibly be a soft-float emulation issue (these issues do happen; I've had to open a few patches in Qemu before). I'll have some specialized tests going to triage this, and I'll get back to you.

armv7-unknown-linux-gnueabihf ran without issue. All of these are examples of fast-path cases (and not disguised fast-path cases either), so it's likely this is a soft-float issue. A lot of stuff uses soft floats on PPC (the awful performance of hard floats on PPC makes soft floats faster), so this might not even be a Qemu issue; it might just be a PPC issue. A trivial way to check would be to try these same floats compiled for PPC as well.

Alexhuszagh commented 3 years ago

Ok I've triaged the issue.

I've added the following unit tests:

use fast_float::parse;

#[test]
fn test_f64_soft_float() {
    assert_eq!(
        parse::<f64, _>("1303e-20").unwrap(),
        f64::from_bits(0b11110001101110000010111000110111101101101111110001101110101000)
    );
    assert_eq!(
        parse::<f64, _>("2606e-20").unwrap(),
        f64::from_bits(0b11110001111110000010111000110111101101101111110001101110101000)
    );
    assert_eq!(
        parse::<f64, _>("5212e-20").unwrap(),
        f64::from_bits(0b11110010001110000010111000110111101101101111110001101110101000)
    );
}

And then compile analogous to this:

RUSTFLAGS='-C soft-float' cargo test
RUSTFLAGS='-C soft-float' cross test --target powerpc-unknown-linux-gnu
RUSTFLAGS='-C soft-float' cross test --target armv7-unknown-linux-gnueabihf

The following targets fail:

The following targets pass:

So it's not an endian issue, it seems to be a PowerPC issue. Looking at the bit patterns of the power (e-17), using the pow10_fast_path we get:

// Code
// println!("e17={:?}", (f64::pow10_fast_path(17)).to_bits());
// println!("e-17={:?}", (1.0/f64::pow10_fast_path(17)).to_bits());

// MIPS, ARMv7, etc.
e17=4861130398305394688
e-17=4352464011485697175

// PowerPC
e17=4861130398305394688
e-17=4352464011485697175

So the issue doesn't manifest in simple cases. When doing a little more poking, we get:

// Code
// println!("1.303e17={:?}", (1.303 * f64::pow10_fast_path(17)).to_bits());
// println!("1.303e-17={:?}", (1.303/f64::pow10_fast_path(17)).to_bits());

// MIPS, ARMv7, etc.
1.303e17=4863024148305394688
1.303e-17=4354430593920867240

// PowerPC
1.303e17=4863024148305394688
1.303e-17=4354430593920867240

Which is likewise identical. However, we get a 1 ULP error with the following:

// Code
// println!("1e20={:?}", (1.0 * f64::pow10_fast_path(20)).to_bits());
// println!("1e-20={:?}", (1.0/f64::pow10_fast_path(20)).to_bits());
// println!("1303e20={:?}", (1303.0 * f64::pow10_fast_path(20)).to_bits());
// println!("1303e-20={:?}", (1303.0/f64::pow10_fast_path(20)).to_bits());

// PowerPC
1e20=4906019910204099648
1e-20=4307583784117748259
1303e20=4952718876067038006
1303e-20=4354430593920867241

This seems to be an error in float division on PowerPC systems. I'm going to quickly check if other systems have similar issues.

Alexhuszagh commented 3 years ago

This seems to be a PPC bug or a Qemu bug, so I'm not too worried about it. I used both -mhard-float in GCC and -C soft-float in rustc, and both produced the same error. It's either an artifact of Qemu, in which case, who cares; or a bug in PPC, in which case, who cares. It's not an issue with the library.

#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main() {
  double num = 1303.0;
  double den = 1e20;
  double value = num / den;
  uint64_t bits;
  memcpy(&bits, &value, 8);
  printf("%" PRIu64 "\n", bits);

  return 0;
}
# powerpc-linux-gnu-gcc main.c -o main -std=c99
# qemu-ppc ./main
4354430593920867241

So.... I think we're good to go... 🎉🎉

Urgau commented 3 years ago

So.... I think we're good to go... 🎉🎉

Great! Happy to see this moving so fast.

lemire commented 3 years ago

I have an actual POWER9 processor. I can give you guys access, just email me at lemire@gmail.com.

I get ...

$ gcc main.c -o main -std=c99
$ ./main
4354430593920867240

This is an actual POWER9 system.

lemire commented 3 years ago

@Alexhuszagh I have an actual POWER9 machine. I can either run the tests for you if you give me the command lines you want me to run (ideally within a docker container) or else you can email me at lemire@gmail.com and I can give you ssh access.

Alexhuszagh commented 3 years ago

@lemire I think we're good then. Seems like an artifact. Can you create the assembly using:

gcc -S main.c
more main.s

That's all I really need to know. All the magic numbers seem to be correct, so I think this is just a Qemu bug (time for me to file another issue). I'd expect something like the following:

; The actual load, fdiv, and store instructions
; ....

; The magic numbers, this is what we'd expect. LC0 is 
; 1303.0, and LC1 is 1e20, just in two 32-bit words.

; 1303.0 == transmuting ((1083464704 << 32) | 0)
.LC0:
        .long   0          
        .long   1083464704 
        .align 8

; 1e20 == transmuting ((1142271773 << 32) | 2025163840)
.LC1:
        .long   2025163840 
        .long   1142271773

I'm generating that assembly; there doesn't seem to be a precision flag. There's nothing weird like the x87 here; this is all handled correctly by the f* instructions (fdiv for 64-bit, fdivs for 32-bit).

(screenshot of x87 weirdness omitted)

lemire commented 3 years ago

I am not an expert in POWER assembly, but here is the dump. There is the fdiv instruction. It is hard to read, but if I compile just...

double divide(double num, double den) {
 return num / den;
}

I get effectively a single instruction...

fdiv 1,1,2

Note that I would be very surprised (and disappointed) if POWER processors did not follow the IEEE standard and common practices to the T. IBM takes standards seriously.

$ more main.s
    .file   "main.c"
    .abiversion 2
    .section    ".text"
    .section    .rodata
    .align 3
.LC2:
    .string "%lu\n"
    .section    ".text"
    .align 2
    .globl main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
.LCF0:
0:  addis 2,12,.TOC.-.LCF0@ha
    addi 2,2,.TOC.-.LCF0@l
    .localentry main,.-main
    mflr 0
    std 0,16(1)
    std 31,-8(1)
    stdu 1,-144(1)
    .cfi_def_cfa_offset 144
    .cfi_offset 65, 16
    .cfi_offset 31, -8
    mr 31,1
    .cfi_def_cfa_register 31
    addis 9,2,.LC0@toc@ha
    addi 9,9,.LC0@toc@l
    lfd 0,0(9)
    stfd 0,96(31)
    addis 9,2,.LC1@toc@ha
    addi 9,9,.LC1@toc@l
    lfd 0,0(9)
    stfd 0,104(31)
    lfd 12,96(31)
    lfd 0,104(31)
    fdiv 0,12,0
    stfd 0,112(31)
    ld 9,112(31)
    std 9,120(31)
    ld 9,120(31)
    mr 4,9
    addis 3,2,.LC2@toc@ha
    addi 3,3,.LC2@toc@l
    bl printf
    nop
    li 9,0
    mr 3,9
    addi 1,31,144
    .cfi_def_cfa 1, 0
    ld 0,16(1)
    mtlr 0
    ld 31,-8(1)
    blr
    .long 0
    .byte 0,0,0,1,128,1,0,1
    .cfi_endproc
.LFE0:
    .size   main,.-main
    .section    .rodata
    .align 3
.LC0:
    .long   0
    .long   1083464704
    .align 3
.LC1:
    .long   2025163840
    .long   1142271773
    .ident  "GCC: (Debian 8.3.0-6) 8.3.0"
    .section    .note.GNU-stack,"",@progbits
Alexhuszagh commented 3 years ago

@lemire The magic numbers (which are what's important; the rest is just load, store, and division instructions) are all the same: this is a Qemu bug. Thanks for verifying. Triaging further, it seems this was fixed in Qemu 4.2.1, so this is specifically a cross issue, which I've filed.

As far as correctness goes, as long as we use the FPU precision flags for x87 that Rust already has, we're good to go!

Urgau commented 3 years ago

My pull request for dual-licensing fast_float has been merged. We're now also good to go with licensing. 🎉

Alexhuszagh commented 3 years ago

I've expanded my tests, by the way, to:

I think we can guarantee that we have complete correctness on every Tier 1 and Tier 2 target.

aldanor commented 3 years ago

Here's another concern I have: let's say we integrate this into core (and maybe, although it seems doubtful at the moment, we even get a from_parts() API in core).

What happens to parse_partial()? It is completely irreplaceable in many parsing scenarios, and it's really the main entry point here.
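For context on why this matters: parse_partial returns the parsed value plus the number of bytes consumed, which core's FromStr-based f64 parsing cannot do, since it rejects any trailing input. Here's a std-only illustration of the semantics (a naive O(n^2) prefix scan purely for exposition; parse_prefix is a hypothetical helper, and fast-float does this in a single pass):

```rust
// Find the longest prefix of `s` that parses as an f64, returning the
// value and the number of bytes consumed (ASCII input assumed).
fn parse_prefix(s: &str) -> Option<(f64, usize)> {
    for end in (1..=s.len()).rev() {
        if let Ok(v) = s[..end].parse::<f64>() {
            return Some((v, end));
        }
    }
    None
}

fn main() {
    // e.g. scanning a CSV-like buffer one field at a time:
    assert_eq!(parse_prefix("1.23,4.56"), Some((1.23, 4)));
    assert_eq!(parse_prefix("abc"), None);
}
```

A parser that consumes a float and reports where it stopped is exactly what CSV/JSON-style tokenizers need, which is why this entry point is hard to replace with a core API that parses whole strings only.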

Alexhuszagh commented 3 years ago

@aldanor That's a pretty good reason to maintain the library, IMO. One of my realizations early on with lexical was that there were a number of use-cases that would never make it into core, so I think there's a good rationale for the continued existence of this library from a generalized-parsing perspective. Still, numerous users don't use third-party crates for a lot of functionality, whether by choice or simply due to a lack of awareness, so the number of people who would benefit from such inclusion would be massive.

aldanor commented 3 years ago

Yeah, that's my thought too. Coexistence of two (perhaps slightly different) versions of the same code to be maintained is not the most cheerful prospect, of course... although if from_parts() is accepted, then at least that part wouldn't have to be duplicated.

Urgau commented 3 years ago

@Alexhuszagh @aldanor I plan to write an update on Zulip, but there has been no interaction in the past week; is there any update regarding the integration in core?

Also, do you need help and/or mentoring? If so, I can ask on Zulip.

Alexhuszagh commented 3 years ago

@Urgau I've been doing some enhancements on this crate, fast_float (the C++ implementation), and my own crate lexical. I'm more than happy to spearhead the integration into core if that's okay with you, @aldanor? I was going to defer to you, since it's your crate, but if you're fine with me taking the lead, I will gladly put in the work.

aldanor commented 3 years ago

@Urgau @Alexhuszagh I was planning to start looking into that in detail this weekend, unfortunately I didn't have time up to this point for personal reasons :/

@Alexhuszagh I won't mind of course, that sounds great, and you have more experience with num/dec2flt since you've already played around with it. I would be glad to participate in reviewing then.

Also: if there are any other enhancements you had in mind (like the endian stuff etc.), perhaps now is the best time to do as much of that as possible on this crate, before integration into core, so that this crate and its future stdlib version diverge as little as possible?

Alexhuszagh commented 3 years ago

@aldanor My last changes (I've been looking at both this crate and the C++ implementation extensively) are covered by my open PRs. I have no further improvements planned, except changes to add x87 support when integrating into core (which isn't possible in a third-party crate).

So if all is good, I'm more than happy to spearhead this. My endian changes have been merged in fast_float, and once they're merged here, I believe we should be good to go.

Urgau commented 3 years ago

@Urgau @Alexhuszagh I was planning to start looking into that in detail this weekend, unfortunately I didn't have time up to this point for personal reasons :/

No problem, take your time.

@Alexhuszagh I won't mind of course, that sounds great, and you have more experience with num/dec2flt since you've already played around with it. I would be glad to participate in reviewing then.

This seems to be a good idea.

Also: if there's any other enhancements you had in mind (like endian stuff etc), perhaps now is the best time to do as much of that as possible on this crate before integration into core so this crate and its future stdlib version would diverge as little as possible?

Should we consider having this crate as a subtree, so that development can still continue here, with subtree updates once in a while?

So if all is good, I'm more than happy to spearhead this. My endian changes have been merged in fast_float, and once they're merged here, I believe we should be good to go.

That's great news. 😀

Alexhuszagh commented 3 years ago

I've set up a tracking branch for the changes here. I'll be using it to create a diff that I'll then apply to rust-lang proper (to avoid re-compiling Rust every time). There's going to have to be one significant change from the source code (which isn't actually that significant, in the grand scheme of things):

This is a pretty trivial fix, but we do need to be careful: the exponent in diy_float::Fp is not biased, while AdjustedMantissa's is.
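As a hedged sketch of what that bias difference means in practice (the function name here is illustrative, not an actual dec2flt definition): an Eisel-Lemire-style output carries the biased IEEE-754 exponent field directly, so assembling the f64 is pure bit-packing, whereas an unbiased base-2 exponent must first have the f64 bias of 1023 added.

```rust
// Pack a 52-bit mantissa field and an already-biased exponent field into
// an f64 (sign assumed positive). An unbiased exponent would need +1023
// applied before calling this.
fn f64_from_biased(mantissa: u64, power2: i32) -> f64 {
    // f64 layout: 1 sign bit | 11 exponent bits | 52 mantissa bits
    f64::from_bits(((power2 as u64) << 52) | (mantissa & ((1u64 << 52) - 1)))
}

fn main() {
    // 1.5 = (1 + 0.5) * 2^0: biased exponent 1023, top mantissa bit set
    assert_eq!(f64_from_biased(1u64 << 51, 1023), 1.5);
    // 1.0: unbiased exponent 0, so the biased field is 0 + 1023
    assert_eq!(f64_from_biased(0, 0 + 1023), 1.0);
}
```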