aldanor / fast-float-rust

Super-fast float parser in Rust (now part of Rust core)
https://docs.rs/fast-float
Apache License 2.0
275 stars 20 forks source link

Added read_u64 optimizations to big endian. #27

Closed Alexhuszagh closed 3 years ago

Alexhuszagh commented 3 years ago

Fixes #26.

Rationale

This should produce the same byte-code, and remove all endian-dependent codepaths, given that the following are true:

  1. u64::from_le and u64::to_le are no-ops on little-endian architectures.
  2. u64::from_le and u64::to_le are very cheap on big-endian architectures.
  3. ptr::read_unaligned and ptr::write_unaligned are identical to ptr::copy_nonoverlapping(src, dst, mem::size_of::<T>())

The first 2 are trivial to show that they're true:

For 3, we can see that read_unaligned is effectively identical to ptr::copy_nonoverlapping(src, dst, mem::size_of::<T>()), as long as MaybeUninit compiles down to no instructions.

Using the following source, we can see they're identical (on little-endian systems):

use std::ptr;

pub fn write_u64_v1(bytes: &mut [u8], value: u64) {
    let src = &value as *const _ as *const u8;
    let dst = bytes.as_mut_ptr();
    unsafe { ptr::copy_nonoverlapping(src, dst, 8) };
}

pub fn write_u64_v2(bytes: &mut [u8], value: u64) {
    let dst = bytes.as_mut_ptr() as *mut u64;
    unsafe { ptr::write_unaligned(dst, u64::to_le(value)) };
}

pub fn read_u64_v1(bytes: &[u8]) -> u64 {
    let mut value = 0_u64;
    let src = bytes.as_ptr();
    let dst = &mut value as *mut _ as *mut u8;
    unsafe { ptr::copy_nonoverlapping(src, dst, 8) };
    value
}

pub fn read_u64_v2(bytes: &[u8]) -> u64 {
    let src = bytes.as_ptr() as *const u64;
    u64::from_le(unsafe { ptr::read_unaligned(src) })
}

Compiled with -C opt-level=3, we can see the x86_64 assembly is identical.

example::read_u64_v1:
        mov     rax, qword ptr [rdi]
        ret

example::read_u64_v2:
        mov     rax, qword ptr [rdi]
        ret

example::write_u64_v1:
        mov     qword ptr [rdi], rdx
        ret

example::write_u64_v2:
        mov     qword ptr [rdi], rdx
        ret

This also includes tests to ensure that both big-endian and little-endian systems read the bytes the same way.

Correctness Concerns

Should be non-existent, since as long as the value is read and written to the same native integer, then all the integer operations will produce the same result no matter the byte-order of the architecture. Tests using b"01234567" are included for both read_u64 and write_u64, which should confirm it produces the integer 0x3736353433323130. If we did not use to_le and from_le, we'd expect the opposite byte-order, or 0x3031323334353637 (which would correspond to bytes of b"76543210" in little-endian). In short, we've confirmed we've gotten the proper result, and we've provided a significant optimization for big-endian architectures, and simplified a few functions.

Alternatives

We could change all the masks and operations to check if the digits are correct to big-endian, however, this might require some additional effort to check correctness, and might require changes in many more locations. Since swapping the byte-order of an integer is effectively free in the grand scheme of things, this should be satisfactory.

Benchmarks

The benchmarks on big-endian are emulated via Qemu, and therefore should be taken with a grain of salt. However, the performance for little-endian systems is identical, and the (emulated) performance improves for big-endian systems.

Little-Endian (Native), read_u64

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            27.13    27.29    27.66    28.09    28.56    29.90    44.77 |
| lexical               75.72    76.36    76.86    77.39    78.48    80.79   100.68 |
| from_str             200.21   200.92   201.65   202.70   204.25   209.90   314.91 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            22.34    33.51    35.02    35.61    36.15    36.65    36.86 |
| lexical                9.93    12.39    12.75    12.92    13.01    13.10    13.21 |
| from_str               3.18     4.77     4.90     4.93     4.96     4.98     4.99 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           388.73   583.04   609.39   619.59   629.11   637.69   641.38 |
| lexical              172.84   215.56   221.78   224.87   226.42   227.89   229.81 |
| from_str              55.26    82.92    85.20    85.85    86.29    86.61    86.92 |
|                                                                                   |
=====================================================================================

Little-Endian (Native), master

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            27.12    27.30    27.66    28.01    28.41    29.42    36.89 |
| lexical               75.76    75.98    76.48    76.98    77.95    81.02    96.75 |
| from_str             200.38   201.01   201.69   202.55   204.46   209.63   230.14 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            27.10    34.00    35.20    35.70    36.16    36.63    36.88 |
| lexical               10.34    12.35    12.83    12.99    13.08    13.16    13.20 |
| from_str               4.35     4.77     4.89     4.94     4.96     4.98     4.99 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           471.66   591.63   612.61   621.26   629.17   637.35   641.71 |
| lexical              179.86   214.90   223.25   226.07   227.53   229.01   229.69 |
| from_str              75.61    83.04    85.11    85.91    86.28    86.57    86.84 |
|                                                                                   |
=====================================================================================

Big-Endian (powerpc-unknown-linux-gnu), read_u64

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           239.00   240.00   240.81   241.40   242.22   245.25   270.75 |
| lexical              600.54   603.88   607.95   614.02   617.52   629.30   859.01 |
| from_str            1318.93  1325.09  1328.26  1331.44  1335.09  1349.77  1497.10 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float             3.69     4.08     4.13     4.14     4.15     4.17     4.18 |
| lexical                1.16     1.59     1.62     1.63     1.64     1.66     1.67 |
| from_str               0.67     0.74     0.75     0.75     0.75     0.75     0.76 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            64.27    70.96    71.84    72.09    72.26    72.51    72.81 |
| lexical               20.26    27.67    28.18    28.34    28.62    28.82    28.98 |
| from_str              11.62    12.89    13.03    13.07    13.10    13.13    13.19 |
|                                                                                   |
=====================================================================================

Big-Endian (powerpc-unknown-linux-gnu), master

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           259.11   261.30   262.17   262.88   263.73   267.88   302.34 |
| lexical              613.42   614.97   616.32   617.60   619.06   624.77   672.53 |
| from_str            1319.05  1328.78  1351.89  1357.88  1361.90  1374.38  1481.66 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float             3.31     3.73     3.79     3.80     3.81     3.83     3.86 |
| lexical                1.49     1.60     1.62     1.62     1.62     1.63     1.63 |
| from_str               0.67     0.73     0.73     0.74     0.74     0.75     0.76 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            57.56    64.96    65.98    66.20    66.38    66.60    67.16 |
| lexical               25.87    27.86    28.11    28.18    28.23    28.30    28.37 |
| from_str              11.74    12.66    12.78    12.82    12.87    13.10    13.19 |
|                                                                                   |
=====================================================================================
aldanor commented 3 years ago

Thanks, that's great! (and read/write functions are now quite neat and elegant)

(These were the only places left from the original codebase with endian-dependent code)

Alexhuszagh commented 3 years ago

Technically, 1 more thing: in decimal.rs, the following does not need to read to the native byteorder:

let v = s.read_u64();
if !is_8digits(v) {
    break;
}
d.digits[d.num_digits..].write_u64(v - 0x3030_3030_3030_3030);
d.num_digits += 8;
s = s.advance(8);

This is because the data isn't parsed, only read, converted from characters to digits, and then written. If this is done on big-endian systems without a byte-swap, we would get the following for b"12345678":

let v = s.read_u64_ne(); // BE: 0x3132333435363738, LE: 0x3837363534333231
if !is_8digits(v) {      // Always false, since we've just reversed the order of the characters in `v`.
    break;
}
// Always writes [1, 2, 3, 4, 5, 6, 7, 8];
d.digits[d.num_digits..].write_u64_ne(v - 0x3030_3030_3030_3030);
d.num_digits += 8;
s = s.advance(8);

In short, a fairly minor optimization could be read_u64_ne, read_u64_le, and write_u64_ne, where only read_u64_le would have a byte-swap. However, in practice, this at most omits a single, very fast instruction, on a code-path that's very rare and slow.

aldanor commented 3 years ago

So we're basically wasting two bswaps per 8 chars on BE on a slow path? Yea, can be probably ignored.

All in all, looks good, let's merge this then.