Closed: leni536 closed this 1 year ago
Looks promising!!!
I did some benchmarking using https://github.com/lemire/simple_fastfloat_benchmark
main branch (b1d27734c5443c91847a8fc36e0e32ef23fc9f38):

```
$ build/benchmarks/benchmark
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
volume = 2.09808 MB
netlib           :  307.67 MB/s (+/- 1.3 %) 14.66 Mfloat/s
doubleconversion :  264.45 MB/s (+/- 6.6 %) 12.60 Mfloat/s
strtod           :  226.77 MB/s (+/- 0.3 %) 10.81 Mfloat/s
abseil           :  516.82 MB/s (+/- 0.2 %) 24.63 Mfloat/s
fastfloat        : 1387.79 MB/s (+/- 0.3 %) 66.15 Mfloat/s

$ build/benchmarks/benchmark32
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
volume = 2.09808 MB
strtof           :  207.98 MB/s (+/- 0.3 %)  9.91 Mfloat/s
abseil           :  514.02 MB/s (+/- 0.9 %) 24.50 Mfloat/s
fastfloat        : 1502.88 MB/s (+/- 0.3 %) 71.63 Mfloat/s
```
this PR (https://github.com/leni536/fast_float/commit/6d2fb68f5c928f75642762b235875cfbbd807687):

```
$ build/benchmarks/benchmark
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
volume = 2.09808 MB
netlib           :  306.21 MB/s (+/- 0.2 %) 14.59 Mfloat/s
doubleconversion :  273.84 MB/s (+/- 6.6 %) 13.05 Mfloat/s
strtod           :  226.31 MB/s (+/- 0.9 %) 10.79 Mfloat/s
abseil           :  513.81 MB/s (+/- 1.6 %) 24.49 Mfloat/s
fastfloat        : 1486.56 MB/s (+/- 0.3 %) 70.85 Mfloat/s

$ build/benchmarks/benchmark32
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
volume = 2.09808 MB
strtof           :  207.30 MB/s (+/- 0.4 %)  9.88 Mfloat/s
abseil           :  532.52 MB/s (+/- 0.3 %) 25.38 Mfloat/s
fastfloat        : 1485.83 MB/s (+/- 0.4 %) 70.82 Mfloat/s
```
Compiler: gcc-12 (Debian 12.2.0-13) 12.2.0. CPU: AMD Ryzen 5 3600.
It looks like double-conversion throughput increased by around 7%, while float-conversion throughput decreased by around 1%.
It might be worth checking the two code changes (right-sized uint and ternary) independently. I'm happy to do that soon.
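For readers following along, here is a rough sketch of the "ternary" style of float assembly being discussed. This is hypothetical illustration code, not the actual fast_float implementation; the function and parameter names are invented:

```cpp
#include <cstdint>
#include <cstring>

// Assemble an IEEE-754 binary64 value from its parts. The sign bit is set
// with a ternary expression instead of an if-branch, in the spirit of the
// change under discussion (all names here are made up for illustration).
double assemble_double(bool negative, uint64_t mantissa, uint64_t biased_exponent) {
    uint64_t word = mantissa;                              // 52 low bits: fraction
    word |= biased_exponent << 52;                         // 11 exponent bits above them
    word |= (negative ? uint64_t(1) : uint64_t(0)) << 63;  // sign bit, branch-free
    double value;
    std::memcpy(&value, &word, sizeof value);              // type-pun without UB
    return value;
}
```

With a zero mantissa and biased exponent 1023 (the binary64 exponent bias) this yields 1.0; flipping `negative` yields -1.0.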
@leni536 Can you run this benchmark in privileged mode (e.g., with sudo) so that you get the performance counters?
Here is what I get on my MacBook (ARM M2, LLVM 14).

Main:
```
fastfloat : 1879.51 MB/s (+/- 0.7 %) 89.58 Mfloat/s 13.00 i/B 286.10 i/f (+/- 0.0 %) 1.73 c/B 38.07 c/f (+/- 0.4 %) 7.52 i/c 3.41 GHz
```
Your PR:
```
fastfloat : 1887.33 MB/s (+/- 0.6 %) 89.96 Mfloat/s 13.00 i/B 286.11 i/f (+/- 0.0 %) 1.72 c/B 37.91 c/f (+/- 0.4 %) 7.55 i/c 3.41 GHz
```
Main:
```
fastfloat : 1948.01 MB/s (+/- 3.5 %) 92.85 Mfloat/s 12.50 i/B 275.07 i/f (+/- 0.0 %) 1.67 c/B 36.67 c/f (+/- 0.9 %) 7.50 i/c 3.41 GHz
```
Your PR:
```
fastfloat : 2003.42 MB/s (+/- 4.2 %) 95.49 Mfloat/s 12.46 i/B 274.07 i/f (+/- 0.0 %) 1.67 c/B 36.72 c/f (+/- 0.9 %) 7.46 i/c 3.51 GHz
```
If I use the command `./build/benchmarks/benchmark -f data/canada.txt`, I get...

Main branch:
```
fastfloat : 1414.85 MB/s (+/- 1.5 %) 81.31 Mfloat/s 16.15 i/B 294.68 i/f (+/- 0.0 %) 2.35 c/B 42.95 c/f (+/- 1.0 %) 6.86 i/c 3.49 GHz
```
This PR:
```
fastfloat : 1424.45 MB/s (+/- 1.5 %) 81.86 Mfloat/s 16.15 i/B 294.68 i/f (+/- 0.0 %) 2.34 c/B 42.77 c/f (+/- 1.2 %) 6.89 i/c 3.50 GHz
```
> @leni536 Can you run this benchmark in privileged mode (e.g., with sudo) so that you get the performance counters?

main branch (b1d27734c5443c91847a8fc36e0e32ef23fc9f38):
```
fastfloat : 1390.38 MB/s (+/- 0.3 %) 66.27 Mfloat/s 11.73 i/B 258.07 i/f (+/- 0.0 %) 2.88 c/B 63.38 c/f (+/- 0.2 %) 4.07 i/c 4.20 GHz
```
This PR (https://github.com/leni536/fast_float/commit/6d2fb68f5c928f75642762b235875cfbbd807687):
```
fastfloat : 1487.16 MB/s (+/- 0.3 %) 70.88 Mfloat/s 11.82 i/B 260.07 i/f (+/- 0.0 %) 2.69 c/B 59.24 c/f (+/- 0.1 %) 4.39 i/c 4.20 GHz
```
main branch:
```
fastfloat : 1504.07 MB/s (+/- 0.3 %) 71.69 Mfloat/s 11.64 i/B 256.00 i/f (+/- 0.0 %) 2.66 c/B 58.56 c/f (+/- 0.2 %) 4.37 i/c 4.20 GHz
```
This PR:
```
fastfloat : 1483.64 MB/s (+/- 0.8 %) 70.71 Mfloat/s 11.73 i/B 258.00 i/f (+/- 0.0 %) 2.70 c/B 59.35 c/f (+/- 0.2 %) 4.35 i/c 4.20 GHz
```
Compiler: gcc-12 (Debian 12.2.0-13) 12.2.0. CPU: AMD Ryzen 5 3600.
@leni536 Do you understand why your PR adds two instructions per float parsed under GCC/x64? On 64-bit ARM, the instruction count is the same, or one less.
The following https://godbolt.org/z/ceYK4crde suggests that you would be saving instructions, not adding instructions.
> @leni536 Do you understand why your PR adds two instructions per float parsed under GCC/x64? On 64-bit ARM, the instruction count is the same, or one less.
> The following https://godbolt.org/z/ceYK4crde suggests that you would be saving instructions, not adding instructions.
I don't have any idea. I might look into the benchmark binaries to track this down.
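For anyone reproducing the comparison, the kind of sign-bit variants one might diff in a compiler explorer could be sketched like this (hypothetical code, not the actual library source):

```cpp
#include <cstdint>

// Branchy variant: the compiler may emit a conditional jump or a cmov.
uint64_t sign_with_branch(uint64_t word, bool negative) {
    if (negative) {
        word |= uint64_t(1) << 63;
    }
    return word;
}

// Ternary/arithmetic variant: typically compiles to a shift plus an or,
// with no branch at all.
uint64_t sign_with_ternary(uint64_t word, bool negative) {
    return word | (uint64_t(negative) << 63);
}
```

The two functions are semantically identical; comparing their assembly side by side, before and after inlining into a larger caller, is one way to see where an extra instruction could sneak in.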
I don't fully trust AMD performance counters (some of them are known to be inaccurate), so I reran the benchmarks on an Intel Pentium Silver N5030. It reproduces the same i/f metric, so the count appears to be correct.
We will be merging this for sure, because there is a gain, but it would be wise to figure out where the extra instructions come from before we do. It seems possible to improve this PR further if we have a better understanding of what the compiler does.
Here is a comparison of an instantiated `from_chars<double>` in the two versions: https://godbolt.org/z/hq988T4K6

An inlined call to `to_float` starts around line 2216 in the assembler output of the original. It seems that with inlining the PR version only saves one instruction, as opposed to a separately instantiated `to_float`, where it saves two.
It seems that there is a knock-on effect on register allocation in the whole surrounding function, which in turn might also affect the ordering of some blocks of code. Maybe there are a few places where a register needs to be zeroed that did not in the original, adding an instruction, but this is just speculation.

So my current hypothesis is that the PR version does not save much in terms of executed instruction count, and that after inlining, register allocation and perhaps some optimization heuristics happen to work slightly worse, adding instructions back in code unrelated to this change.
Fantastic. Ok. So I am merging this, because it appears to produce slightly faster code and is nicer to read.
It will be part of the next release.
Codegen demonstration:

gcc 12: https://godbolt.org/z/ceYK4crde
clang 15: https://godbolt.org/z/WTcqn8xGv
MSVC 19.14: https://godbolt.org/z/P3d8dv9Eq