Closed: leni536 closed this 1 year ago
Looks promising!!!
I did some benchmarking using https://github.com/lemire/simple_fastfloat_benchmark
main branch (b1d27734c5443c91847a8fc36e0e32ef23fc9f38):

```
$ build/benchmarks/benchmark
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
volume = 2.09808 MB
netlib           :  307.67 MB/s (+/- 1.3 %) 14.66 Mfloat/s
doubleconversion :  264.45 MB/s (+/- 6.6 %) 12.60 Mfloat/s
strtod           :  226.77 MB/s (+/- 0.3 %) 10.81 Mfloat/s
abseil           :  516.82 MB/s (+/- 0.2 %) 24.63 Mfloat/s
fastfloat        : 1387.79 MB/s (+/- 0.3 %) 66.15 Mfloat/s

$ build/benchmarks/benchmark32
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
volume = 2.09808 MB
strtof           :  207.98 MB/s (+/- 0.3 %)  9.91 Mfloat/s
abseil           :  514.02 MB/s (+/- 0.9 %) 24.50 Mfloat/s
fastfloat        : 1502.88 MB/s (+/- 0.3 %) 71.63 Mfloat/s
```
this PR (https://github.com/leni536/fast_float/commit/6d2fb68f5c928f75642762b235875cfbbd807687):

```
$ build/benchmarks/benchmark
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
volume = 2.09808 MB
netlib           :  306.21 MB/s (+/- 0.2 %) 14.59 Mfloat/s
doubleconversion :  273.84 MB/s (+/- 6.6 %) 13.05 Mfloat/s
strtod           :  226.31 MB/s (+/- 0.9 %) 10.79 Mfloat/s
abseil           :  513.81 MB/s (+/- 1.6 %) 24.49 Mfloat/s
fastfloat        : 1486.56 MB/s (+/- 0.3 %) 70.85 Mfloat/s

$ build/benchmarks/benchmark32
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
volume = 2.09808 MB
strtof           :  207.30 MB/s (+/- 0.4 %)  9.88 Mfloat/s
abseil           :  532.52 MB/s (+/- 0.3 %) 25.38 Mfloat/s
fastfloat        : 1485.83 MB/s (+/- 0.4 %) 70.82 Mfloat/s
```
Compiler: gcc-12 (Debian 12.2.0-13) 12.2.0. CPU: AMD Ryzen 5 3600.
It looks like double-conversion throughput increased by around 7%, while float-conversion throughput decreased by around 1%.
It might be worth checking the two code changes (right-sized uint and ternary) independently. I'm happy to do that soon.
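For readers following along, here is a rough sketch of the "ternary" style of float assembly being discussed. This is hypothetical illustration code, not the actual fast_float implementation; the function and parameter names are invented:

```cpp
#include <cstdint>
#include <cstring>

// Assemble an IEEE-754 binary64 value from its parts. The sign bit is set
// with a ternary expression instead of an if-branch, in the spirit of the
// change under discussion (all names here are made up for illustration).
double assemble_double(bool negative, uint64_t mantissa, uint64_t biased_exponent) {
    uint64_t word = mantissa;                              // 52 low bits: fraction
    word |= biased_exponent << 52;                         // 11 exponent bits above them
    word |= (negative ? uint64_t(1) : uint64_t(0)) << 63;  // sign bit, branch-free
    double value;
    std::memcpy(&value, &word, sizeof value);              // type-pun without UB
    return value;
}
```

With a zero mantissa and biased exponent 1023 (the binary64 exponent bias) this yields 1.0; flipping `negative` yields -1.0.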
@leni536 Can you run this benchmark in privileged mode (e.g., with sudo) so that you get the performance counters?
Here is what I get on my MacBook (ARM M2, LLVM 14).

Main:
```
fastfloat : 1879.51 MB/s (+/- 0.7 %) 89.58 Mfloat/s 13.00 i/B 286.10 i/f (+/- 0.0 %) 1.73 c/B 38.07 c/f (+/- 0.4 %) 7.52 i/c 3.41 GHz
```
Your PR:
```
fastfloat : 1887.33 MB/s (+/- 0.6 %) 89.96 Mfloat/s 13.00 i/B 286.11 i/f (+/- 0.0 %) 1.72 c/B 37.91 c/f (+/- 0.4 %) 7.55 i/c 3.41 GHz
```
Main:
```
fastfloat : 1948.01 MB/s (+/- 3.5 %) 92.85 Mfloat/s 12.50 i/B 275.07 i/f (+/- 0.0 %) 1.67 c/B 36.67 c/f (+/- 0.9 %) 7.50 i/c 3.41 GHz
```
Your PR:
```
fastfloat : 2003.42 MB/s (+/- 4.2 %) 95.49 Mfloat/s 12.46 i/B 274.07 i/f (+/- 0.0 %) 1.67 c/B 36.72 c/f (+/- 0.9 %) 7.46 i/c 3.51 GHz
```
If I use the command `./build/benchmarks/benchmark -f data/canada.txt`, I get...

Main branch:
```
fastfloat : 1414.85 MB/s (+/- 1.5 %) 81.31 Mfloat/s 16.15 i/B 294.68 i/f (+/- 0.0 %) 2.35 c/B 42.95 c/f (+/- 1.0 %) 6.86 i/c 3.49 GHz
```
This PR:
```
fastfloat : 1424.45 MB/s (+/- 1.5 %) 81.86 Mfloat/s 16.15 i/B 294.68 i/f (+/- 0.0 %) 2.34 c/B 42.77 c/f (+/- 1.2 %) 6.89 i/c 3.50 GHz
```
> @leni536 Can you run this benchmark in privileged mode (e.g., with sudo) so that you get the performance counters?

main branch (b1d27734c5443c91847a8fc36e0e32ef23fc9f38):
```
fastfloat : 1390.38 MB/s (+/- 0.3 %) 66.27 Mfloat/s 11.73 i/B 258.07 i/f (+/- 0.0 %) 2.88 c/B 63.38 c/f (+/- 0.2 %) 4.07 i/c 4.20 GHz
```
This PR (https://github.com/leni536/fast_float/commit/6d2fb68f5c928f75642762b235875cfbbd807687):
```
fastfloat : 1487.16 MB/s (+/- 0.3 %) 70.88 Mfloat/s 11.82 i/B 260.07 i/f (+/- 0.0 %) 2.69 c/B 59.24 c/f (+/- 0.1 %) 4.39 i/c 4.20 GHz
```
main branch:
```
fastfloat : 1504.07 MB/s (+/- 0.3 %) 71.69 Mfloat/s 11.64 i/B 256.00 i/f (+/- 0.0 %) 2.66 c/B 58.56 c/f (+/- 0.2 %) 4.37 i/c 4.20 GHz
```
This PR:
```
fastfloat : 1483.64 MB/s (+/- 0.8 %) 70.71 Mfloat/s 11.73 i/B 258.00 i/f (+/- 0.0 %) 2.70 c/B 59.35 c/f (+/- 0.2 %) 4.35 i/c 4.20 GHz
```
Compiler: gcc-12 (Debian 12.2.0-13) 12.2.0. CPU: AMD Ryzen 5 3600.
@leni536 Do you understand why your PR adds two instructions per float parsed under GCC/x64? On 64-bit ARM, the instruction count is the same, or one less.
The following https://godbolt.org/z/ceYK4crde suggests that you would be saving instructions, not adding instructions.
> @leni536 Do you understand why your PR adds two instructions per float parsed under GCC/x64? On 64-bit ARM, the instruction count is the same, or one less.
> The following https://godbolt.org/z/ceYK4crde suggests that you would be saving instructions, not adding instructions.
I don't have any idea. I might look into the benchmark binaries to track this down.
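For anyone reproducing the comparison, the kind of sign-bit variants one might diff in a compiler explorer could be sketched like this (hypothetical code, not the actual library source):

```cpp
#include <cstdint>

// Branchy variant: the compiler may emit a conditional jump or a cmov.
uint64_t sign_with_branch(uint64_t word, bool negative) {
    if (negative) {
        word |= uint64_t(1) << 63;
    }
    return word;
}

// Ternary/arithmetic variant: typically compiles to a shift plus an or,
// with no branch at all.
uint64_t sign_with_ternary(uint64_t word, bool negative) {
    return word | (uint64_t(negative) << 63);
}
```

The two functions are semantically identical; comparing their assembly side by side, before and after inlining into a larger caller, is one way to see where an extra instruction could sneak in.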
I don't fully trust AMD performance counters (some of them are known to be inaccurate), so I reran the benchmarks on an Intel Pentium Silver N5030. It reproduces the same i/f metric, so the count appears to be correct.
We will be merging this for sure, because there is a gain, but it would be wise to figure out where the extra instructions come from before we do. It seems possible to improve this PR further if we have a better understanding of what the compiler does.
Here is a comparison of an instantiated `from_chars<double>` in the two versions: https://godbolt.org/z/hq988T4K6

An inlined call to `to_float` starts around line 2216 in the assembler output of the original. It seems that with inlining the PR version only saves one instruction, as opposed to a separately instantiated `to_float`, where it saves two.
It seems that there is a knock-on effect on register allocation in the whole surrounding function, which in turn might also affect the ordering of some blocks of code. Maybe there are a few places where a register needs to be zeroed that did not in the original, adding an instruction, but this is just speculation.

So my current hypothesis is that the PR version does not save much in terms of executed instruction count, and that after inlining, register allocation and perhaps some optimization heuristics happen to work slightly worse, adding instructions back in code unrelated to this change.
Fantastic. Ok. So I am merging this, because it appears to produce slightly faster code and is nicer to read.
It will be part of the next release.
Codegen demonstration:

gcc 12: https://godbolt.org/z/ceYK4crde
clang 15: https://godbolt.org/z/WTcqn8xGv
MSVC 19.14: https://godbolt.org/z/P3d8dv9Eq