Closed by henninglive 7 years ago
Wow I did not expect u128 to be 100 times slower than u64.
I would prefer not to merge this then. The advantage of this crate over std::fmt is that it avoids a small constant amount of overhead from the Formatter machinery. Going from 42ns to 10ns to print an i16 is a big deal, but going from 2530ns to 2500ns won't matter. What do you think?
I agree there's no point in adding this if it's this slow, but I want to do some profiling before closing this. At first glance, I believe the hotspots are `__umodti3()` and `__udivti3()`, which are LLVM's built-in implementations of division and modulus for u128. They are very slow in comparison to native instructions. I noticed we are always calling them with the same dividend and divisor, so we should really be calling `__udivmodti4()` to get both the remainder and quotient in one operation; changing this might give a substantial speedup.
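To make the pattern concrete, here is a hypothetical sketch (not the crate's actual code) of the kind of formatting loop involved: the quotient and remainder are computed from the same dividend and divisor, so in principle they could be fused into a single `__udivmodti4()` call, but LLVM currently emits one `__udivti3()` and one `__umodti3()` call instead.

```rust
// Hypothetical sketch of a decimal-formatting loop peeling four digits per
// iteration; `n / 10_000` and `n % 10_000` share the same operands.
fn u128_to_decimal(mut n: u128) -> String {
    let mut buf = Vec::new();
    while n >= 10_000 {
        let q = n / 10_000;           // lowers to __udivti3
        let r = (n % 10_000) as u32;  // lowers to __umodti3, same operands
        // Push the four digits of this group, least significant first;
        // the whole buffer is reversed at the end.
        for i in 0..4 {
            buf.push(b'0' + ((r / 10u32.pow(i)) % 10) as u8);
        }
        n = q;
    }
    // The remaining value fits in 32 bits; print it without padding.
    let mut m = n as u32;
    loop {
        buf.push(b'0' + (m % 10) as u8);
        m /= 10;
        if m == 0 {
            break;
        }
    }
    buf.reverse();
    String::from_utf8(buf).unwrap()
}
```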
I have done some profiling and optimization. The reason it's so slow is the `__umodti3()` and `__udivti3()` intrinsics. They are very slow compared to the native div instruction. To make matters worse, LLVM currently fails to optimize and combine the two separate calls to `__udivti3()` and `__umodti3()` into a single `__udivmodti4()` call, see rust-lang/rust#44545. Funnily enough, `__udivti3()` and `__umodti3()` are themselves implemented as calls to `__udivmodti4()`, which means we make two consecutive calls to the same function with the same parameters.
We can optimize this by manually linking to `__udivmodti4()`, but i128/u128 unfortunately does not have a stable ABI, see rust-lang/rust#44261. It's possible, but it would be a hack. Alternatively, we can just copy the function into the crate; this would also have the benefit of allowing inlining, which won't happen otherwise.
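As a rough illustration of the copy-into-the-crate option, here is a hypothetical pure-Rust stand-in for `__udivmodti4()` that produces both quotient and remainder in a single pass. It is a naive shift-subtract long division, shown only to illustrate the combined operation; the real compiler-rt routine is considerably more optimized.

```rust
// Hypothetical stand-in for __udivmodti4: returns (quotient, remainder)
// from one pass over the bits. Not performance-representative.
fn udivmod(mut n: u128, d: u128) -> (u128, u128) {
    assert!(d != 0, "division by zero");
    if n < d {
        return (0, n);
    }
    // Align the divisor's top bit with the dividend's top bit.
    let shift = d.leading_zeros() - n.leading_zeros();
    let mut div = d << shift;
    let mut q = 0u128;
    for _ in 0..=shift {
        q <<= 1;
        if n >= div {
            n -= div;
            q |= 1;
        }
        div >>= 1;
    }
    (q, n)
}
```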
Here is a benchmark showing the effects of these optimizations:
Unoptimized:
test bench_fmt::bench_u128_max ... bench: 3,046 ns/iter (+/- 183)
test bench_fmt::bench_u64_max ... bench: 59 ns/iter (+/- 4)
test bench_itoa::bench_u128_max ... bench: 3,009 ns/iter (+/- 99)
test bench_itoa::bench_u64_max ... bench: 24 ns/iter (+/- 0)
Manual linking:
test bench_itoa::bench_u128_max ... bench: 1,537 ns/iter (+/- 59)
Inlined:
test bench_itoa::bench_u128_max ... bench: 1,289 ns/iter (+/- 54)
At this point I think implementing the latter option makes sense; it's over 100% faster than `std::fmt`. I have therefore amended the pull request to include this change.
Nicely done! I filed https://github.com/rust-lang/rust/issues/44583 to provide your faster implementation in std::fmt if possible.
But that puts us back at square one -- once std has this implementation, itoa cannot really be any faster. We could provide the fast implementation in itoa until it reaches the stable compiler, then remove it. But maybe a better option would be to provide this in its own crate for now (in addition to a compiler PR).
I don't think inlining is going to happen in std, but once rust-lang/rust#44545 is fixed, `std::fmt` will be slightly slower than the Manual linking variant above. Inlining will still be 200ns faster than std::fmt. Rustc might be able to automatically inline the builtins at some point, but not in the foreseeable future. If added, this would be much faster for a good while. Also, I think we should have 128-bit support in this crate whenever i128/u128 are stable, just for completeness.
Adds support for 128-bit integers on the nightly compiler, feature-gated behind an `i128` feature flag. I used the same implementation as the other types, but it ended up being as slow as `std::fmt` on my system. I'll look into why later, when I have some time.
Resolves #9