I should just have done this with the previous PR.
The function to calculate a widening multiply for `u64` using 32-bit multiplies is now a macro, so it can also be used for `u128` with 64-bit multiplies. It is not yet optimal on 32-bit architectures, but much better than what we had.
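For readers unfamiliar with the technique, here is a minimal sketch of a widening multiply built from half-width pieces and generated by a macro. It is not the code from this PR: the names `wmul_via_halves`, `wmul_u64` and `wmul_u128` are hypothetical, and the sketch assumes the usual schoolbook approach of splitting each operand into two halves and summing four partial products.

```rust
/// Hypothetical macro (not the one in this PR): generates a function that
/// multiplies two `$ty` values and returns the full double-width product
/// as a `(high, low)` pair. `$half_bits` is half the bit width of `$ty`.
macro_rules! wmul_via_halves {
    ($name:ident, $ty:ty, $half_bits:expr) => {
        fn $name(a: $ty, b: $ty) -> ($ty, $ty) {
            // Mask selecting the lower half of a `$ty` value.
            const LOWER_MASK: $ty = !0 >> $half_bits;

            // Split both operands into half-width pieces.
            let a_lo = a & LOWER_MASK;
            let a_hi = a >> $half_bits;
            let b_lo = b & LOWER_MASK;
            let b_hi = b >> $half_bits;

            // Four partial products. Each factor is below 2^half_bits,
            // so none of these multiplications can overflow `$ty`.
            let lo_lo = a_lo * b_lo;
            let hi_lo = a_hi * b_lo;
            let lo_hi = a_lo * b_hi;
            let hi_hi = a_hi * b_hi;

            // Middle column: carry out of lo_lo plus the low halves of the
            // two cross products. At most three half-width values are
            // summed, so this cannot overflow either.
            let mid = (lo_lo >> $half_bits)
                + (hi_lo & LOWER_MASK)
                + (lo_hi & LOWER_MASK);

            // The shift discards the carry bits of `mid`, which are added
            // into `high` below.
            let low = (lo_lo & LOWER_MASK) | (mid << $half_bits);
            let high = hi_hi
                + (hi_lo >> $half_bits)
                + (lo_hi >> $half_bits)
                + (mid >> $half_bits);
            (high, low)
        }
    };
}

// One definition, two instantiations: u64 from 32-bit pieces and
// u128 from 64-bit pieces.
wmul_via_halves!(wmul_u64, u64, 32);
wmul_via_halves!(wmul_u128, u128, 64);
```

In this sketch, `wmul_u64(a, b)` returns the upper and lower 64 bits of the 128-bit product, and `wmul_u128(a, b)` does the same for the 256-bit product. The multiplications are written in the full-width type but only ever operate on half-width values, which is what lets a compiler lower them to narrower multiply instructions.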
Benchmarks before:
After: