Open MaxGraey opened 3 years ago
Hi @MaxGraey, thanks for bringing this up!
From an instruction-selector perspective, this would certainly be possible, though in general I want to be careful about adding complexity without evidence that it will bring benefit in some way. (It's perhaps possible for one case, but if the isel starts growing conditionals for microarchitecture and for optimized cases everywhere, it soon becomes unmaintainable if we don't have a framework for reasoning about machine costs in a more principled way.) Have you seen workloads where division latency in particular became a bottleneck?
Have you seen workloads where division latency in particular became a bottleneck?
Div / Rem usually most expensive operations. There are many possibilities to reduce cost of this operations, most often this is done before instruction scheduling. But there is usually very little opportunity to speed up the most general case. In this case, I think this is interesting approach which shouldn't be too complex in term of implementation.
Indeed, I agree that division is an expensive instruction! What I'm wondering is whether workloads exist where (i) the presence of divide instructions is a bottleneck, i.e., the latency is not hidden by other operations and is a visible portion of total runtime, and (ii) the workload uses 64-bit divides but values always have zeroed top-32-bits.
That data would show the benefit; the cost, in turn is (i) an additional comparison (logical-or, right-shift, branch-if-zero, so three instructions) in every divide's main code path, (ii) additional code size due to the comparison and the fastpath, (iii) complexity in isel, which imposes maintenance burden and a small compile time increase.
So, basically, I have a better handle on what the cost would be and I'm curious if we have any data points on the potential benefit. I'm hesitant to include something like this, because of the above downsides, but evidence of real upside could convince me :-)
Oh, so you ask about real scenarios which could potentially benefit from this?
I guess all range of modern cryptography which based on modular arithmetic with small primes. For example 64-bit integer modulo intensively used in Montgomery reduction for example in modPow
(x1 ^ x2 mod m) part. I guess @jedisct1 could tell much more than me. Another potential user case is multi-precision float points and integers which also use modular arithmetic
Fair enough -- I think that we could investigate further. I'm still worried about the impact of the three additional instructions for every divide; so I'd want us to collect data on various numeric benchmarks (compression and media codecs come to mind) to make sure this impact isn't an issue. The benchmarking situation is looking to improve soon (bytecodealliance/rfcs#4) so we should be able to study this fairly easily!
@MaxGraey would you be willing to prototype it?
historically, this was implemented exclusively for Intel Atom back in 2013, but later it is already used even for generic x64. I think the guys from LLVM have learned pretty well the specifics of this BypassSlowDivision transformation.
The benchmarking situation is looking to improve soon
Nice!
@MaxGraey would you be willing to prototype it?
Thanks for the suggestions, but I think this is a bit tricky and low-level case for me)
Thanks for the link to the LLVM transform!
Re: overhead, one option would be to add a flag to our CPU-specific backend flags to enable this; then the embedder could enable on platforms where this helps.
I'll see if I can get to this at some point (lots of things on my plate at the moment) but anyone feel free to try this in the meantime!
As an example where division had disproportionate effects: https://github.com/gimli-rs/gimli/pull/476 made line debuginfo generation twice as fast by avoiding a division in the common case.
Feature
Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:
godbolt: https://godbolt.org/z/Tqqzs1
Is it make sense apply same optimization for cranelift only for x64 architecture?
Benefit
it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register's width. Also it doesn't need for ARM.