[proposal] fast path for general integer division for x64

MaxGraey commented 3 years ago

Feature

Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:

idiv(a: i64, b: i64) -> i64 {
  if ((a | b) >> 32) return a / b; // full 64-bit division
  return i32(a) / i32(b) as i64; // 32-bit division
}

godbolt: https://godbolt.org/z/Tqqzs1

Is it make sense apply same optimization for cranelift only for x64 architecture?

Benefit

it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:

But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register's width. Also it doesn't need for ARM.

cfallin commented 3 years ago

Hi @MaxGraey, thanks for bringing this up!

From an instruction-selector perspective, this would certainly be possible, though in general I want to be careful about adding complexity without evidence that it will bring benefit in some way. (It's perhaps possible for one case, but if the isel starts growing conditionals for microarchitecture and for optimized cases everywhere, it soon becomes unmaintainable if we don't have a framework for reasoning about machine costs in a more principled way.) Have you seen workloads where division latency in particular became a bottleneck?

MaxGraey commented 3 years ago

Have you seen workloads where division latency in particular became a bottleneck?

Div / Rem usually most expensive operations. There are many possibilities to reduce cost of this operations, most often this is done before instruction scheduling. But there is usually very little opportunity to speed up the most general case. In this case, I think this is interesting approach which shouldn't be too complex in term of implementation.

cfallin commented 3 years ago

Indeed, I agree that division is an expensive instruction! What I'm wondering is whether workloads exist where (i) the presence of divide instructions is a bottleneck, i.e., the latency is not hidden by other operations and is a visible portion of total runtime, and (ii) the workload uses 64-bit divides but values always have zeroed top-32-bits.

That data would show the benefit; the cost, in turn is (i) an additional comparison (logical-or, right-shift, branch-if-zero, so three instructions) in every divide's main code path, (ii) additional code size due to the comparison and the fastpath, (iii) complexity in isel, which imposes maintenance burden and a small compile time increase.

So, basically, I have a better handle on what the cost would be and I'm curious if we have any data points on the potential benefit. I'm hesitant to include something like this, because of the above downsides, but evidence of real upside could convince me :-)

MaxGraey commented 3 years ago

Oh, so you ask about real scenarios which could potentially benefit from this?

I guess all range of modern cryptography which based on modular arithmetic with small primes. For example 64-bit integer modulo intensively used in Montgomery reduction for example in modPow (x1 ^ x2 mod m) part. I guess @jedisct1 could tell much more than me. Another potential user case is multi-precision float points and integers which also use modular arithmetic

cfallin commented 3 years ago

Fair enough -- I think that we could investigate further. I'm still worried about the impact of the three additional instructions for every divide; so I'd want us to collect data on various numeric benchmarks (compression and media codecs come to mind) to make sure this impact isn't an issue. The benchmarking situation is looking to improve soon (bytecodealliance/rfcs#4) so we should be able to study this fairly easily!

@MaxGraey would you be willing to prototype it?

MaxGraey commented 3 years ago

historically, this was implemented exclusively for Intel Atom back in 2013, but later it is already used even for generic x64. I think the guys from LLVM have learned pretty well the specifics of this BypassSlowDivision transformation.

The benchmarking situation is looking to improve soon

Nice!

@MaxGraey would you be willing to prototype it?

Thanks for the suggestions, but I think this is a bit tricky and low-level case for me)

cfallin commented 3 years ago

Thanks for the link to the LLVM transform!

Re: overhead, one option would be to add a flag to our CPU-specific backend flags to enable this; then the embedder could enable on platforms where this helps.

I'll see if I can get to this at some point (lots of things on my plate at the moment) but anyone feel free to try this in the meantime!

bjorn3 commented 3 years ago

As an example where division had disproportionate effects: https://github.com/gimli-rs/gimli/pull/476 made line debuginfo generation twice as fast by avoiding a division in the common case.

bytecodealliance / wasmtime

[proposal] fast path for general integer division for x64 #2439

Feature

Benefit