tevador opened this issue 2 years ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.
Author: | tevador
---|---
Assignees: | -
Labels: | `tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone: | -
cc @dotnet/jit-contrib @tannergooding.
This is a similar issue to https://github.com/dotnet/runtime/issues/5213, https://github.com/dotnet/runtime/issues/27292, https://github.com/dotnet/runtime/issues/68207, https://github.com/dotnet/runtime/issues/58263, etc.
The root issue is that the JIT doesn't have any code to optimize/special-case `32 * 32 = 64` or `64 * 64 = 128` multiplication scenarios. This is partially because it requires specialized logic, since it represents an instruction with "multiple register returns".
Carol Eidt added some support for multiple register returns a while back, but it's not been picked up in many places in the JIT yet. I think @kunalspathak and @EgorBo are currently the two with the most context as to what would be required to make this work here.
Taking a quick look, we do have `imul` instructions that are generated for `GT_MUL`, but it seems that for the `BigMul` case, we end up using the hardware-intrinsics variant, `MultiplyNoFlags`, and generate `mulx`.
> The whole sequence of `mulx, sar, and, sub, sar, and, sub` could thus be replaced by a single `imul` instruction.
Yeah, currently it is getting generated because that's how we implemented it.
> Taking a quick look, we do have `imul` instructions that are generated for `GT_MUL`
Right, but much like with `div` and `idiv`, we only handle "one register" of the output. We don't have the handling required to get out both `EAX` and `EDX` and put them in the respective return slots for the result.
This is why we do the `mulx` approach instead, as that's the "next closest thing". Ideally we'd mark `Math.BigMul` as intrinsic and have the JIT treat it like we do for `mulx`, but emitting `imul` instead so it computes the correct sign for `upper`.
I think this would be straightforward to fix by implementing #58263, which should be very simple after #66551 is merged.
So the main issue is the lack of x86 intrinsics for wide multiplication except for `Bmi2.X64.MultiplyNoFlags`, which is not part of the base x64 instruction set. Older x64 CPUs such as Intel Ivy Bridge will get the software fallback even though they are capable of performing the calculation with one instruction (`mul` or `imul`).
A proper fix for this problem will speed up the XxHash3 benchmarks on x64: https://github.com/dotnet/runtime/pull/76641
### Description

The JIT generates suboptimal x64 code for `Math.BigMul(long, long, out long)`.

### Configuration

### Data

Source code:

Results in the following machine code:

### Analysis
The unsigned overload uses a single `mulx` instruction, as expected. The signed overload also uses `mulx`, with 6 additional instructions (2x `sar, and, sub`) to adjust the upper half of the result. This increases the latency from 4 cycles to at least 8 cycles in fully inlined code. This is completely unnecessary, as the x64 architecture has a dedicated instruction for signed multiplication: the one-operand `imul`. The whole sequence of `mulx, sar, and, sub, sar, and, sub` could thus be replaced by a single `imul` instruction.

Also in this particular case, both methods use the stack unnecessarily, but that's probably a separate issue.
category:cq theme:floating-point skill-level:intermediate cost:medium impact:small