dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Suboptimal x64 codegen for signed Math.BigMul #75594

Open tevador opened 2 years ago

tevador commented 2 years ago

Description

The JIT generates suboptimal x64 code for `Math.BigMul(long, long, out long)`.

Configuration

* .NET 6
* Windows 10
* AMD Ryzen CPU (x64)

Data

Source code:

```csharp
[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul2(ref ulong x, ref ulong y)
{
    x = Math.BigMul(x, y, out y);
}

[MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.AggressiveOptimization)]
public static void TestBigMul1(ref long x, ref long y)
{
    x = Math.BigMul(x, y, out y);
}
```

Results in the following machine code:

```asm
TestBigMul2(ref ulong, ref ulong):
 push        rax  
 mov         rax,qword ptr [rcx]  
 mov         qword ptr [rsp+18h],rdx  
 mov         r8,qword ptr [rdx]  
 lea         r9,[rsp]  
 mov         rdx,rax  
 mulx        rax,r10,r8  
 mov         qword ptr [r9],r10  
 mov         rdx,qword ptr [rsp]  
 mov         r8,qword ptr [rsp+18h]  
 mov         qword ptr [r8],rdx  
 mov         qword ptr [rcx],rax  
 add         rsp,8  
 ret  

TestBigMul1(ref long, ref long):
 push        rax  
 mov         rax,qword ptr [rcx]  
 mov         qword ptr [rsp+18h],rdx  
 mov         r8,qword ptr [rdx]  
 lea         r9,[rsp]  
 mov         rdx,rax  
 mulx        rdx,r10,r8  
 mov         qword ptr [r9],r10  
 mov         r9,qword ptr [rsp]  
 mov         r10,qword ptr [rsp+18h]  
 mov         qword ptr [r10],r9  
 mov         r9,rax  
 sar         r9,3Fh  
 and         r9,r8  
 sub         rdx,r9  
 sar         r8,3Fh  
 and         rax,r8  
 sub         rdx,rax  
 mov         qword ptr [rcx],rdx  
 add         rsp,8  
 ret
```

Analysis

The unsigned overload uses a single `mulx` instruction, as expected.

The signed overload also uses `mulx`, plus six additional instructions (two `sar`/`and`/`sub` sequences) to adjust the upper half of the result. This increases the latency from 4 cycles to at least 8 cycles in fully inlined code. This is completely unnecessary, as the x64 architecture has a dedicated instruction for signed widening multiplication: [the one-operand imul](https://www.felixcloutier.com/x86/imul). The whole sequence of `mulx, sar, and, sub, sar, and, sub` could thus be replaced by a single `imul` instruction.
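
For illustration, the signed overload could in principle compile down to something like this (a hand-written sketch assuming the same Windows x64 calling convention as the listings above, not actual compiler output):

```asm
TestBigMul1(ref long, ref long):
 mov   rax,qword ptr [rcx]   ; rax = x
 mov   r8,rdx                ; save ref y (one-operand imul clobbers rdx)
 imul  qword ptr [r8]        ; rdx:rax = x * y (signed 128-bit product)
 mov   qword ptr [r8],rax    ; y = low 64 bits
 mov   qword ptr [rcx],rdx   ; x = high 64 bits
 ret
```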

Also in this particular case, both methods use the stack unnecessarily, but that's probably a separate issue.

category:cq theme:floating-point skill-level:intermediate cost:medium impact:small

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

Author: tevador
Assignees: -
Labels: `tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone: -

JulieLeeMSFT commented 2 years ago

cc @dotnet/jit-contrib @tannergooding.

tannergooding commented 2 years ago

This is a similar issue to https://github.com/dotnet/runtime/issues/5213, https://github.com/dotnet/runtime/issues/27292, https://github.com/dotnet/runtime/issues/68207, https://github.com/dotnet/runtime/issues/58263, etc.

The root issue is that the JIT doesn't have any code to optimize/special-case the widening-multiply scenarios (32 × 32 → 64 or 64 × 64 → 128). This is partially because they require specialized logic, since each represents an instruction with "multiple register returns".
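
In C# terms, these are the patterns in question (illustrative snippets; the helper names are hypothetical):

```csharp
// 32 x 32 -> 64: maps to a single widening multiply in hardware
static long MulWide32(int a, int b) => (long)a * b;

// 64 x 64 -> 128: one mul/imul produces both halves in RDX:RAX
static ulong MulHigh64(ulong a, ulong b) => Math.BigMul(a, b, out _);
```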

Carol Eidt added some support for multiple register returns a while back, but it hasn't been picked up in many places in the JIT yet. I think @kunalspathak and @EgorBo are currently the two with the most context as to what would be required to make this work here.

kunalspathak commented 2 years ago

Taking a quick look, we do have `imul` instructions that are generated for `GT_MUL`, but it seems that for the `BigMul` case we end up using the hardware-intrinsics variant (`Bmi2.X64.MultiplyNoFlags`) and generate `mulx`.

https://github.com/dotnet/runtime/blob/522f8088745144f251ab4c9a5ef999d4f5fce8fc/src/libraries/System.Private.CoreLib/src/System/Math.cs#L177-L183
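
For context, the unsigned overload at that link is shaped roughly like this (a paraphrase, not the exact source; requires `using System.Runtime.Intrinsics.X86;`):

```csharp
public static unsafe ulong BigMul(ulong a, ulong b, out ulong low)
{
    if (Bmi2.X64.IsSupported)
    {
        ulong tmp;
        ulong high = Bmi2.X64.MultiplyNoFlags(a, b, &tmp); // JIT emits mulx here
        low = tmp;
        return high;
    }

    return SoftwareFallback(a, b, out low); // portable path, sketched further below
}
```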

> The whole sequence of `mulx, sar, and, sub, sar, and, sub` could thus be replaced by a single `imul` instruction.

Yeah, currently it is getting generated because that's how we implemented it.

https://github.com/dotnet/runtime/blob/522f8088745144f251ab4c9a5ef999d4f5fce8fc/src/libraries/System.Private.CoreLib/src/System/Math.cs#L229-L231
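
The linked signed overload is roughly the following (paraphrased); the sign-correction terms are exactly what show up as the `sar`/`and`/`sub` pairs in the disassembly above:

```csharp
public static long BigMul(long a, long b, out long low)
{
    ulong high = BigMul((ulong)a, (ulong)b, out ulong ulow);
    low = (long)ulow;
    // Adjust the unsigned high half to the signed high half:
    // subtract b if a < 0 and subtract a if b < 0.
    return (long)high - ((a >> 63) & b) - ((b >> 63) & a);
}
```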

tannergooding commented 2 years ago

> Taking a quick look, we do have `imul` instructions that are generated for `GT_MUL`

Right, but much like with `div` and `idiv`, we only handle "one register" of the output. We don't have the handling required to get both EAX and EDX out and put them in the respective return slots for the result.

This is why we do the `mulx` approach instead, as that's the "next closest thing". Ideally we'd mark `Math.BigMul` as intrinsic and have the JIT treat it like we do for `mulx`, but emit `imul` instead so that it computes the correct sign for the upper half.

SingleAccretion commented 2 years ago

I think this would be straightforward to fix by implementing #58263, which should be very simple after #66551 is merged.

tevador commented 2 years ago

So the main issue is the lack of x86 intrinsics for widening multiplication apart from `Bmi2.X64.MultiplyNoFlags`, which is not part of the base x64 instruction set. Older x64 CPUs such as Intel Ivy Bridge get the software fallback even though they are capable of performing the calculation with one instruction (`mul` or `imul`).
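
For reference, that software fallback computes the 128-bit product from 32-bit halves, roughly like this (a sketch of the portable path, paraphrased rather than copied from Math.cs):

```csharp
static ulong SoftwareFallback(ulong a, ulong b, out ulong low)
{
    // Split each operand into 32-bit halves and combine the four
    // partial products (schoolbook multiplication).
    uint al = (uint)a, bl = (uint)b;
    ulong ah = a >> 32, bh = b >> 32;

    ulong mull = (ulong)al * bl;           // low x low
    ulong t    = ah * bl + (mull >> 32);   // high x low + carry
    ulong tl   = (ulong)al * bh + (uint)t; // low x high + low half of t

    low = (tl << 32) | (uint)mull;
    return ah * bh + (t >> 32) + (tl >> 32);
}
```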

EgorBo commented 1 year ago

A proper fix for this problem will speed up the XxHash3 benchmarks on x64: https://github.com/dotnet/runtime/pull/76641