dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Unnecessary sign extension in `Vector512.Create(sbyte)` and `Vector512.Create(short)` #108820

Open MineCake147E opened 1 week ago

MineCake147E commented 1 week ago

Description

When I was playing around with Vector512<sbyte>, I noticed that Vector512.Create(sbyte) first sign-extends the value, then broadcasts it to the whole vector register.

```csharp
static Vector512<sbyte> Create1(ulong value) =>
    Vector512.Create((sbyte)value);
```

```assembly
Program:Create1(ulong):System.Runtime.Intrinsics.Vector512`1[byte] (FullOpts):
       movsx    rsi, sil
       vpbroadcastb zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper
       ret
```

Here, the sign extension is unnecessary, as the upper 56 bits of rsi are completely ignored by the broadcast anyway.

A similar thing happens in Vector512.Create(short) as well.

```csharp
static Vector512<short> Create2(ulong value) =>
    Vector512.Create((short)value);
```

```assembly
Program:Create2(ulong):System.Runtime.Intrinsics.Vector512`1[short] (FullOpts):
       movsx    rsi, si
       vpbroadcastw zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper
       ret
```

According to uops.info, on recent Intel CPUs, from the Golden Cove and Gracemont microarchitectures onward, the movzx instruction is handled by register renaming, so it does not consume an execution cycle. The same is true on earlier microarchitectures when the destination register differs from the source register. This is significantly faster than movsx, which takes one clock cycle in every case. In addition, movsx can execute on fewer execution ports than movzx can.
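For reference, the two instructions are easy to compare in isolation. A minimal sketch (hypothetical method names, not part of this issue's repro):

```csharp
// Minimal sketch: on x64, the zero-extending cast compiles to a single
// movzx, and the sign-extending cast to a single movsx, which makes the
// two instructions easy to inspect side by side.
static ulong ZeroExtend(ulong value) => (byte)value;
static long SignExtend(ulong value) => (sbyte)value;
```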

Possible optimization:

A known workaround for this problem is to broadcast using the corresponding unsigned type first, then reinterpret the vector register as one with signed elements.

```csharp
static Vector512<sbyte> Create3(ulong value) =>
    Vector512.Create((byte)value).AsSByte();

static Vector512<short> Create4(ulong value) =>
    Vector512.Create((ushort)value).AsInt16();
```

```assembly
Program:Create3(ulong):System.Runtime.Intrinsics.Vector512`1[byte] (FullOpts):
       movzx    rsi, sil
       vpbroadcastb zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper
       ret
```

```assembly
Program:Create4(ulong):System.Runtime.Intrinsics.Vector512`1[short] (FullOpts):
       movzx    rsi, si
       vpbroadcastw zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper
       ret
```
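Until this is handled in the JIT, the workaround composes naturally with ordinary SIMD code. A hypothetical kernel (illustrative only, names and shape are not from this issue) that hoists the splat out of a loop:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class Example
{
    // Hypothetical kernel: splat the needle via the unsigned element type and
    // reinterpret, so the scalar prologue uses movzx rather than movsx.
    static int CountMatches(ReadOnlySpan<sbyte> data, sbyte needle)
    {
        Vector512<sbyte> splat = Vector512.Create((byte)needle).AsSByte();
        int count = 0;
        int i = 0;
        ref sbyte start = ref MemoryMarshal.GetReference(data);
        for (; i + Vector512<sbyte>.Count <= data.Length; i += Vector512<sbyte>.Count)
        {
            Vector512<sbyte> v = Vector512.LoadUnsafe(ref start, (nuint)i);
            // Each set mask bit marks a matching lane; count them all at once.
            count += BitOperations.PopCount(Vector512.Equals(v, splat).ExtractMostSignificantBits());
        }
        for (; i < data.Length; i++)
        {
            if (data[i] == needle) count++;
        }
        return count;
    }
}
```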

Configuration

Compiler Explorer

Regression?

Unknown

Data

uops.info

Analysis

dotnet-policy-service[bot] commented 1 week ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

tannergooding commented 6 days ago

This likely needs some non-trivial perf measurement to ensure that it doesn't introduce new dependency chains, particularly when the underlying byte was built up by a prior byte-wide instruction (which causes a merge operation into the larger 32/64-bit register).

movzx/movsx is often desirable explicitly to break such dependencies and is unlikely to be the performance bottleneck in practice. On many pieces of hardware, movsx is likewise simply a register rename operation.


Fewer bytes and/or fewer instructions do not always mean faster. We need to ensure it's measured and considered in the context of the entire application, not just theoretical cycle counts.
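As an illustration of the kind of direct measurement meant here, a minimal BenchmarkDotNet sketch (hypothetical names; surrounding dependency chains would also need to be varied) might look like:

```csharp
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;

// Hypothetical microbenchmark: current codegen (movsx + vpbroadcastb) vs. the
// workaround (movzx + vpbroadcastb). A field is used so the JIT cannot fold
// the broadcast into a constant.
public class BroadcastBenchmarks
{
    private ulong _value = 0x2A;

    [Benchmark(Baseline = true)]
    public Vector512<sbyte> SignExtended() => Vector512.Create((sbyte)_value);

    [Benchmark]
    public Vector512<sbyte> ZeroExtended() => Vector512.Create((byte)_value).AsSByte();
}
```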

MineCake147E commented 6 days ago

> On many pieces of hardware, movsx is likewise simply a register rename operation.

My measurement says otherwise, at least on the Golden Cove microarchitecture.

Benchmark Code

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;

namespace BenchmarkPlayground
{
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    public class SignExtensionBenchmarks
    {
        [GlobalSetup]
        public void Setup()
        {
            Span<nuint> k = [0, 0, 0];
            RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(k));
            l0 = k[0];
            l1 = k[1] | 1;
            l2 = k[2];
        }

        nuint l0, l1 = 1, l2;

        [SkipLocalsInit]
        [Benchmark(Baseline = true, OperationsPerInvoke = 1 << 20)]
        public nuint IntegerAddLatency()
        {
            var v0 = l0;
            var v1 = l1;
            var v2 = l2;
            for (int i = 0; i < 1 << 20; i += 16)
            {
                v0 += v1; v0 += v1; v0 += v1; v0 += v1;
                v0 += v1; v0 += v1; v0 += v1; v0 += v1;
                v0 += v1; v0 += v1; v0 += v1; v0 += v1;
                v0 += v1; v0 += v1; v0 += v1; v0 += v1;
                v1 += v2; // Make sure that RyuJIT doesn't optimize all additions above away.
            }
            return v0;
        }

        [SkipLocalsInit]
        [Benchmark(OperationsPerInvoke = 1 << 20)]
        public ulong SignExtendLatency()
        {
            var v0 = (nint)l0;
            var v1 = (nint)l1;
            var v2 = (nint)l2;
            sbyte h;
            for (int i = 0; i < 1 << 20; i += 16)
            {
                h = (sbyte)v0; v1 += h; h = (sbyte)v1; v1 += h;
                h = (sbyte)v1; v1 += h; h = (sbyte)v1; v1 += h;
                h = (sbyte)v1; v1 += h; h = (sbyte)v1; v1 += h;
                h = (sbyte)v1; v1 += h; h = (sbyte)v1; v1 += h;
                h = (sbyte)v1; v1 += h; h = (sbyte)v1; v1 += h;
                h = (sbyte)v1; v1 += h; h = (sbyte)v1; v1 += h;
                h = (sbyte)v1; v1 += h; h = (sbyte)v1; v1 += h;
                h = (sbyte)v1; v1 += h; h = (sbyte)v1; v1 += h;
                v0 += v2; // Make sure that RyuJIT doesn't optimize all additions above away.
            }
            return (nuint)v1;
        }

        [SkipLocalsInit]
        [Benchmark(OperationsPerInvoke = 1 << 20)]
        public ulong ZeroExtendLatency()
        {
            var v0 = l0;
            var v1 = l1;
            var v2 = l2;
            byte h;
            for (int i = 0; i < 1 << 20; i += 16)
            {
                h = (byte)v0; v1 += h; h = (byte)v1; v1 += h;
                h = (byte)v1; v1 += h; h = (byte)v1; v1 += h;
                h = (byte)v1; v1 += h; h = (byte)v1; v1 += h;
                h = (byte)v1; v1 += h; h = (byte)v1; v1 += h;
                h = (byte)v1; v1 += h; h = (byte)v1; v1 += h;
                h = (byte)v1; v1 += h; h = (byte)v1; v1 += h;
                h = (byte)v1; v1 += h; h = (byte)v1; v1 += h;
                h = (byte)v1; v1 += h; h = (byte)v1; v1 += h;
                v0 += v2; // Make sure that RyuJIT doesn't optimize all additions above away.
            }
            return v1;
        }
    }
}
```

```
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4317/23H2/2023Update/SunValley3)
Intel Xeon w5-2455X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
```
| Method            | Mean      | Error     | StdDev    | Ratio | Code Size |
|------------------ |----------:|----------:|----------:|------:|----------:|
| IntegerAddLatency | 0.2531 ns | 0.0007 ns | 0.0005 ns |  1.00 |      81 B |
| SignExtendLatency | 0.4914 ns | 0.0011 ns | 0.0009 ns |  1.94 |     148 B |
| ZeroExtendLatency | 0.2535 ns | 0.0008 ns | 0.0007 ns |  1.00 |     148 B |
Benchmark Disassembly

## .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

```assembly
; BenchmarkPlayground.SignExtensionBenchmarks.IntegerAddLatency()
       mov rax,[rcx+8]
       mov rdx,[rcx+10]
       mov rcx,[rcx+18]
       xor r8d,r8d
       nop
M00_L00:
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rax,rdx
       add rdx,rcx
       add r8d,10
       cmp r8d,100000
       jl short M00_L00
       ret
; Total bytes of code 81
```

## .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

```assembly
; BenchmarkPlayground.SignExtensionBenchmarks.SignExtendLatency()
       mov rax,[rcx+8]
       mov rdx,[rcx+10]
       mov rcx,[rcx+18]
       xor r8d,r8d
       nop
M00_L00:
       movsx r10,al
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       movsx r10,dl
       add rdx,r10
       add rax,rcx
       add r8d,10
       cmp r8d,100000
       jl short M00_L00
       mov rax,rdx
       ret
; Total bytes of code 148
```

## .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

```assembly
; BenchmarkPlayground.SignExtensionBenchmarks.ZeroExtendLatency()
       mov rax,[rcx+8]
       mov rdx,[rcx+10]
       mov rcx,[rcx+18]
       xor r8d,r8d
       nop
M00_L00:
       movzx r10d,al
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       movzx r10d,dl
       add rdx,r10
       add rax,rcx
       add r8d,10
       cmp r8d,100000
       jl short M00_L00
       mov rax,rdx
       ret
; Total bytes of code 148
```

The zero extension seems to be treated as a register renaming operation. In contrast, the sign extension turns out to be significantly slower than the zero extension, meaning it's an actual operation beyond register renaming.


> This likely needs some non-trivial perf measurement to ensure that it doesn't introduce new dependency chains, particularly when the underlying byte was built up by a prior byte-wide instruction (which causes a merge operation into the larger 32/64-bit register).

> movzx/movsx is often desirable explicitly to break such dependencies

This is why I didn't propose to eliminate movzx entirely in all cases. Replacing movsx with movzx solves the aforementioned latency issue on Intel CPUs without changing the result of Vector512.Create(sbyte) or Vector512.Create(short), and without introducing a new dependency problem.
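The bit-for-bit equivalence of the two shapes is easy to sanity-check. A hypothetical one-off test (not from this thread):

```csharp
using System;
using System.Runtime.Intrinsics;

// Every element sees only the low 8 bits of the scalar, so broadcasting the
// zero-extended byte and reinterpreting gives the same bits as broadcasting
// the sign-extended sbyte, for all 256 possible byte values.
for (ulong v = 0; v <= byte.MaxValue; v++)
{
    Vector512<sbyte> viaSignExtend = Vector512.Create((sbyte)v);
    Vector512<sbyte> viaZeroExtend = Vector512.Create((byte)v).AsSByte();
    if (viaSignExtend != viaZeroExtend)
        throw new InvalidOperationException($"Mismatch at {v}");
}
```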

tannergooding commented 5 days ago

> My measurement says otherwise, at least on the Golden Cove microarchitecture.

I did say "many", not "all".

These types of changes are getting close to microarchitecture-specific tuning and are rarely meaningful to real-world perf in an end-to-end application scenario. You will conversely find, as an example, that movzx is not free and itself has an actual cost on many CPUs, particularly several CPUs released in the last 10 years (Skylake is one notable example), while AMD Zen processors, as an example, do have zero-cost movzx and movsx.

You will also find that microbenchmarks like the above are not representative of actual instruction latency; rather, they may only be representative of particular scenarios. The official architecture optimization manuals, as well as various other well-known in-depth guides (such as Agner Fog's), go into more detail and cover additional considerations like decoder delays, dependency chains, instruction fusion, and many other cases where the use of movzx and movsx improves performance and/or allows the CPU to produce better internal microcode.


In cases where removing movzx/movsx allows the broadcast to take the memory operand directly, removing the sign extension should be beneficial across all hardware. In other cases, however, it is very much going to be a hit-or-miss scenario, and it is better/safer (and follows the more general official guidance across all microarchitectures) to preserve the designated and appropriate extension prior to consuming the full register width. This may be elidable in a subset of scenarios where we know the register was not previously partially mutated, but that would likely be a minor peephole optimization and not something we'd want to broadly model directly in IR.
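For clarity, a hypothetical shape for the memory-operand case mentioned above, where ideal codegen would be a single vpbroadcastb with a byte memory operand and no scalar extension at all:

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical example: the broadcast source comes straight from memory, so
// an ideal codegen is `vpbroadcastb zmm0, byte ptr [...]` with no movzx/movsx,
// since the instruction itself reads exactly one byte.
static Vector512<sbyte> BroadcastFirst(ReadOnlySpan<sbyte> data) =>
    Vector512.Create(data[0]);
```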