Open MineCake147E opened 1 week ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
This likely needs some non-trivial perf measurement to ensure that it doesn't introduce new dependency chains, particularly when the underlying byte was built up by a prior byte-wide instruction (which causes a merge operation into the larger 32/64-bit register).
movzx/movsx
is often desirable explicitly to break such dependencies and is unlikely to be the performance bottleneck in practice. On many pieces of hardware, movsx
is likewise simply a register rename operation.
Less bytes and/or less instructions does not always mean faster. We need to ensure its measured and considered in the context of the entire application, not just theoretical cycle counts.
On many pieces of hardware,
movsx
is likewise simply a register rename operation.
My measurement says otherwise, at least in Golden Cove microarchitecture.
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4317/23H2/2023Update/SunValley3)
Intel Xeon w5-2455X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
[Host] : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
DefaultJob : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Method | Mean | Error | StdDev | Ratio | Code Size |
---|---|---|---|---|---|
IntegerAddLatency | 0.2531 ns | 0.0007 ns | 0.0005 ns | 1.00 | 81 B |
SignExtendLatency | 0.4914 ns | 0.0011 ns | 0.0009 ns | 1.94 | 148 B |
ZeroExtendLatency | 0.2535 ns | 0.0008 ns | 0.0007 ns | 1.00 | 148 B |
The zero extension seems to be treated as a register renaming operation. In contrast, the sign extension turns out to be significantly slower than the zero extension, meaning it's an actual operation beyond register renaming.
This likely needs some non-trivial perf measurement to ensure that it doesn't introduce new dependency chains, particularly when the underlying byte was built up by a prior byte-wide instruction (which causes a merge operation into the larger 32/64-bit register).
movzx/movsx
is often desirable explicitly to break such dependencies
This is why I didn't propose to eliminate movzx
entirely in all cases.
And replacing movsx
with movzx
solves the aforementioned latency issues on Intel CPUs without changing the result of Vector512.Create(sbyte)
and Vector512.Create(short)
, and without introducing a new dependency problem.
My measurement says otherwise, at least in Golden Cove microarchitecture.
I did say "many" not "all".
These types of changes are beginning to get near microarchitecture specific levels and are rarely meaningful to real world perf for an end to end application scenario. You will inversely find, as an example, that movzx
is not free and itself has actual cost on many CPUs, particularly several CPUs released in the last 10 years (Skylake is one notable example). While you will find that AMD Zen processors, as an example, do have zero cost movzx
and movsx
.
You will also find that microbenches like the above are not representative of actual instruction latency. Rather, they may only be representative in particular scenarios. The official architecture optimization manuals, as well as various other well known in depth guide's (such as by Agner Fog), go into more detail and cover additional considerations like decoder delays, dependency chains, instruction fusion, and many other considerations where the use of movzx
and movsx
improve performance and/or allow the CPU to produce better internal microcode.
In cases where removing movzx
/movsx
allows a broadcast to contain the memory operand directly, removing the sign extension should be beneficial across all hardware. However, in other cases it is very much going to be a hit or miss scenario and it is better/safer (and follows the more general official guidance across all microarchitectures) to preserve the designated and appropriate extension prior to consuming the full register width. This may be elidable in a subset of scenarios where we know the register was not previously a partial mutation, but that would likely be a minor peephole optimization and not something we'd want to broadly model in direct IR.
Description
When I was playing around with the
Vector512<sbyte>
, I noticed thatVector512.Create(sbyte)
first sign-extends the value, then broadcasts to the whole vector register.Here, sign-extension is unnecessary, as the upper 56 bits of
rsi
would be completely ignored anyway.A similar thing happens in
Vector512.Create(short)
as well.As of uops.info, in recent Intel CPUs, from Golden Cove and Gracemont microarchitectures onward, the
movzx
instruction would be interpreted as a register rename, so it does not require a clock cycle to execute. The same is true for earlier microarchitectures if the destination register is different from the source register. This is significantly faster thanmovsx
, which takes one clock cycle in any case. Andmovsx
can be executed in fewer execution ports thanmovzx
.Possible optimization:
movzx
instead (Intel CPUs only).movsx rsi, si
→vpbroadcastw zmm0, esi
can bemovzx rax, si
→vpbroadcastw zmm0, eax
.movzx
wouldn't take any longer thanmovsx
anyway.Vector512.Create((sbyte)(rsi >> -8))
would just beshr rsi, 56
→vpbroadcastb zmm0, esi
.A known workaround for this problem is to first use unsigned types to broadcast, then re-interpret the vector register to be one with signed data.
Configuration
Compiler Explorer%3B%0A%0A++++%5BMethodImpl(MethodImplOptions.NoInlining)%5D%0A++++static+Vector512%3Cshort%3E+Create2(ulong+value)+%3D%3E+%0A++++++++Vector512.Create((short)value)%3B%0A%0A++++%0A%7D%0A'),l:'5',n:'1',o:'C%23+source+%231',t:'0')),k:46.82726204465335,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:dotnettrunkcsharp,filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1',verboseDemangling:'0'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:2,lang:csharp,libs:!(),options:'--targetos+Linux+--targetarch+x64+--instruction-set+base,sse,sse2,sse3,ssse3,sse4.1,sse4.2,avx,avx2,aes,bmi,bmi2,fma,lzcnt,pclmul,popcnt,avxvnni,evex,avx512f,avx512f_vl,avx512bw,avx512bw_vl,avx512cd,avx512cd_vl,avx512dq,avx512dq_vl,avx512vbmi,avx512vbmi_vl+--Ot+',overrides:!(),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+.NET+(main)+(Editor+%231)',t:'0')),k:53.17273795534665,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4)
Regression?
Unknown
Data
uops.info
Analysis