dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.08k stars 4.7k forks source link

Vector{128,256}<T>.ToScalar suboptimal codegen \ { double } #12733

Open gfoidl opened 5 years ago

gfoidl commented 5 years ago

Vector128<long>.ToScalar() stores the xmm to the stack, then reads r64 from there via a mov.

vmovapd  xmmword ptr [rsp], xmm0
mov      rax, qword ptr [rsp]

Ideally this would use movq (c++ intrinsic: _mm_cvtsi128_si64), so asm becomes:

-   vmovapd  xmmword ptr [rsp], xmm0
-   mov      rax, qword ptr [rsp]
+   movq     rax, xmm0

Vector128<double>.ToScalar() produces expected code (vmovsd) -- no issue there. Same CQ issue for int, and for Vector256<T>. Didn't check other types, than noted here.

category:cq theme:vector-codegen skill-level:intermediate cost:medium

gfoidl commented 5 years ago

/cc: @tannergooding please have a look here

fiigii commented 5 years ago

I remember that ToScalar on integer types are not intrinsic now. You can use Sse2.X64.ConvertToInt64 for Vector128<long>.

gfoidl commented 5 years ago

Thanks.

Sse2.X64.ConvertToInt64

Is better, but still not ideal:

vmovupd  xmm0, xmmword ptr [rsp+08H]
vmovd    rax, xmm0

It is documented as __int64 _mm_cvtsi128_si64 (__m128i a) MOVQ reg/m64, xmm, but JIT didn't emit the movq.

mikedn commented 5 years ago

Is that output from the JIT's own disassembler? It's probably movq but displayed as movd.

gfoidl commented 5 years ago

Looked at the JIT-dump and in VS-dissambly view (both in release with optimization on, tiering disabled).

SharpLab with CoreCLR shows the same.

fiigii commented 5 years ago

Right, vmovd rax, xmm0 is actually vmovq rax, xmm0. movd is an alias of movq on r64.

fiigii commented 5 years ago

SharpLab with CoreCLR shows the same.

BTW, in this link, vzeroupper is generated for Vector128 code. That should not be there, I will take a look.

gfoidl commented 5 years ago

movd is an alias of movq on r64.

👍

So codegen could be

vmovd    rax, xmmword ptr [rsp+08H]

or

vmovq    rax, xmmword ptr [rsp+08H]

There is the extra vmovupd (see https://github.com/dotnet/coreclr/issues/24710#issuecomment-494720476)

I will take a look.

Thanks.

My code to show this ```c# using System; using System.Runtime.CompilerServices; using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86; namespace ConsoleApp1 { class Program { static void Main(string[] args) { Vector128 vec = Vector128.Create((byte)0x42).AsSByte(); long l = ToLong(vec); double d = ToDouble(vec); if (double.IsNaN(d) || l == long.MaxValue) Environment.Exit(1); } [MethodImpl(MethodImplOptions.NoInlining)] private static long ToLong(Vector128 vec) { //return vec.AsInt64().ToScalar(); return Sse2.X64.ConvertToInt64(vec.AsInt64()); } [MethodImpl(MethodImplOptions.NoInlining)] private static double ToDouble(Vector128 vec) { return vec.AsDouble().ToScalar(); } } } ```
dasm for that code ```asm ; Assembly listing for method Program:ToLong(struct):long ; Emitting BLENDED_CODE for X64 CPU with AVX - Unix ; optimized code ; rsp based frame ; partially interruptible ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 1, 1 ) simd16 -> [rsp+0x08] ;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M6413_IG01: C5F877 vzeroupper 6690 nop G_M6413_IG02: C5F910442408 vmovupd xmm0, xmmword ptr [rsp+08H] C4E1F97EC0 vmovd rax, xmm0 G_M6413_IG03: C3 ret ; Total bytes of code 17, prolog size 5 for method Program:ToLong(struct):long ; ============================================================ ; Assembly listing for method Program:ToDouble(struct):double ; Emitting BLENDED_CODE for X64 CPU with AVX - Unix ; optimized code ; rsp based frame ; partially interruptible ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 1, 1 ) simd16 -> [rsp+0x08] ;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace" ; ; Lcl frame size = 0 G_M24050_IG01: C5F877 vzeroupper 6690 nop G_M24050_IG02: C5FB10442408 vmovsd xmm0, xmmword ptr [rsp+08H] G_M24050_IG03: C3 ret ; Total bytes of code 12, prolog size 5 for method Program:ToDouble(struct):double ; ============================================================ ```
gfoidl commented 5 years ago

vzeroupper is generated for Vector128 code. That should not be there,

Isn't this needed for VEX? No matter if Vector128 or Vector256.

fiigii commented 5 years ago

It is a bit complex, please see https://github.com/dotnet/coreclr/issues/21062. But I am sure that codegen has something wrong.

AndyAyersMS commented 5 years ago

Marking as future; if there's something surgical we can fix, or there's a bug, we can move to 3.0.

omariom commented 5 years ago

It might be related. xmm0 is spilled to the stack.

private static long AsLong(double dbl)
{
    return *(long*)&dbl;
} 
gfoidl commented 4 years ago

@omariom for reference: this is tracked by https://github.com/dotnet/runtime/issues/11413 (thx @EgorBo for the remainder).

hypeartist commented 4 years ago

It might be related. xmm0 is spilled to the stack.

@omariom What about this?

unsafe class C 
{
    private static long AsLong(in double dbl)
    {
        return *(long*)Unsafe.AsPointer(ref Unsafe.AsRef(dbl));
    } 
}

Asm output:

C.AsLong(Double ByRef)
    L0000: mov rax, [rcx]
    L0003: ret

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNbrmwAzGE1IMhDGgG8aDKwzFReC7BgPMkDSRC4BzBgEFcAGQ9PAApeLgYAEwgOYGlI2IBKS2sLamt0pgB2BgAqYPcvHISAVS4dfRY/AAUIMKcoYNhdBlLywT82GF1giMSEjTTrAF9TaiGgA

gfoidl commented 4 years ago

@hypeartist this uses also the stack [rcx] and doesn't operate with registers solely.