Bad codegen for `Vector256.Multiply(Vector256<double>...)` when AVX is available but AVX2 is not

neon-sunset commented 2 months ago

Description

It appears that some operations, like Vector256.Multiply, produce bad codegen on systems where only AVX is available without full AVX2 support despite the underlying ISA supporting full-width multiplication on lanes of double and float types.

Reproduction Steps

create a sample program from dotnet new console --aot template:


using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

unsafe { // Simulate work var v = stackalloc Vector256[5]; var r = stackalloc Vector256[5]; Test(v, r); }

static unsafe void Test( Vector256 v, Vector256 r ) { for (var k = 0; k < 5; k++) r[k] = Vector256.Multiply(v[k], v[k]); }


2. compile with `dotnet publish -o . -p:IlcInstructionSet=avx`

3. disassemble with Ghidra, Disasmo, etc.

### Expected behavior

`Program:<<Main>$>g__Test|0_0(ulong,ulong)` compiles to
```asm
       xor      eax, eax
       align    [0 bytes for IG03]
G_M000_IG03:                ;; offset=0x0002
       movsxd   r8, eax
       shl      r8, 5
       vmovups  ymm0, ymmword ptr [rcx+r8]
       vmulpd   ymm0, ymm0, ymm0
       vmovups  ymmword ptr [rdx+r8], ymm0
       inc      eax
       cmp      eax, 5
       jl       SHORT G_M000_IG03
       vzeroupper 
       ret

Actual behavior

Program:<<Main>$>g__Test|0_0(ulong,ulong) compiles to

       xor      eax, eax
       align    [0 bytes for IG03]
G_M000_IG03:                ;; offset=0x0002
       movsxd   r8, eax
       shl      r8, 5
       vmovups  ymm0, ymmword ptr [rcx+r8]
       vmovaps  ymm1, ymm0
       vmulpd   xmm1, xmm1, xmm1
       vextractf128 xmm0, ymm0, 1
       vmulpd   xmm0, xmm0, xmm0
       vinsertf128 ymm0, ymm1, xmm0, 1
       vmovups  ymmword ptr [rdx+r8], ymm0
       inc      eax
       cmp      eax, 5
       jl       SHORT G_M000_IG03
       vzeroupper 
       ret

Regression?

No

Known Workarounds

Replace Vector256.* method calls (or operators) on paths that use double or float with respective Avx.* alternatives.

This is, however, a significant performance trap the users may not be aware of until the code is executed on or targeted at the system with this specific ISA support flags configuration.

The resulting regressions on more complicated code can result in worse performance than straight-up disabling AVX instead.

Ideally, .NET should not have this kind of rough edges around Vector APIs.

Thanks!

Configuration

.NET SDK:
 Version:           9.0.100-rc.2.24452.3
 Commit:            c294e97a49
 Workload version:  9.0.100-manifests.ba8980d7
 MSBuild version:   17.12.0-preview-24429-01+01e53f416

Other information

No response

dotnet-policy-service[bot] commented 2 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

EgorBo commented 2 months ago

I suspect we have a poor test coverage for AVX1-only hardware (that will fail on AVX2 instructions)

EgorBo commented 2 months ago

Ah, sorry, I though it emits invalid instructions, that's why I put 9.0.0 milestone.

dotnet / runtime