dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Slightly suboptimal unrolled SequenceEqual codegen for ROS<byte> vs ROS<char> #95693

Open neon-sunset opened 9 months ago

neon-sunset commented 9 months ago

Description

It appears that ROS<char> produces optimal codegen when calling span.SequenceEqual("[literal]"), which is not the case for ROS<byte> with span.SequenceEqual("[literal]"u8).

Configuration

.NET SDK:
 Version:           8.0.100
 Commit:            57efcf1350
 Workload version:  8.0.100-manifests.71b9f198

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.22631
 OS Platform: Windows
 RID:         win-x64
 Base Path:   C:\Program Files\dotnet\sdk\8.0.100\

Regression?

No

Data

C#

// Pad both to be 32B
static bool Eq8(ReadOnlySpan<byte> input) =>
    input.SequenceEqual("Hello, World! Hello, World! ----"u8);

static bool Eq16(ReadOnlySpan<char> input) =>
    input.SequenceEqual("Hello, World! --");

Eq8

; Method Program:<<Main>$>g__Eq8|0_1(System.ReadOnlySpan`1[ubyte]):bool (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       vzeroupper 

G_M000_IG02:                ;; offset=0x0003
       mov      rax, 0x215B81EDDA8
       mov      rdx, bword ptr [rcx]
       mov      ecx, dword ptr [rcx+0x08]
       cmp      ecx, 32
       jne      SHORT G_M000_IG04

G_M000_IG03:                ;; offset=0x0018
       vmovups  ymm0, ymmword ptr [rdx]
       vpcmpeqq ymm0, ymm0, ymmword ptr [rax]
       vpmovmskb eax, ymm0
       cmp      eax, -1
       sete     al
       movzx    rax, al
       jmp      SHORT G_M000_IG05

G_M000_IG04:                ;; offset=0x0030
       xor      eax, eax

G_M000_IG05:                ;; offset=0x0032
       vzeroupper 
       ret      
; Total bytes of code: 54

Eq16

; Method Program:<<Main>$>g__Eq16|0_2(System.ReadOnlySpan`1[ushort]):bool (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       vzeroupper 

G_M000_IG02:                ;; offset=0x0003
       cmp      dword ptr [rcx+0x08], 16
       jne      SHORT G_M000_IG04

G_M000_IG03:                ;; offset=0x0009
       mov      rax, bword ptr [rcx]
       vmovups  ymm0, ymmword ptr [rax]
       vpxor    ymm0, ymm0, ymmword ptr [reloc @RWD00]
       vptest   ymm0, ymm0
       sete     al
       movzx    rax, al
       jmp      SHORT G_M000_IG05

G_M000_IG04:                ;; offset=0x0025
       xor      eax, eax

G_M000_IG05:                ;; offset=0x0027
       vzeroupper 
       ret      
RWD00   dq  006C006C00650048h, 00570020002C006Fh, 0064006C0072006Fh, 002D002D00200021h
; Total bytes of code: 43
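
The Eq8 and Eq16 listings above differ mainly in how the 32-byte equality check is finished: Eq8 compares lanes and collapses the result into a bit mask (vpcmpeqq, vpmovmskb, cmp -1), while Eq16 XORs the two vectors and tests for zero (vpxor, vptest). A minimal sketch of the two strategies, written by hand with the System.Runtime.Intrinsics API, might look like the following. The class and method names are hypothetical, the sketch uses byte lanes where the listing uses qword lanes, and the code the JIT emits for it is not guaranteed to match either listing.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper, for illustration only; not part of the runtime.
static class VectorEqualitySketch
{
    // Strategy seen in the Eq8 listing: compare lanes, then collapse the
    // per-lane result into a bit mask and check that every bit is set.
    public static bool EqualsViaMask(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        if (a.Length != 32 || b.Length != 32)
            return false;

        Vector256<byte> va = Vector256.LoadUnsafe(ref MemoryMarshal.GetReference(a));
        Vector256<byte> vb = Vector256.LoadUnsafe(ref MemoryMarshal.GetReference(b));

        // One mask bit per byte lane; all 32 bits set means all lanes matched.
        return Vector256.Equals(va, vb).ExtractMostSignificantBits() == uint.MaxValue;
    }

    // Strategy seen in the Eq16 listing: XOR the inputs and test whether
    // the result is all zeros.
    public static bool EqualsViaXor(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        if (a.Length != 32 || b.Length != 32)
            return false;

        Vector256<byte> va = Vector256.LoadUnsafe(ref MemoryMarshal.GetReference(a));
        Vector256<byte> vb = Vector256.LoadUnsafe(ref MemoryMarshal.GetReference(b));

        // Any differing bit leaves a non-zero lane in the XOR result.
        return (va ^ vb) == Vector256<byte>.Zero;
    }
}
```

Both methods return the same answer for every input; the difference is only in the instruction sequence each pattern tends to map to.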

Analysis

For lengths that are not an exact multiple of the vector width, there are further differences.

Please see https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACY8pJ0hgYQYG8aH+mLBsAgQANgwCiARwAcACgBKMbABMA8gDsxATwDKAB2yaAPMB0YYAPgYBLTQYCuGAJQMAvFb4Cf9pxgA6PRhpRxhNMBgZR2wxeQAiAAkYMTEINAYAdWgxVQBCBmTU9Kyc/IY4Srh4x1kXAG4ab35mVhFxKWkWJRUNbX0jUzAAC2woGz9nN09mnzsHZyCQsIio0NiEorSM7Khcgsr4htnZ1uFRCRlZPWHoDB61LV1DYzMLa3n/aYmFwODQ8KRaIbJIpNJ5Gp1RrUHynITtS5dJA3O4PPrPQYmEZjH5fDy4xb/FZA9ZxUHFCHHagAXyAA
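
To illustrate what a length that does not round to the vector width involves, here is a minimal hand-written sketch of one common approach: two overlapping vector loads that together cover the whole range. It compares 20 bytes with two 16-byte loads at offsets 0 and 4. The class name, method name, and the 20-byte length are assumptions chosen for illustration, and this is not a claim about the exact code the JIT produces for either span type.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper, for illustration only; not part of the runtime.
static class OverlappingLoadSketch
{
    // Compares exactly 20 bytes using two overlapping 16-byte loads.
    public static bool Equals20(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        const int Length = 20; // assumed example length, not a multiple of 16
        if (a.Length != Length || b.Length != Length)
            return false;

        ref byte ra = ref MemoryMarshal.GetReference(a);
        ref byte rb = ref MemoryMarshal.GetReference(b);

        // Offset of the second load so that it ends at the last byte (20 - 16 = 4).
        nuint tail = (nuint)(Length - Vector128<byte>.Count);

        // Bytes 0..15 and bytes 4..19; the overlap (4..15) is simply checked twice.
        Vector128<byte> head = Vector128.LoadUnsafe(ref ra) ^ Vector128.LoadUnsafe(ref rb);
        Vector128<byte> last = Vector128.LoadUnsafe(ref ra, tail) ^ Vector128.LoadUnsafe(ref rb, tail);

        // Equal only if neither XOR result has any bit set.
        return (head | last) == Vector128<byte>.Zero;
    }
}
```

The overlap avoids a scalar tail loop at the cost of re-checking a few bytes, which is why it suits short, fixed lengths.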

ghost commented 9 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

Issue Details
Author: neon-sunset
Assignees: -
Labels: `tenet-performance`, `area-CodeGen-coreclr`
Milestone: -

EgorBo commented 9 months ago

The key difference between the two is that for u8 spans we unroll the comparison in a late phase (lowering), where we don't see that one of the inputs is a constant; we only know that the length is constant.

So at that point we don't have a mechanism to get the actual RVA content, but IMO it's not a big deal and the perf should be more or less the same.