dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.18k stars 4.72k forks source link

String.Replace(char, char) now slower in some cases #74771

Closed danmoseley closed 1 year ago

danmoseley commented 2 years ago

Moving discussion from the PR https://github.com/dotnet/runtime/pull/67049

@gfoidl, at least on my machine, comparing string.Replace in .NET 6 vs .NET 7, multiple examples I've tried have shown .NET 7 to have regressed, e.g.

const string Input = """
    Whose woods these are I think I know.
    His house is in the village though;
    He will not see me stopping here
    To watch his woods fill up with snow.
    My little horse must think it queer
    To stop without a farmhouse near
    Between the woods and frozen lake
    The darkest evening of the year.
    He gives his harness bells a shake
    To ask if there is some mistake.
    The only other sound’s the sweep
    Of easy wind and downy flake.
    The woods are lovely, dark and deep,
    But I have promises to keep,
    And miles to go before I sleep,
    And miles to go before I sleep.
    """;

[Benchmark]
public string Replace() => Input.Replace('I', 'U');

Method Runtime Mean Ratio Replace .NET 6.0 108.1 ns 1.00 Replace .NET 7.0 136.0 ns 1.26 Do you see otherwise?

@gfoidl

gfoidl commented yesterday Hm, that is not expected...

When i duplicate the string.Replace(char, char)-method in order to compare the old and the new implementation both on .NET 7 then I see

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1889 (21H1/May2021Update) Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores .NET SDK=7.0.100-preview.7.22377.5 [Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT DefaultJob : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT

Method Mean Error StdDev Median Ratio RatioSD Default 142.0 ns 3.48 ns 9.98 ns 138.6 ns 1.00 0.00 PR 132.9 ns 2.68 ns 3.40 ns 132.8 ns 0.92 0.07 so a result I'd expect, as after the vectorized loop 6 chars are remaining that the old-code processes in the for-loop whilst the new-code does one vectorized pass.

I checked the dasm (via DisassemblyDiagnoser of BDN) and that looks OK.

Can this be something from different machine-code layout (loops), PGO, etc. that causes the difference between .NET 6 and .NET 7? How can I investigate this further -- need some guidance on how to check code-layout please.

@stephentoub stephentoub commented yesterday • Thanks, @gfoidl. Do you see a similar 6 vs 7 difference as I do? (It might not be specific to this PR.) @EgorBo, can you advise?

@tannergooding tannergooding commented yesterday When i duplicate the string.Replace(char, char)-method in order to compare the old and the new implementation both on .NET 7 then I see

This could be related to stale PGO data

@danmoseley danmoseley commented yesterday Is there POGO data en-route that has trained with this change in place? I am not sure how to follow it.

@danmoseley danmoseley commented yesterday Also, it wouldn't matter here, but are we consuming POGO data trained on main bits in the release branches?

@stephentoub stephentoub commented yesterday • I don't think this particular case is related to stale PGO data. I set COMPlus_JitDisablePGO=1, and I still see an ~20% regression from .NET 6 to .NET 7.

@danmoseley danmoseley commented 21 hours ago • I ran the example above with

    var config = DefaultConfig.Instance
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
        .AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
        .AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"))
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"))
        .AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0").AsBaseline())
        .AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"));
    BenchmarkRunner.Run(typeof(Program).Assembly, args: args, config: config);

and got

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2) Intel Core i7-10510U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores .NET SDK=7.0.100-rc.2.22426.5 [Host] : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2 Job-DGTURM : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2 Job-PYGDYG : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2 Job-ZEPFOF : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2 Job-PSEWWK : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256 Job-WGVIGL : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2 Job-HBSVYM : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2 Job-VWWZUC : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2 Job-LDCOEC : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256

Method EnvironmentVariables Runtime Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
Replace COMPlus_JitDisablePGO=0 .NET 6.0 130.5 ns 6.76 ns 18.51 ns 124.0 ns 0.92 0.17 0.3269 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=0 .NET 7.0 144.0 ns 2.95 ns 5.54 ns 142.5 ns 1.00 0.00 0.3271 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=0 .NET Core 3.1 822.1 ns 16.09 ns 23.07 ns 814.0 ns 5.69 0.31 0.3262 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=0 .NET Framework 4.8 750.2 ns 28.86 ns 82.82 ns 730.3 ns 4.97 0.49 0.3262 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1 .NET 6.0 127.1 ns 2.64 ns 4.75 ns 126.4 ns 0.88 0.05 0.3269 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1 .NET 7.0 144.5 ns 2.96 ns 5.97 ns 144.1 ns 1.01 0.06 0.3271 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1 .NET Core 3.1 936.2 ns 17.96 ns 22.06 ns 931.9 ns 6.50 0.37 0.3262 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1 .NET Framework 4.8 673.2 ns 12.41 ns 23.91 ns 670.5 ns 4.68 0.23 0.3262 1.34 KB 1.00

code https://gist.github.com/danmoseley/c31bc023d6ec671efebff7352e3b3251

(should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS )

@danmoseley danmoseley commented 21 hours ago and just for interest

Method EnvironmentVariables Runtime Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
Replace COMPlus_JitDisablePGO=1 .NET 6.0 127.8 ns 2.55 ns 5.91 ns 125.8 ns 0.95 0.05 0.3266 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1 .NET 7.0 141.0 ns 2.73 ns 2.42 ns 141.1 ns 1.00 0.00 0.3271 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 .NET 6.0 163.9 ns 3.35 ns 4.81 ns 163.8 ns 1.15 0.05 0.3269 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 .NET 7.0 184.9 ns 3.59 ns 4.79 ns 183.7 ns 1.32 0.05 0.3271 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 .NET 6.0 176.1 ns 3.44 ns 4.09 ns 175.9 ns 1.25 0.03 0.3269 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 .NET 7.0 192.1 ns 3.81 ns 4.53 ns 190.1 ns 1.37 0.05 0.3271 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 .NET 6.0 1,057.4 ns 20.95 ns 40.86 ns 1,047.2 ns 7.65 0.35 0.3262 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 .NET 7.0 947.1 ns 13.34 ns 11.83 ns 948.3 ns 6.72 0.15 0.3262 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 .NET 6.0 496.0 ns 51.61 ns 152.17 ns 463.3 ns 3.67 1.67 0.3269 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 .NET 7.0 395.3 ns 14.32 ns 41.10 ns 388.4 ns 2.95 0.27 0.3271 1.34 KB 1.00

@gfoidl gfoidl commented 9 hours ago Do you see a similar 6 vs 7 difference as I do?

Yes (sorry for slow response, was Sunday...). @danmoseley thanks for your numbers.

This is the machine code I get (from BDN) when run @danmoseley's benchmark (.NET 7 only). Left there some comments.

; Program.Replace()
       mov       rcx,1C003C090A0
       mov       rcx,[rcx]
       mov       edx,49
       mov       r8d,55
       jmp       qword ptr [7FFEFA7430C0]
; Total bytes of code 30

; System.String.Replace(Char, Char)
       push      r15
       push      r14
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,28
       vzeroupper
       mov       rsi,rcx
       mov       edi,edx
       mov       ebx,r8d
       movzx     ecx,di
       movzx     r8d,bx
       cmp       ecx,r8d
       je        near ptr M01_L09
       lea       rcx,[rsi+0C]
       mov       r8d,[rsi+8]
       movsx     rdx,di
       call      qword ptr [7FFEFA7433C0]
       mov       ebp,eax
       test      ebp,ebp
       jge       short M01_L00
       mov       rax,rsi                ; uncommon case, could jump to M01_L09 instead
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L00:
       mov       ecx,[rsi+8]
       sub       ecx,ebp
       mov       r14d,ecx
       mov       ecx,[rsi+8]
       call      System.String.FastAllocateString(Int32)
       mov       r15,rax
       test      ebp,ebp
       jg        near ptr M01_L10       ; should be common path, I don't expect to jump to the end, then back to here
M01_L01:
       mov       eax,ebp
       lea       rax,[rsi+rax*2+0C]
       cmp       [r15],r15b
       mov       edx,ebp
       lea       rdx,[r15+rdx*2+0C]
       xor       ecx,ecx
       cmp       dword ptr [rsi+8],10
       jl        near ptr M01_L07
       movzx     r8d,di
       imul      r8d,10001              ; this is tracked in https://github.com/dotnet/runtime/issues/67038, .NET 6 has the same issue, so no difference expected
       vmovd     xmm0,r8d
       vpbroadcastd ymm0,xmm0           ; should be vpbroadcastb, see comment above
       movzx     r8d,bx
       imul      r8d,10001
       vmovd     xmm1,r8d
       vpbroadcastd ymm1,xmm1           ; vpbroadcastb (see above)
       cmp       r14,10
       jbe       short M01_L03
       add       r14,0FFFFFFFFFFFFFFF0
M01_L02:
       lea       r8,[rax+rcx*2]
       vmovupd   ymm2,[r8]
       vpcmpeqw  ymm3,ymm2,ymm0
       vpand     ymm4,ymm3,ymm1         ; the vpand, vpandn, vpor series should be vpblendvb, https://github.com/dotnet/runtime/issues/67039 tracked this
       vpandn    ymm2,ymm3,ymm2         ; the "duplicated code for string.Replace" method emits vpblendvb as expected, but
       vpor      ymm2,ymm4,ymm2         ; if string.Replace from .NET 7.0.0 (7.0.22.42212) (.NET SDK=7.0.100-rc.2.22426.5) is used, then it's this series
       lea       r8,[rdx+rcx*2]
       vmovupd   [r8],ymm2
       add       rcx,10
       cmp       rcx,r14
       jb        short M01_L02
M01_L03:
       mov       ecx,[rsi+8]
       add       ecx,0FFFFFFF0
       add       rsi,0C
       lea       rsi,[rsi+rcx*2]
       vmovupd   ymm2,[rsi]
       vpcmpeqw  ymm3,ymm2,ymm0
       vpand     ymm0,ymm3,ymm1
       vpandn    ymm1,ymm3,ymm2
       vpor      ymm2,ymm0,ymm1
       lea       rax,[r15+0C]
       lea       rax,[rax+rcx*2]
       vmovupd   [rax],ymm2
       jmp       short M01_L08
M01_L04:
       movzx     r8d,word ptr [rax+rcx*2]
       lea       r9,[rdx+rcx*2]
       movzx     r10d,di
       cmp       r8d,r10d
       je        short M01_L05          ; not relevant for .NET 6 -> .NET 7 comparison in this test-case, but
       jmp       short M01_L06          ; one jump could be avoided?!
M01_L05:
       movzx     r8d,bx
M01_L06:
       mov       [r9],r8w
       inc       rcx
M01_L07:
       cmp       rcx,r14
       jb        short M01_L04
M01_L08:
       mov       rax,r15
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L09:                                ; expect the mov rax,{r15,rsi} the epilogs are the same, can they be collapsed to
       mov       rax,rsi                ; get less machine code?
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L10:                                ; this block should be common enough, so should be on the jump-root (see comment above)
       cmp       [r15],r15b             ; it's the Memmove-call
       lea       rcx,[r15+0C]
       lea       rdx,[rsi+0C]
       mov       r8d,ebp
       add       r8,r8
       call      qword ptr [7FFEFA7399F0]
       jmp       near ptr M01_L01
; Total bytes of code 383

So from code-layout one major difference to .NET 6 is that the call to System.Buffer.Memmove is moved out of the hot-path. But I doubt that this allone is the cause for the regression.

I also wonder why vpblendvb is gone when using string.Replace in the benchmark from .NET-bits. If I use a string.Replace-duplicated code for the benchmark, then it's emitted which is what I expect as https://github.com/dotnet/runtime/commit/10d8a36ab669ac95f554e5efcc3c8780b5c50f11 got merged on 2022-05-25. But that shouldn't cause the regression either, as for .NET 6 the same series of vector-instruction are emitted.

The beginning of the method, right after the prolog, looks different between .NET 6 and .NET 7, although this PR didn't change anything here. I don't expect that this causes the regression, as with the given input the vectorized loop with 33 iterations should be dominant enough (just my feeling, maybe wrong).

So far the "static analysis", but I doubt this is enough. With Intel VTune I see some results, but with my interpretation the conclusions are just the same as stated in this comment. I hope some JIT experts can shed some light on this (and give some advices on how to investigate, as I'm eager to learn).

Machine code for .NET 6 (for reference) ```asm ; System.String.Replace(Char, Char) push r15 push r14 push rdi push rsi push rbp push rbx sub rsp,28 vzeroupper mov rsi,rcx movzx edi,dx movzx ebx,r8w cmp edi,ebx jne short M01_L00 mov rax,rsi vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret M01_L00: lea rbp,[rsi+0C] mov rcx,rbp mov r14d,[rsi+8] mov r8d,r14d mov edx,edi call System.SpanHelpers.IndexOf(Char ByRef, Char, Int32) mov r15d,eax test r15d,r15d jge short M01_L01 mov rax,rsi vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret M01_L01: mov esi,r14d sub esi,r15d mov ecx,r14d call System.String.FastAllocateString(Int32) mov r14,rax test r15d,r15d jle short M01_L02 cmp [r14],r14d lea rcx,[r14+0C] mov rdx,rbp mov r8d,r15d add r8,r8 call System.Buffer.Memmove(Byte ByRef, Byte ByRef, UIntPtr) M01_L02: movsxd rax,r15d add rax,rax add rbp,rax cmp [r14],r14d lea rdx,[r14+0C] add rdx,rax cmp esi,10 jl short M01_L04 imul eax,edi,10001 vmovd xmm0,eax vpbroadcastd ymm0,xmm0 imul eax,ebx,10001 vmovd xmm1,eax vpbroadcastd ymm1,xmm1 M01_L03: vmovupd ymm2,[rbp] vpcmpeqw ymm3,ymm2,ymm0 vpand ymm4,ymm1,ymm3 vpandn ymm2,ymm3,ymm2 vpor ymm2,ymm4,ymm2 vmovupd [rdx],ymm2 add rbp,20 add rdx,20 add esi,0FFFFFFF0 cmp esi,10 jge short M01_L03 M01_L04: test esi,esi jle short M01_L08 nop word ptr [rax+rax] M01_L05: movzx eax,word ptr [rbp] mov rcx,rdx cmp eax,edi je short M01_L06 jmp short M01_L07 M01_L06: mov eax,ebx M01_L07: mov [rcx],ax add rbp,2 add rdx,2 dec esi test esi,esi jg short M01_L05 M01_L08: mov rax,r14 vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret ; Total bytes of code 307 ```

@AndyAyersMS

AndyAyersMS commented 2 hours ago (should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS )

Hard to say without looking deeper -- from the .NET 7 code above I would guess PGO is driving the code layout changes.

For the .NET 7 you can use DOTNET_JitDIsasm in BDN to obtain the jit disasm which will tell you if there was PGO found (at least for the root method).

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

Issue Details
Moving discussion from the PR https://github.com/dotnet/runtime/pull/67049 @gfoidl, at least on my machine, comparing string.Replace in .NET 6 vs .NET 7, multiple examples I've tried have shown .NET 7 to have regressed, e.g. ```c# const string Input = """ Whose woods these are I think I know. His house is in the village though; He will not see me stopping here To watch his woods fill up with snow. My little horse must think it queer To stop without a farmhouse near Between the woods and frozen lake The darkest evening of the year. He gives his harness bells a shake To ask if there is some mistake. The only other sound’s the sweep Of easy wind and downy flake. The woods are lovely, dark and deep, But I have promises to keep, And miles to go before I sleep, And miles to go before I sleep. """; [Benchmark] public string Replace() => Input.Replace('I', 'U'); ``` Method Runtime Mean Ratio Replace .NET 6.0 108.1 ns 1.00 Replace .NET 7.0 136.0 ns 1.26 Do you see otherwise? @gfoidl [gfoidl](https://github.com/gfoidl) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229429727) Hm, that is not expected... When i duplicate the string.Replace(char, char)-method in order to compare the old and the new implementation both on .NET 7 then I see BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1889 (21H1/May2021Update) Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores .NET SDK=7.0.100-preview.7.22377.5 [Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT DefaultJob : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT Method Mean Error StdDev Median Ratio RatioSD Default 142.0 ns 3.48 ns 9.98 ns 138.6 ns 1.00 0.00 PR 132.9 ns 2.68 ns 3.40 ns 132.8 ns 0.92 0.07 so a result I'd expect, as after the vectorized loop 6 chars are remaining that the old-code processes in the for-loop whilst the new-code does one vectorized pass. I checked the dasm (via DisassemblyDiagnoser of BDN) and that looks OK. Can this be something from different machine-code layout (loops), PGO, etc. that causes the difference between .NET 6 and .NET 7? How can I investigate this further -- need some guidance on how to check code-layout please. @stephentoub [stephentoub](https://github.com/stephentoub) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229431842) • Thanks, @gfoidl. Do you see a similar 6 vs 7 difference as I do? (It might not be specific to this PR.) @EgorBo, can you advise? @tannergooding [tannergooding](https://github.com/tannergooding) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229492295) When i duplicate the string.Replace(char, char)-method in order to compare the old and the new implementation both on .NET 7 then I see This could be related to stale PGO data @danmoseley [danmoseley](https://github.com/danmoseley) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229507483) Is there POGO data en-route that has trained with this change in place? I am not sure how to follow it. @danmoseley [danmoseley](https://github.com/danmoseley) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229507793) Also, it wouldn't matter here, but are we consuming POGO data trained on main bits in the release branches? @stephentoub [stephentoub](https://github.com/stephentoub) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229508682) • I don't think this particular case is related to stale PGO data. I set COMPlus_JitDisablePGO=1, and I still see an ~20% regression from .NET 6 to .NET 7. @danmoseley [danmoseley](https://github.com/danmoseley) commented [21 hours ago](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229552351) • I ran the example above with var config = DefaultConfig.Instance .AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1")) .AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1")) .AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1")) .AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1")) .AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0")) .AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0")) .AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0").AsBaseline()) .AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0")); BenchmarkRunner.Run(typeof(Program).Assembly, args: args, config: config); and got BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2) Intel Core i7-10510U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores .NET SDK=7.0.100-rc.2.22426.5 [Host] : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2 Job-DGTURM : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2 Job-PYGDYG : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2 Job-ZEPFOF : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2 Job-PSEWWK : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256 Job-WGVIGL : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2 Job-HBSVYM : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2 Job-VWWZUC : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2 Job-LDCOEC : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256 | Method | EnvironmentVariables | Runtime | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio | |-------- |------------------------ |------------------- |---------:|---------:|---------:|---------:|------:|--------:|-------:|----------:|------------:| | Replace | COMPlus_JitDisablePGO=0 | .NET 6.0 | 130.5 ns | 6.76 ns | 18.51 ns | 124.0 ns | 0.92 | 0.17 | 0.3269 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=0 | .NET 7.0 | 144.0 ns | 2.95 ns | 5.54 ns | 142.5 ns | 1.00 | 0.00 | 0.3271 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=0 | .NET Core 3.1 | 822.1 ns | 16.09 ns | 23.07 ns | 814.0 ns | 5.69 | 0.31 | 0.3262 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=0 | .NET Framework 4.8 | 750.2 ns | 28.86 ns | 82.82 ns | 730.3 ns | 4.97 | 0.49 | 0.3262 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1 | .NET 6.0 | 127.1 ns | 2.64 ns | 4.75 ns | 126.4 ns | 0.88 | 0.05 | 0.3269 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1 | .NET 7.0 | 144.5 ns | 2.96 ns | 5.97 ns | 144.1 ns | 1.01 | 0.06 | 0.3271 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1 | .NET Core 3.1 | 936.2 ns | 17.96 ns | 22.06 ns | 931.9 ns | 6.50 | 0.37 | 0.3262 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1 | .NET Framework 4.8 | 673.2 ns | 12.41 ns | 23.91 ns | 670.5 ns | 4.68 | 0.23 | 0.3262 | 1.34 KB | 1.00 | code https://gist.github.com/danmoseley/c31bc023d6ec671efebff7352e3b3251 (should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS ) @danmoseley [danmoseley](https://github.com/danmoseley) commented [21 hours ago](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229555272) and just for interest | Method | EnvironmentVariables | Runtime | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio | |-------- |---------------------------------------------------- |--------- |-----------:|---------:|----------:|-----------:|------:|--------:|-------:|----------:|------------:| | Replace | COMPlus_JitDisablePGO=1 | .NET 6.0 | 127.8 ns | 2.55 ns | 5.91 ns | 125.8 ns | 0.95 | 0.05 | 0.3266 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1 | .NET 7.0 | 141.0 ns | 2.73 ns | 2.42 ns | 141.1 ns | 1.00 | 0.00 | 0.3271 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 | .NET 6.0 | 163.9 ns | 3.35 ns | 4.81 ns | 163.8 ns | 1.15 | 0.05 | 0.3269 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 | .NET 7.0 | 184.9 ns | 3.59 ns | 4.79 ns | 183.7 ns | 1.32 | 0.05 | 0.3271 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 | .NET 6.0 | 176.1 ns | 3.44 ns | 4.09 ns | 175.9 ns | 1.25 | 0.03 | 0.3269 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 | .NET 7.0 | 192.1 ns | 3.81 ns | 4.53 ns | 190.1 ns | 1.37 | 0.05 | 0.3271 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 | .NET 6.0 | 1,057.4 ns | 20.95 ns | 40.86 ns | 1,047.2 ns | 7.65 | 0.35 | 0.3262 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 | .NET 7.0 | 947.1 ns | 13.34 ns | 11.83 ns | 948.3 ns | 6.72 | 0.15 | 0.3262 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 | .NET 6.0 | 496.0 ns | 51.61 ns | 152.17 ns | 463.3 ns | 3.67 | 1.67 | 0.3269 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 | .NET 7.0 | 395.3 ns | 14.32 ns | 41.10 ns | 388.4 ns | 2.95 | 0.27 | 0.3271 | 1.34 KB | 1.00 | @gfoidl [gfoidl](https://github.com/gfoidl) commented [9 hours ago](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229954693) Do you see a similar 6 vs 7 difference as I do? Yes (sorry for slow response, was Sunday...). @danmoseley thanks for your numbers. This is the machine code I get (from BDN) when run @danmoseley's benchmark (.NET 7 only). Left there some comments. ```asm ; Program.Replace() mov rcx,1C003C090A0 mov rcx,[rcx] mov edx,49 mov r8d,55 jmp qword ptr [7FFEFA7430C0] ; Total bytes of code 30 ; System.String.Replace(Char, Char) push r15 push r14 push rdi push rsi push rbp push rbx sub rsp,28 vzeroupper mov rsi,rcx mov edi,edx mov ebx,r8d movzx ecx,di movzx r8d,bx cmp ecx,r8d je near ptr M01_L09 lea rcx,[rsi+0C] mov r8d,[rsi+8] movsx rdx,di call qword ptr [7FFEFA7433C0] mov ebp,eax test ebp,ebp jge short M01_L00 mov rax,rsi ; uncommon case, could jump to M01_L09 instead vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret M01_L00: mov ecx,[rsi+8] sub ecx,ebp mov r14d,ecx mov ecx,[rsi+8] call System.String.FastAllocateString(Int32) mov r15,rax test ebp,ebp jg near ptr M01_L10 ; should be common path, I don't expect to jump to the end, then back to here M01_L01: mov eax,ebp lea rax,[rsi+rax*2+0C] cmp [r15],r15b mov edx,ebp lea rdx,[r15+rdx*2+0C] xor ecx,ecx cmp dword ptr [rsi+8],10 jl near ptr M01_L07 movzx r8d,di imul r8d,10001 ; this is tracked in https://github.com/dotnet/runtime/issues/67038, .NET 6 has the same issue, so no difference expected vmovd xmm0,r8d vpbroadcastd ymm0,xmm0 ; should be vpbroadcastb, see comment above movzx r8d,bx imul r8d,10001 vmovd xmm1,r8d vpbroadcastd ymm1,xmm1 ; vpbroadcastb (see above) cmp r14,10 jbe short M01_L03 add r14,0FFFFFFFFFFFFFFF0 M01_L02: lea r8,[rax+rcx*2] vmovupd ymm2,[r8] vpcmpeqw ymm3,ymm2,ymm0 vpand ymm4,ymm3,ymm1 ; the vpand, vpandn, vpor series should be vpblendvb, https://github.com/dotnet/runtime/issues/67039 tracked this vpandn ymm2,ymm3,ymm2 ; the "duplicated code for string.Replace" method emits vpblendvb as expected, but vpor ymm2,ymm4,ymm2 ; if string.Replace from .NET 7.0.0 (7.0.22.42212) (.NET SDK=7.0.100-rc.2.22426.5) is used, then it's this series lea r8,[rdx+rcx*2] vmovupd [r8],ymm2 add rcx,10 cmp rcx,r14 jb short M01_L02 M01_L03: mov ecx,[rsi+8] add ecx,0FFFFFFF0 add rsi,0C lea rsi,[rsi+rcx*2] vmovupd ymm2,[rsi] vpcmpeqw ymm3,ymm2,ymm0 vpand ymm0,ymm3,ymm1 vpandn ymm1,ymm3,ymm2 vpor ymm2,ymm0,ymm1 lea rax,[r15+0C] lea rax,[rax+rcx*2] vmovupd [rax],ymm2 jmp short M01_L08 M01_L04: movzx r8d,word ptr [rax+rcx*2] lea r9,[rdx+rcx*2] movzx r10d,di cmp r8d,r10d je short M01_L05 ; not relevant for .NET 6 -> .NET 7 comparison in this test-case, but jmp short M01_L06 ; one jump could be avoided?! M01_L05: movzx r8d,bx M01_L06: mov [r9],r8w inc rcx M01_L07: cmp rcx,r14 jb short M01_L04 M01_L08: mov rax,r15 vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret M01_L09: ; expect the mov rax,{r15,rsi} the epilogs are the same, can they be collapsed to mov rax,rsi ; get less machine code? vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret M01_L10: ; this block should be common enough, so should be on the jump-root (see comment above) cmp [r15],r15b ; it's the Memmove-call lea rcx,[r15+0C] lea rdx,[rsi+0C] mov r8d,ebp add r8,r8 call qword ptr [7FFEFA7399F0] jmp near ptr M01_L01 ; Total bytes of code 383 ``` So from code-layout one major difference to .NET 6 is that the call to System.Buffer.Memmove is moved out of the hot-path. But I doubt that this allone is the cause for the regression. I also wonder why vpblendvb is gone when using string.Replace in the benchmark from .NET-bits. If I use a string.Replace-duplicated code for the benchmark, then it's emitted which is what I expect as https://github.com/dotnet/runtime/commit/10d8a36ab669ac95f554e5efcc3c8780b5c50f11 got merged on 2022-05-25. But that shouldn't cause the regression either, as for .NET 6 the same series of vector-instruction are emitted. The beginning of the method, right after the prolog, looks different between .NET 6 and .NET 7, although this PR didn't change anything here. I don't expect that this causes the regression, as with the given input the vectorized loop with 33 iterations should be dominant enough (just my feeling, maybe wrong). So far the "static analysis", but I doubt this is enough. With Intel VTune I see some results, but with my interpretation the conclusions are just the same as stated in this comment. I hope some JIT experts can shed some light on this (and give some advices on how to investigate, as I'm eager to learn). Machine code for .NET 6 (for reference) @AndyAyersMS [AndyAyersMS](https://github.com/AndyAyersMS) commented [2 hours ago](https://github.com/dotnet/runtime/pull/67049#issuecomment-1230539420) (should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS ) Hard to say without looking deeper -- from the .NET 7 code above I would guess PGO is driving the code layout changes. For the .NET 7 you can use DOTNET_JitDIsasm in BDN to obtain the jit disasm which will tell you if there was PGO found (at least for the root method).
Author: danmoseley
Assignees: -
Labels: `area-System.Runtime`, `tenet-performance`, `regression-from-last-release`
Milestone: 7.0.0
danmoseley commented 2 years ago

For the .NET 7 you can use DOTNET_JitDIsasm in BDN to obtain the jit disasm which will tell you if there was PGO found (at least for the root method).

Given the regression exists with PGO disabled, maybe it isn't critical.

@dotnet/jit-contrib can someone on the JIT team comment on @gfoidl 's findings above? We are trying to figure out why there has been a regression in some cases.

@stephentoub did you try a variety of tests (length of input and number of hits) and see a general regression?

gfoidl commented 2 years ago

with PGO disabled

We can only disable dynamic PGO (D-PGO), not the static PGO. Correct? So if static PGO is the culprit here, then JitDisablePGO=1 shouldn't show serious effect, which the numbers prove.

gfoidl commented 2 years ago

With JitDisasm and JitDisablePGO not set I get

; Assembly listing for method String:Replace(ushort,ushort):String:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Static PGO: edge weights are invalid, and fgCalledCount is 47166
; 1 inlinees with PGO data; 16 single block inlinees; 0 inlinees without PGO data

What is the meaning of "edge weights are invalid"?

With JitDisablePGO set, then it's

; Assembly listing for method String:Replace(ushort,ushort):String:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; PGO data available, but JitDisablePgo > 0
; 0 inlinees with PGO data; 16 single block inlinees; 2 inlinees without PGO data
full assembly listing ### `JitDisablePGO` not set ```asm ; Assembly listing for method String:Replace(ushort,ushort):String:this ; Emitting BLENDED_CODE for X64 CPU with AVX - Windows ; Tier-1 compilation ; optimized code ; optimized using profile data ; rsp based frame ; fully interruptible ; with Static PGO: edge weights are invalid, and fgCalledCount is 47166 ; 1 inlinees with PGO data; 16 single block inlinees; 0 inlinees without PGO data G_M000_IG01: ;; offset=0000H 4157 push r15 4156 push r14 57 push rdi 56 push rsi 55 push rbp 53 push rbx 4883EC28 sub rsp, 40 C5F877 vzeroupper 488BF1 mov rsi, rcx 8BFA mov edi, edx 418BD8 mov ebx, r8d G_M000_IG02: ;; offset=0017H 0FB7CF movzx rcx, di 440FB7C3 movzx r8, bx 413BC8 cmp ecx, r8d 0F8433010000 je G_M000_IG16 G_M000_IG03: ;; offset=0027H 488D4E0C lea rcx, bword ptr [rsi+0CH] 448B4608 mov r8d, dword ptr [rsi+08H] 480FBFD7 movsx rdx, di FF154F1CC9FF call [SpanHelpers:IndexOfValueType(byref,short,int):int] 8BE8 mov ebp, eax 85ED test ebp, ebp 7D13 jge SHORT G_M000_IG06 G_M000_IG04: ;; offset=003FH 488BC6 mov rax, rsi G_M000_IG05: ;; offset=0042H C5F877 vzeroupper 4883C428 add rsp, 40 5B pop rbx 5D pop rbp 5E pop rsi 5F pop rdi 415E pop r14 415F pop r15 C3 ret G_M000_IG06: ;; offset=0052H 8B4E08 mov ecx, dword ptr [rsi+08H] 2BCD sub ecx, ebp 448BF1 mov r14d, ecx 8B4E08 mov ecx, dword ptr [rsi+08H] E85E5E82FF call String:FastAllocateString(int):String 4C8BF8 mov r15, rax 85ED test ebp, ebp 0F8F00010000 jg G_M000_IG18 G_M000_IG07: ;; offset=006DH 8BC5 mov eax, ebp 488D44460C lea rax, bword ptr [rsi+2*rax+0CH] 45383F cmp byte ptr [r15], r15b 8BD5 mov edx, ebp 498D54570C lea rdx, bword ptr [r15+2*rdx+0CH] 33C9 xor ecx, ecx 837E0810 cmp dword ptr [rsi+08H], 16 0F8CB8000000 jl G_M000_IG13 440FB7C7 movzx r8, di 4569C001000100 imul r8d, r8d, 0x10001 C4C1796EC0 vmovd xmm0, r8d C4E27D58C0 vpbroadcastd ymm0, ymm0 440FB7C3 movzx r8, bx 4569C001000100 imul r8d, r8d, 0x10001 C4C1796EC8 vmovd xmm1, r8d C4E27D58C9 vpbroadcastd ymm1, ymm1 4983FE10 cmp r14, 16 7633 jbe SHORT G_M000_IG09 4983C6F0 add r14, -16 G_M000_IG08: ;; offset=00BEH 4C8D0448 lea r8, bword ptr [rax+2*rcx] C4C17D1010 vmovupd ymm2, ymmword ptr[r8] C5ED75D8 vpcmpeqw ymm3, ymm2, ymm0 C5E5DBE1 vpand ymm4, ymm3, ymm1 C5E5DFD2 vpandn ymm2, ymm3, ymm2 C5DDEBD2 vpor ymm2, ymm4, ymm2 4C8D044A lea r8, bword ptr [rdx+2*rcx] C4C17D1110 vmovupd ymmword ptr[r8], ymm2 4883C110 add rcx, 16 493BCE cmp rcx, r14 72D5 jb SHORT G_M000_IG08 G_M000_IG09: ;; offset=00E9H 8B4E08 mov ecx, dword ptr [rsi+08H] 83C1F0 add ecx, -16 4883C60C add rsi, 12 488D344E lea rsi, bword ptr [rsi+2*rcx] C5FD1016 vmovupd ymm2, ymmword ptr[rsi] C5ED75D8 vpcmpeqw ymm3, ymm2, ymm0 C5E5DBC1 vpand ymm0, ymm3, ymm1 C5E5DFCA vpandn ymm1, ymm3, ymm2 C5FDEBD1 vpor ymm2, ymm0, ymm1 498D470C lea rax, bword ptr [r15+0CH] 488D0448 lea rax, bword ptr [rax+2*rcx] C5FD1110 vmovupd ymmword ptr[rax], ymm2 EB24 jmp SHORT G_M000_IG14 G_M000_IG10: ;; offset=0119H 440FB70448 movzx r8, word ptr [rax+2*rcx] 4C8D0C4A lea r9, bword ptr [rdx+2*rcx] 440FB7D7 movzx r10, di 453BC2 cmp r8d, r10d 7402 je SHORT G_M000_IG11 EB04 jmp SHORT G_M000_IG12 G_M000_IG11: ;; offset=012DH 440FB7C3 movzx r8, bx G_M000_IG12: ;; offset=0131H 66458901 mov word ptr [r9], r8w 48FFC1 inc rcx G_M000_IG13: ;; offset=0138H 493BCE cmp rcx, r14 72DC jb SHORT G_M000_IG10 G_M000_IG14: ;; offset=013DH 498BC7 mov rax, r15 G_M000_IG15: ;; offset=0140H C5F877 vzeroupper 4883C428 add rsp, 40 5B pop rbx 5D pop rbp 5E pop rsi 5F pop rdi 415E pop r14 415F pop r15 C3 ret G_M000_IG16: ;; offset=0150H 488BC6 mov rax, rsi G_M000_IG17: ;; offset=0153H C5F877 vzeroupper 4883C428 add rsp, 40 5B pop rbx 5D pop rbp 5E pop rsi 5F pop rdi 415E pop r14 415F pop r15 C3 ret G_M000_IG18: ;; offset=0163H 45383F cmp byte ptr [r15], r15b 498D4F0C lea rcx, bword ptr [r15+0CH] 488D560C lea rdx, bword ptr [rsi+0CH] 448BC5 mov r8d, ebp 4D03C0 add r8, r8 FF1536F786FF call [Buffer:Memmove(byref,byref,long)] E9EEFEFFFF jmp G_M000_IG07 ; Total bytes of code 383 ``` ### `JitDisablePGO` set ```asm ; Assembly listing for method String:Replace(ushort,ushort):String:this ; Emitting BLENDED_CODE for X64 CPU with AVX - Windows ; Tier-1 compilation ; optimized code ; rsp based frame ; fully interruptible ; PGO data available, but JitDisablePgo > 0 ; 0 inlinees with PGO data; 16 single block inlinees; 2 inlinees without PGO data G_M000_IG01: ;; offset=0000H 4157 push r15 4156 push r14 4154 push r12 57 push rdi 56 push rsi 55 push rbp 53 push rbx 4883EC20 sub rsp, 32 C5F877 vzeroupper 488BF1 mov rsi, rcx G_M000_IG02: ;; offset=0014H 0FB7FA movzx rdi, dx 410FB7D8 movzx rbx, r8w 3BFB cmp edi, ebx 7515 jne SHORT G_M000_IG05 G_M000_IG03: ;; offset=001FH 488BC6 mov rax, rsi G_M000_IG04: ;; offset=0022H C5F877 vzeroupper 4883C420 add rsp, 32 5B pop rbx 5D pop rbp 5E pop rsi 5F pop rdi 415C pop r12 415E pop r14 415F pop r15 C3 ret G_M000_IG05: ;; offset=0034H 488D6E0C lea rbp, bword ptr [rsi+0CH] 488BCD mov rcx, rbp 448B7608 mov r14d, dword ptr [rsi+08H] 458BC6 mov r8d, r14d 480FBFD2 movsx rdx, dx FF15BC40C9FF call [SpanHelpers:IndexOfValueType(byref,short,int):int] 448BF8 mov r15d, eax 4585FF test r15d, r15d 7D15 jge SHORT G_M000_IG07 488BC6 mov rax, rsi G_M000_IG06: ;; offset=0057H C5F877 vzeroupper 4883C420 add rsp, 32 5B pop rbx 5D pop rbp 5E pop rsi 5F pop rdi 415C pop r12 415E pop r14 415F pop r15 C3 ret G_M000_IG07: ;; offset=0069H 418BCE mov ecx, r14d 412BCF sub ecx, r15d 8BF1 mov esi, ecx 418BCE mov ecx, r14d E8C78282FF call String:FastAllocateString(int):String 4C8BE0 mov r12, rax 4585FF test r15d, r15d 7E18 jle SHORT G_M000_IG08 45382424 cmp byte ptr [r12], r12b 498D4C240C lea rcx, bword ptr [r12+0CH] 488BD5 mov rdx, rbp 458BC7 mov r8d, r15d 4D03C0 add r8, r8 FF15971C87FF call [Buffer:Memmove(byref,byref,long)] G_M000_IG08: ;; offset=0099H 458BC7 mov r8d, r15d 4D03C0 add r8, r8 498D0428 lea rax, bword ptr [r8+rbp] 45382424 cmp byte ptr [r12], r12b 4B8D54040C lea rdx, bword ptr [r12+r8+0CH] 33C9 xor ecx, ecx 4183FE10 cmp r14d, 16 0F8CAE000000 jl G_M000_IG14 4469C701000100 imul r8d, edi, 0x10001 C4C1796EC0 vmovd xmm0, r8d C4E27D58C0 vpbroadcastd ymm0, ymm0 4469C301000100 imul r8d, ebx, 0x10001 C4C1796EC8 vmovd xmm1, r8d C4E27D58C9 vpbroadcastd ymm1, ymm1 4883FE10 cmp rsi, 16 7636 jbe SHORT G_M000_IG10 4883C6F0 add rsi, -16 align [0 bytes for IG09] G_M000_IG09: ;; offset=00E4H 4C8D0409 lea r8, [rcx+rcx] 4E8D0C00 lea r9, bword ptr [rax+r8] C4C17D1011 vmovupd ymm2, ymmword ptr[r9] C5ED75D8 vpcmpeqw ymm3, ymm2, ymm0 C5E5DBE1 vpand ymm4, ymm3, ymm1 C5E5DFD2 vpandn ymm2, ymm3, ymm2 C5DDEBD2 vpor ymm2, ymm4, ymm2 4C03C2 add r8, rdx C4C17D1110 vmovupd ymmword ptr[r8], ymm2 4883C110 add rcx, 16 483BCE cmp rcx, rsi 72D2 jb SHORT G_M000_IG09 G_M000_IG10: ;; offset=0112H 4183C6F0 add r14d, -16 418BCE mov ecx, r14d 488D6C4D00 lea rbp, bword ptr [rbp+2*rcx] C5FD105500 vmovupd ymm2, ymmword ptr[rbp] C5ED75D8 vpcmpeqw ymm3, ymm2, ymm0 C5E5DBC1 vpand ymm0, ymm3, ymm1 C5E5DFCA vpandn ymm1, ymm3, ymm2 C5FDEBD1 vpor ymm2, ymm0, ymm1 498D44240C lea rax, bword ptr [r12+0CH] 488D0448 lea rax, bword ptr [rax+2*rcx] C5FD1110 vmovupd ymmword ptr[rax], ymm2 EB1F jmp SHORT G_M000_IG15 G_M000_IG11: ;; offset=0142H 440FB70448 movzx r8, word ptr [rax+2*rcx] 4C8D0C4A lea r9, bword ptr [rdx+2*rcx] 443BC7 cmp r8d, edi 7402 je SHORT G_M000_IG12 EB03 jmp SHORT G_M000_IG13 G_M000_IG12: ;; offset=0152H 448BC3 mov r8d, ebx G_M000_IG13: ;; offset=0155H 66458901 mov word ptr [r9], r8w 48FFC1 inc rcx G_M000_IG14: ;; offset=015CH 483BCE cmp rcx, rsi 72E1 jb SHORT G_M000_IG11 G_M000_IG15: ;; offset=0161H 498BC4 mov rax, r12 G_M000_IG16: ;; offset=0164H C5F877 vzeroupper 4883C420 add rsp, 32 5B pop rbx 5D pop rbp 5E pop rsi 5F pop rdi 415C pop r12 415E pop r14 415F pop r15 C3 ret ; Total bytes of code 374 ```

JitDisasm is really cool, big thanks for that feature 👍🏻

danmoseley commented 2 years ago

@gfoidl you mentioned you see the same regression vs 6.0 on this example, but do you still see the wins in the scenarios you tried while working on the PR -- or did they vanish somehow?

@DrewScoggins or @EgorBo can you help point me at recent results for our perf test here? I am not sure how to find them. Specifically I'd like to see what change they saw either side of the PR https://github.com/dotnet/runtime/pull/67049.

@dotnet/jit-contrib do you see anything in the assembly before and after that @gfoidl posted above? including the annotated version higher up.

EgorBo commented 2 years ago

We can only disable dynamic PGO (D-PGO), not the static PGO. Correct?

DOTNET_JitDisablePGO=1 should disable both. Static PGO specifically can be disabled by simply using DOTNET_ReadyToRun=0

What is the meaning of "edge weights are invalid"?

Nvm, it's just a sign that JIT made some mistakes calculating edges' weights - it happens in many cases.

@DrewScoggins or @EgorBo can you help point me at recent results for our perf test here?

Here is the link: https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/TestHistoryIndexIndex.html You can open any machine, e.g. Ubuntu x64 and then find the benchmark you need via Ctrl+F, e.g. https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362%2fSystem.Tests.Perf_String.Replace_Char(text%3a%20%22Hello%22%2c%20oldChar%3a%20%27a%27%2c%20newChar%3a%20%27b%27).html

gfoidl commented 2 years ago

do you still see the wins in the scenarios you tried while working on the PR -- or did they vanish somehow?

While working on the PR I duplicated the code of string.Replace in order to be able to compare both version (default and PR) within the same benchmark run on .NET 7. There was a clear win for the PR.

When I use the benchmark from your gist (.NET 6 vs .NET 7 and the bits provided by the SDK installer) with different input-strings (kind of manual random exploration) it seems like this is really weird. Most of the times .NET 6 is faster, even for inputs where the vectorized code-path isn't hit at all. Only if nothing needs to be replaced .NET 7 is consistently faster.

With the same inputs and the duplicated code (the benchmark run to compare old and new version of code with .NET 7) the new version is faster.

danmoseley commented 2 years ago

Bearing in mind the change went in on Aug 16.

Some scenarios show no change (but a regression earlier in the month) image

Some show a big improvement image image

Some a regression image

Not sure whether you can access these files @gfoidl I'm assuming not (and if not ideally we would fix that)

danmoseley commented 2 years ago

With the same inputs and the duplicated code (the benchmark run to compare old and new version of code with .NET 7) the new version is faster.

This is curious .. what could cause the standalone to be faster, but in the product slower .. especially if PGO is disabled... code alignment?

gfoidl commented 2 years ago

Yes I can access these files.

It's also interesting that arm64 / ubuntu 20.04 shows improvements, e.g. grafik

Wasm (when I interpret "CompilationMode=wasm,RunKind=micro" correct) shows a regression, and I assume that wasm doesn't support vectors (or does it).

code alignment?

This is my bet. Some results in the charts look unstable -- which might be from alignment (amongst others).

What bothers me is that with the duplicated code the theory of "processing the remaining elements in one vectorized pass is faster" is confirmed. But with the shipped bits it is not clear, rather it regresses. For sure, theory is one thing and measurements the one real thing, but a couple of vector instructions should beat a scalar loop with up to 16 iterations (AVX machine).

danmoseley commented 2 years ago

@AndyAyersMS @kunalspathak is there some way we can help determine whether it is an alignment issue?

AndyAyersMS commented 2 years ago

Not sure whether you can access these files @gfoidl I'm assuming not (and if not ideally we would fix that)

The perf history should be visible by anyone.

@AndyAyersMS @kunalspathak is there some way we can help determine whether it is an alignment issue?

For code alignment you can (with suitable checked jit) also set DOTNET_JitDasmWithAlignmentBoundaries=1. But there is subtlety here, in particular around the interaction of jumps and 32 byte boundaries per the intel jcc erratum. We currently do not have the ability to mitigate this in the jit.

Give we're dealing with strings. data alignment may also play a role. You might try rerunning the BDN results with increased numbers of measurement intervals and memory randomization (--memoryRandomization --iterationCount 100).

gfoidl commented 2 years ago

Running the benchmarks with --memoryRandomization true doesn't change the picture in general for the comparison of .NET 6 <-> .NET 7.

What's a "with suitable checked jit"? The checked build with one can get the JIS-dasm?

What else can I do to move forward here?

kunalspathak commented 2 years ago

@AndyAyersMS @kunalspathak is there some way we can help determine whether it is an alignment issue?

In both .NET 6 and .NET 7, the vectorized loop is not aligned and I don't see any JCC erratum coming in the way in those loop.s

.NET 6 code ```asm ; Assembly listing for method System.String:Replace(ushort,ushort):System.String:this ; Emitting BLENDED_CODE for X64 CPU with AVX - Windows ; optimized code ; rsp based frame ; fully interruptible ; Final local variable assignments ; ; V00 this [V00,T08] ( 6, 4 ) ref -> rsi this class-hnd single-def ; V01 arg1 [V01,T09] ( 3, 3 ) ushort -> rdx single-def ; V02 arg2 [V02,T10] ( 3, 3 ) ushort -> r8 single-def ; V03 loc0 [V03,T14] ( 4, 2 ) int -> r15 ; V04 loc1 [V04,T00] ( 9, 25.50) int -> rsi ; V05 loc2 [V05,T12] ( 6, 3 ) ref -> r14 class-hnd single-def ; V06 loc3 [V06,T15] ( 4, 2 ) int -> r15 ; V07 loc4 [V07,T01] ( 7, 24.50) byref -> rbp ; V08 loc5 [V08,T02] ( 7, 24.50) byref -> rdx ; V09 loc6 [V09,T26] ( 2, 4.50) simd32 -> mm0 ld-addr-op ; V10 loc7 [V10,T27] ( 2, 4.50) simd32 -> mm1 ld-addr-op ; V11 loc8 [V11,T23] ( 3, 12 ) simd32 -> mm2 ; V12 loc9 [V12,T24] ( 3, 12 ) simd32 -> mm3 ; V13 loc10 [V13,T25] ( 2, 8 ) simd32 -> mm2 ; V14 loc11 [V14,T03] ( 3, 10 ) ushort -> rax ; V15 OutArgs [V15 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace" ; V16 tmp1 [V16,T04] ( 3, 8 ) byref -> rcx ; V17 tmp2 [V17,T05] ( 3, 8 ) byref -> rcx ; V18 tmp3 [V18,T06] ( 3, 8 ) int -> rax ; V19 tmp4 [V19,T17] ( 2, 2 ) byref -> rcx single-def "Inlining Arg" ; V20 tmp5 [V20,T18] ( 2, 2 ) byref -> rdx single-def "Inlining Arg" ;* V21 tmp6 [V21 ] ( 0, 0 ) byref -> zero-ref "impAppendStmt" ; V22 tmp7 [V22,T21] ( 2, 2 ) long -> r8 "Inlining Arg" ;* V23 tmp8 [V23 ] ( 0, 0 ) byref -> zero-ref "impAppendStmt" ;* V24 tmp9 [V24 ] ( 0, 0 ) byref -> zero-ref single-def "impAppendStmt" ;* V25 tmp10 [V25 ] ( 0, 0 ) byref -> zero-ref single-def "impAppendStmt" ;* V26 tmp11 [V26 ] ( 0, 0 ) long -> zero-ref "Inlining Arg" ; V27 tmp12 [V27,T19] ( 2, 2 ) byref -> rbp single-def "Inlining Arg" ; V28 tmp13 [V28,T20] ( 2, 2 ) byref -> rdx single-def "Inlining Arg" ;* V29 tmp14 [V29 ] ( 0, 0 ) simd32 -> zero-ref "Inlining Arg" ; V30 cse0 [V30,T22] ( 3, 1.50) long -> rax "CSE - moderate" ; V31 cse1 [V31,T07] ( 5, 7 ) int -> rdi "CSE - aggressive" ; V32 cse2 [V32,T11] ( 4, 4.50) int -> rbx "CSE - moderate" ; V33 cse3 [V33,T16] ( 4, 2 ) int -> r14 "CSE - moderate" ; V34 cse4 [V34,T13] ( 4, 2 ) byref -> rbp "CSE - moderate" ; ; Lcl frame size = 40 G_M34983_IG01: push r15 push r14 push rdi push rsi push rbp push rbx sub rsp, 40 vzeroupper mov rsi, rcx ;; bbWeight=1 PerfScore 7.50 G_M34983_IG02: movzx rdi, dx movzx rbx, r8w cmp edi, ebx jne SHORT G_M34983_IG05 ;; bbWeight=1 PerfScore 1.75 G_M34983_IG03: mov rax, rsi ; ............................... 32B boundary ............................... ;; bbWeight=0.50 PerfScore 0.12 G_M34983_IG04: vzeroupper add rsp, 40 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret ;; bbWeight=0.50 PerfScore 2.62 G_M34983_IG05: lea rbp, bword ptr [rsi+12] mov rcx, rbp mov r14d, dword ptr [rsi+8] mov r8d, r14d mov edx, edi ; ............................... 32B boundary ............................... call System.SpanHelpers:IndexOf(byref,ushort,int):int mov r15d, eax test r15d, r15d jge SHORT G_M34983_IG07 mov rax, rsi ;; bbWeight=0.50 PerfScore 3.00 G_M34983_IG06: vzeroupper add rsp, 40 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (ret: 0 ; jcc erratum) 32B boundary ............................... ;; bbWeight=0.50 PerfScore 2.62 G_M34983_IG07: mov esi, r14d sub esi, r15d mov ecx, r14d call System.String:FastAllocateString(int):System.String mov r14, rax test r15d, r15d jle SHORT G_M34983_IG08 cmp dword ptr [r14], r14d lea rcx, bword ptr [r14+12] mov rdx, rbp ; ............................... 32B boundary ............................... mov r8d, r15d add r8, r8 call System.Buffer:Memmove(byref,byref,long) ;; bbWeight=0.50 PerfScore 3.75 G_M34983_IG08: movsxd rax, r15d add rax, rax add rbp, rax cmp dword ptr [r14], r14d lea rdx, bword ptr [r14+12] add rdx, rax cmp esi, 16 ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 1 ; jcc erratum) 32B boundary ............................... jl SHORT G_M34983_IG10 imul eax, edi, 0x10001 vmovd xmm0, eax vpbroadcastd ymm0, ymm0 imul eax, ebx, 0x10001 vmovd xmm1, eax vpbroadcastd ymm1, ymm1 ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (vpbroadcastd: 3) 32B boundary ............................... align [0 bytes] ;; bbWeight=0.50 PerfScore 8.38 G_M34983_IG09: vmovupd ymm2, ymmword ptr[rbp] vpcmpeqw ymm3, ymm2, ymm0 vpand ymm4, ymm1, ymm3 vpandn ymm2, ymm3, ymm2 vpor ymm2, ymm4, ymm2 vmovupd ymmword ptr[rdx], ymm2 ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (vmovupd: 2) 32B boundary ............................... add rbp, 32 add rdx, 32 add esi, -16 cmp esi, 16 jge SHORT G_M34983_IG09 ;; bbWeight=4 PerfScore 42.00 G_M34983_IG10: test esi, esi jle SHORT G_M34983_IG15 align [10 bytes] ; ............................... 32B boundary ............................... ;; bbWeight=0.50 PerfScore 0.75 G_M34983_IG11: movzx rax, word ptr [rbp] mov rcx, rdx cmp eax, edi je SHORT G_M34983_IG13 ;; bbWeight=4 PerfScore 14.00 G_M34983_IG12: jmp SHORT G_M34983_IG14 ;; bbWeight=2 PerfScore 4.00 G_M34983_IG13: mov eax, ebx ;; bbWeight=2 PerfScore 0.50 G_M34983_IG14: mov word ptr [rcx], ax add rbp, 2 add rdx, 2 dec esi test esi, esi jg SHORT G_M34983_IG11 ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jg: 0 ; jcc erratum) 32B boundary ............................... ;; bbWeight=4 PerfScore 12.00 G_M34983_IG15: mov rax, r14 ;; bbWeight=0.50 PerfScore 0.12 G_M34983_IG16: vzeroupper add rsp, 40 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret ;; bbWeight=0.50 PerfScore 2.62 ; Total bytes of code 307, prolog size 18, PerfScore 136.45, instruction count 106, allocated bytes for code 307 (MethodHash=695a7758) for method System.String:Replace(ushort,ushort):System.String:this ; ============================================================ ```
.NET 7 code ```asm ; Assembly listing for method System.String:Replace(ushort,ushort):System.String:this ; Emitting BLENDED_CODE for X64 CPU with AVX - Windows ; Tier-1 compilation ; optimized code ; optimized using profile data ; rsp based frame ; fully interruptible ; with Static PGO: edge weights are invalid, and fgCalledCount is 47166 ; 1 inlinees with PGO data; 16 single block inlinees; 0 inlinees without PGO data ; Final local variable assignments ; ; V00 this [V00,T00] ( 13, 5.10) ref -> rsi this class-hnd single-def ; V01 arg1 [V01,T01] ( 6, 4.00) ushort -> rdi single-def ; V02 arg2 [V02,T02] ( 5, 3 ) ushort -> rbx single-def ; V03 loc0 [V03,T03] ( 4, 2.21) int -> rbp ; V04 loc1 [V04,T06] ( 4, 0.10) long -> r14 ; V05 loc2 [V05,T05] ( 7, 0.10) ref -> r15 class-hnd single-def ; V06 loc3 [V06,T04] ( 5, 0.21) int -> rbp ; V07 loc4 [V07,T13] ( 3, 0 ) byref -> rax single-def ; V08 loc5 [V08,T14] ( 3, 0 ) byref -> rdx single-def ; V09 loc6 [V09,T07] ( 14, 0 ) long -> rcx ; V10 loc7 [V10,T17] ( 3, 0 ) simd32 -> mm0 ld-addr-op ; V11 loc8 [V11,T18] ( 3, 0 ) simd32 -> mm1 ld-addr-op ; V12 loc9 [V12,T08] ( 6, 0 ) simd32 -> mm2 ; V13 loc10 [V13,T09] ( 6, 0 ) simd32 -> mm3 ; V14 loc11 [V14,T12] ( 4, 0 ) simd32 -> mm2 ; V15 loc12 [V15,T25] ( 2, 0 ) long -> r14 ; V16 loc13 [V16,T19] ( 3, 0 ) ushort -> r8 ; V17 OutArgs [V17 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace" ; V18 tmp1 [V18,T15] ( 3, 0 ) byref -> r9 ; V19 tmp2 [V19,T16] ( 3, 0 ) byref -> r9 ; V20 tmp3 [V20,T20] ( 3, 0 ) int -> r8 ;* V21 tmp4 [V21 ] ( 0, 0 ) byref -> zero-ref "Inlining Arg" ;* V22 tmp5 [V22 ] ( 0, 0 ) int -> zero-ref "Inlining Arg" ;* V23 tmp6 [V23 ] ( 0, 0 ) short -> zero-ref "Inlining Arg" ; V24 tmp7 [V24,T21] ( 2, 0 ) byref -> rcx single-def "Inlining Arg" ; V25 tmp8 [V25,T22] ( 2, 0 ) byref -> rdx single-def "Inlining Arg" ; V26 tmp9 [V26,T26] ( 2, 0 ) long -> r8 "Inlining Arg" ; V27 tmp10 [V27,T23] ( 2, 0 ) byref -> r8 "Inlining Arg" ; V28 tmp11 [V28,T24] ( 2, 0 ) byref -> r8 "Inlining Arg" ;* V29 tmp12 [V29 ] ( 0, 0 ) simd32 -> zero-ref "Inlining Arg" ;* V30 tmp13 [V30 ] ( 0, 0 ) simd32 -> zero-ref "Inlining Arg" ; V31 tmp14 [V31,T10] ( 4, 0 ) byref -> rsi "Inlining Arg" ; V32 tmp15 [V32,T11] ( 4, 0 ) byref -> rax "Inlining Arg" ;* V33 tmp16 [V33 ] ( 0, 0 ) simd32 -> zero-ref "Inlining Arg" ;* V34 tmp17 [V34 ] ( 0, 0 ) simd32 -> zero-ref "Inlining Arg" ; ; Lcl frame size = 40 G_M34983_IG01: push r15 push r14 push rdi push rsi push rbp push rbx sub rsp, 40 vzeroupper mov rsi, rcx mov edi, edx mov ebx, r8d ;; size=23 bbWeight=1 PerfScore 8.00 G_M34983_IG02: movzx rcx, di movzx r8, bx cmp ecx, r8d ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 1 ; jcc erratum) 32B boundary ............................... je G_M34983_IG16 ;; size=16 bbWeight=1 PerfScore 1.75 G_M34983_IG03: lea rcx, bword ptr [rsi+0CH] mov r8d, dword ptr [rsi+08H] movsx rdx, di call [System.SpanHelpers:IndexOfValueType(byref,short,int):int] mov ebp, eax test ebp, ebp jge SHORT G_M34983_IG06 ;; size=24 bbWeight=1.00 PerfScore 7.25 G_M34983_IG04: mov rax, rsi ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 2) 32B boundary ............................... ;; size=3 bbWeight=0.90 PerfScore 0.22 G_M34983_IG05: vzeroupper add rsp, 40 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret ;; size=16 bbWeight=0.90 PerfScore 4.71 G_M34983_IG06: mov ecx, dword ptr [rsi+08H] sub ecx, ebp mov r14d, ecx mov ecx, dword ptr [rsi+08H] call System.String:FastAllocateString(int):System.String ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (call: 2 ; jcc erratum) 32B boundary ............................... mov r15, rax test ebp, ebp jg G_M34983_IG18 ;; size=27 bbWeight=0.10 PerfScore 0.72 G_M34983_IG07: mov eax, ebp lea rax, bword ptr [rsi+2*rax+0CH] cmp byte ptr [r15], r15b mov edx, ebp lea rdx, bword ptr [r15+2*rdx+0CH] xor ecx, ecx ; ............................... 32B boundary ............................... cmp dword ptr [rsi+08H], 16 jl G_M34983_IG13 movzx r8, di imul r8d, r8d, 0x10001 vmovd xmm0, r8d vpbroadcastd ymm0, ymm0 movzx r8, bx ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (movzx: 3) 32B boundary ............................... imul r8d, r8d, 0x10001 vmovd xmm1, r8d vpbroadcastd ymm1, ymm1 cmp r14, 16 jbe SHORT G_M34983_IG09 add r14, -16 ;; size=81 bbWeight=0 PerfScore 0.00 G_M34983_IG08: lea r8, bword ptr [rax+2*rcx] ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (lea: 2) 32B boundary ............................... vmovupd ymm2, ymmword ptr[r8] vpcmpeqw ymm3, ymm2, ymm0 vpand ymm4, ymm3, ymm1 vpandn ymm2, ymm3, ymm2 vpor ymm2, ymm4, ymm2 lea r8, bword ptr [rdx+2*rcx] vmovupd ymmword ptr[r8], ymm2 ; ............................... 32B boundary ............................... add rcx, 16 cmp rcx, r14 jb SHORT G_M34983_IG08 ;; size=43 bbWeight=0 PerfScore 0.00 G_M34983_IG09: mov ecx, dword ptr [rsi+08H] add ecx, -16 add rsi, 12 lea rsi, bword ptr [rsi+2*rcx] vmovupd ymm2, ymmword ptr[rsi] vpcmpeqw ymm3, ymm2, ymm0 vpand ymm0, ymm3, ymm1 ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (vpand: 3) 32B boundary ............................... vpandn ymm1, ymm3, ymm2 vpor ymm2, ymm0, ymm1 lea rax, bword ptr [r15+0CH] lea rax, bword ptr [rax+2*rcx] vmovupd ymmword ptr[rax], ymm2 jmp SHORT G_M34983_IG14 ;; size=48 bbWeight=0 PerfScore 0.00 G_M34983_IG10: movzx r8, word ptr [rax+2*rcx] lea r9, bword ptr [rdx+2*rcx] ; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (lea: 2) 32B boundary ............................... movzx r10, di cmp r8d, r10d je SHORT G_M34983_IG11 jmp SHORT G_M34983_IG12 ;; size=20 bbWeight=0 PerfScore 0.00 G_M34983_IG11: movzx r8, bx ;; size=4 bbWeight=0 PerfScore 0.00 G_M34983_IG12: mov word ptr [r9], r8w inc rcx ;; size=7 bbWeight=0 PerfScore 0.00 G_M34983_IG13: cmp rcx, r14 jb SHORT G_M34983_IG10 ;; size=5 bbWeight=0 PerfScore 0.00 G_M34983_IG14: mov rax, r15 ; ............................... 32B boundary ............................... ;; size=3 bbWeight=0 PerfScore 0.00 G_M34983_IG15: vzeroupper add rsp, 40 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret ;; size=16 bbWeight=0 PerfScore 0.00 G_M34983_IG16: mov rax, rsi ;; size=3 bbWeight=0 PerfScore 0.00 G_M34983_IG17: vzeroupper add rsp, 40 pop rbx pop rbp pop rsi pop rdi pop r14 ; ............................... 32B boundary ............................... pop r15 ret ;; size=16 bbWeight=0 PerfScore 0.00 G_M34983_IG18: cmp byte ptr [r15], r15b lea rcx, bword ptr [r15+0CH] lea rdx, bword ptr [rsi+0CH] mov r8d, ebp add r8, r8 call [System.Buffer:Memmove(byref,byref,long)] jmp G_M34983_IG07 ;; size=28 bbWeight=0 PerfScore 0.00 ```

One thing that I noticed is in .NET 6, Replace is optimized (I believe with QuickJitLoopBody), but in .NET 7, it goes through tiering.

danmoseley commented 2 years ago

NET 7, it goes through tiering.

Is there an environment variable to try that would force it to be optimized? Although, I thought BDN defeats tiering.

kunalspathak commented 2 years ago

NET 7, it goes through tiering.

Is there an environment variable to try that would force it to be optimized? Although, I thought BDN defeats tiering.

I did try COMPlus_TC_QuickJitForLoops=0, but the codegen doesn't change. So don't think that is an issue.

danmoseley commented 2 years ago

I believe @stephentoub you said that you'd tried various scenarios (lengths, position of hits if any) and it was slower in general. So this is not a case of "are the improvements worth the regressions" -- we have to either fix the regression or revert this change for 7.0 to get back to a known state. Given the schedule, we need to do one or the other by the end of the week.

Is there a clear next step for investigation, or should we revert or 7.0? We can continue working on the problem in order to have the change in 8.0

stephentoub commented 2 years ago

I spent a bit more time running various tests. I suspect this is actually not related to the Replace PR and instead related more to something allocation-related, like regions in .NET 7. I see comparable regressions with these:

const string Input = """
    Whose woods these are I think I know.
    His house is in the village though;
    He will not see me stopping here
    To watch his woods fill up with snow.
    My little horse must think it queer
    To stop without a farmhouse near
    Between the woods and frozen lake
    The darkest evening of the year.
    He gives his harness bells a shake
    To ask if there is some mistake.
    The only other sound’s the sweep
    Of easy wind and downy flake.
    The woods are lovely, dark and deep,
    But I have promises to keep,
    And miles to go before I sleep,
    And miles to go before I sleep.
    """;
private char[] _chars = Input.ToCharArray();

[Benchmark]
public string WithContent1() => new string(_chars);

[Benchmark]
public string WithContent2() => string.Create(Input.Length, Input, (dest, state) => state.AsSpan().CopyTo(dest));

[Benchmark]
public string WithoutContent1() => string.Create(Input.Length, Input, (dest, state) => { });

[Benchmark]
public string WithoutContent2() => new string('\0', Input.Length);
Method Runtime Mean Ratio
WithContent1 .NET 6.0 105.97 ns 1.00
WithContent1 .NET 7.0 120.65 ns 1.15
WithContent2 .NET 6.0 104.12 ns 1.00
WithContent2 .NET 7.0 122.60 ns 1.18
WithoutContent1 .NET 6.0 79.04 ns 1.00
WithoutContent1 .NET 7.0 103.15 ns 1.30
WithoutContent2 .NET 6.0 76.69 ns 1.00
WithoutContent2 .NET 7.0 100.13 ns 1.31
MichalPetryka commented 2 years ago

Does disabling GC regions help?

stephentoub commented 2 years ago

Does disabling GC regions help?

Latest RC2:

Method Mean
WithContent1 121.54 ns
WithContent2 121.04 ns
WithoutContent1 95.10 ns
WithoutContent2 93.52 ns

Latest RC2 w/ DOTNET_GCName="clrgc.dll"

Method Mean
WithContent1 107.23 ns
WithContent2 108.93 ns
WithoutContent1 87.14 ns
WithoutContent2 84.36 ns

@mangod9, is this expected?

mangod9 commented 2 years ago

some of similar smaller microbenchmarks have regressed based on GC writer barrier work. Please see: https://github.com/dotnet/runtime/issues/74014. @PeterSolMS has investigated and determined that the write barrier work should help with most real world workloads. Does COMPLUS_GCWriteBarrier=3 help improve perf for this?

@dotnet/gc

mrsharm commented 2 years ago

Figured this was a good opportunity to make use of the new tooling we created during quality week that automates running microbenchmarks and generates comparative results. (CC: @dotnet/gc)

Summary

I was able to repro this regression locally (.NET7 execution time is much higher than that of .NET6 for longer strings) where I ran all microbenchmarks matching the filter System.Tests.Perf_String.Replace_Char* for:

  1. .NET 6 - baseline
    • Command: dotnet run -f net6.0 --filter System.Tests.Perf_String.Replace_Char -c Release --noOverwrite --memory --artifacts C:\String.Replace_Char\baseline
  2. The Latest .NET 7 with the default COMPlus_GCWriteBarrier: 1 - writebarrier_1
    • Command: dotnet run -f net7.0 --filter System.Tests.Perf_String.Replace_Char -c Release --noOverwrite --memory --artifacts C:\String.Replace_Char\writebarrier_1
  3. The Latest .NET 7 with COMPlus_GCWriteBarrier=3 and COMPlus_EnableWriteXorExecute=0 - writebarrier_3
    • Command: dotnet run -f net7.0 --filter System.Tests.Perf_String.Replace_Char -c Release --noOverwrite --memory --artifacts C:\String.Replace_Char\writebarrier_3 --envVars COMPlus_GCWriteBarrier:3 COMPlus_EnableWriteXorExecute:0
  4. The Latest .NET 7 with Segments Enabled - segments
    • Command: dotnet run -f net7.0 --filter System.Tests.Perf_String.Replace_Char -c Release --noOverwrite --memory --artifacts C:\String.Replace_Char\segments --envVars COMPLUS_GCName:clrgc.dll

The main observation was that there was not a significant enough difference after setting COMPlus_GCWriteBarrier=3 and COMPlus_EnableWriteXorExecute=0 and therefore, we regressed by ~7.5% for .NET 7 in comparison to .NET 6 for System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi792e5gkuuzevo5qm8qt83edag7zovoe686gmtw730kms2i5xgji4xcp25287q68fvhwszd3mszht2uh7bchlgkj5qnq1x9m4lg7vwn8cq5l756akua6oyx9k71bmxbysnmhvxvlxde4k9maumfgxd8gxhxx4mwpph2ttyox9zilt3ylv1q9s4bopfuoa8qlrzodg2q67sh85wx4slcd6w7ufnendaxai633ove2ktbaxdt2sz6y6mo42473xd274gz833p6hj3mu77c4m4od9e5s8btxleh0efqnu9zj9rwtbk5758lio35b3q426j5fwwq1qyknfedrsmqyfw1m38mkkotdf7n0vr6p3erhy8dkzntr9fwjrslxjgrbegih0n6bpb5bfuy55bu65ce9kejcfifxwpcs05umrsb8kvd64q2iwugbbi7vd35g5ho0rff9rhombgzzaniyq7bbjbqr88jyw4ccgnoyl31of3a5thv0vg08gnrqzxas800hewtw8tnwgw5pav81ntdpdd62689x3iqpc317y82b3e2trbpdzieoxldaz009tz37gqmh4bdp1bv9lnl5s58udb11z0h7i2sdl5nbyhjyfzxwzezmp4qx0i3eyvsd3fg8sryq9jhlvkonnfcvb4snl4mcbimdzg49tzdhqjmfxfcq3p1st6b9x2xyevo17evpqp4yc4f2rm0f26ivr3t2f5m0boc44vituxaovcqy1jrkcs6im2kdu3jvcexx2k76egve63aon5a6nbxss4rcke90npmqp35qluf571ms160y2nhaqef835wah41qru8tauu362v0r8konl8", oldChar: 'b', newChar: '+').

Also, we didn't observe the regression for this large string for segments implying that this microbenchmark regressed once we enabled regions (full details below); unfortunately, the trend for this microbenchmark doesn't go back to before we enabled regions:

image

Source: https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362%2fSystem.Tests.Perf_String.Replace_Char(text%3a%20%22yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi79.html

Full Automated Report

Summary

BenchmarkDotNet=v0.13.1.1847-nightly, OS=Windows 11 (10.0.22000.856/21H2)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores

Microbenchmarks Considered:

Large Regressions (>20%)

Benchmark Name Baseline Comparand Baseline Mean Duration Comparand Mean Duration Δ Mean Duration Δ% Mean Duration

Large Improvements (>20%)

Benchmark Name Baseline Comparand Baseline Mean Duration Comparand Mean Duration Δ Mean Duration Δ% Mean Duration
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'z', newChar: 'y') baseline segments 7.1 5.59 -1.51 -21.29

Regressions (5% - 20%)

Benchmark Name Baseline Comparand Baseline Mean Duration Comparand Mean Duration Δ Mean Duration Δ% Mean Duration
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58l", oldChar: 'b', newChar: '+') baseline writebarrier_3 19.74 23.14 3.4 17.22
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58l", oldChar: 'b', newChar: '+') baseline writebarrier_1 19.74 21.81 2.07 10.49
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi792e5gkuuzevo5qm8qt83edag7zovoe686gmtw730kms2i5xgji4xcp25287q68fvhwszd3mszht2uh7bchlgkj5qnq1x9m4lg7vwn8cq5l756akua6oyx9k71bmxbysnmhvxvlxde4k9maumfgxd8gxhxx4mwpph2ttyox9zilt3ylv1q9s4bopfuoa8qlrzodg2q67sh85wx4slcd6w7ufnendaxai633ove2ktbaxdt2sz6y6mo42473xd274gz833p6hj3mu77c4m4od9e5s8btxleh0efqnu9zj9rwtbk5758lio35b3q426j5fwwq1qyknfedrsmqyfw1m38mkkotdf7n0vr6p3erhy8dkzntr9fwjrslxjgrbegih0n6bpb5bfuy55bu65ce9kejcfifxwpcs05umrsb8kvd64q2iwugbbi7vd35g5ho0rff9rhombgzzaniyq7bbjbqr88jyw4ccgnoyl31of3a5thv0vg08gnrqzxas800hewtw8tnwgw5pav81ntdpdd62689x3iqpc317y82b3e2trbpdzieoxldaz009tz37gqmh4bdp1bv9lnl5s58udb11z0h7i2sdl5nbyhjyfzxwzezmp4qx0i3eyvsd3fg8sryq9jhlvkonnfcvb4snl4mcbimdzg49tzdhqjmfxfcq3p1st6b9x2xyevo17evpqp4yc4f2rm0f26ivr3t2f5m0boc44vituxaovcqy1jrkcs6im2kdu3jvcexx2k76egve63aon5a6nbxss4rcke90npmqp35qluf571ms160y2nhaqef835wah41qru8tauu362v0r8konl8", oldChar: 'b', newChar: '+') baseline writebarrier_1 117.59 126.24 8.65 7.35
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi792e5gkuuzevo5qm8qt83edag7zovoe686gmtw730kms2i5xgji4xcp25287q68fvhwszd3mszht2uh7bchlgkj5qnq1x9m4lg7vwn8cq5l756akua6oyx9k71bmxbysnmhvxvlxde4k9maumfgxd8gxhxx4mwpph2ttyox9zilt3ylv1q9s4bopfuoa8qlrzodg2q67sh85wx4slcd6w7ufnendaxai633ove2ktbaxdt2sz6y6mo42473xd274gz833p6hj3mu77c4m4od9e5s8btxleh0efqnu9zj9rwtbk5758lio35b3q426j5fwwq1qyknfedrsmqyfw1m38mkkotdf7n0vr6p3erhy8dkzntr9fwjrslxjgrbegih0n6bpb5bfuy55bu65ce9kejcfifxwpcs05umrsb8kvd64q2iwugbbi7vd35g5ho0rff9rhombgzzaniyq7bbjbqr88jyw4ccgnoyl31of3a5thv0vg08gnrqzxas800hewtw8tnwgw5pav81ntdpdd62689x3iqpc317y82b3e2trbpdzieoxldaz009tz37gqmh4bdp1bv9lnl5s58udb11z0h7i2sdl5nbyhjyfzxwzezmp4qx0i3eyvsd3fg8sryq9jhlvkonnfcvb4snl4mcbimdzg49tzdhqjmfxfcq3p1st6b9x2xyevo17evpqp4yc4f2rm0f26ivr3t2f5m0boc44vituxaovcqy1jrkcs6im2kdu3jvcexx2k76egve63aon5a6nbxss4rcke90npmqp35qluf571ms160y2nhaqef835wah41qru8tauu362v0r8konl8", oldChar: 'b', newChar: '+') baseline writebarrier_3 117.59 125.53 7.94 6.75

Improvements (5% - 20%)

Benchmark Name Baseline Comparand Baseline Mean Duration Comparand Mean Duration Δ Mean Duration Δ% Mean Duration
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'z', newChar: 'y') baseline writebarrier_3 7.1 5.84 -1.26 -17.78
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'z', newChar: 'y') baseline writebarrier_1 7.1 5.88 -1.23 -17.26
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'i', newChar: 'I') baseline segments 16.78 15.71 -1.06 -6.34
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'l', newChar: '!') baseline segments 10.37 9.8 -0.57 -5.52
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'l', newChar: '!') baseline writebarrier_1 10.37 9.82 -0.55 -5.27

Stale (Same or percent difference within 5% margin)

Benchmark Name Baseline Comparand Baseline Mean Duration Comparand Mean Duration Δ Mean Duration Δ% Mean Duration
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'l', newChar: '!') baseline writebarrier_3 10.37 9.91 -0.46 -4.4
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'a', newChar: 'b') baseline segments 3.26 3.13 -0.13 -3.87
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'i', newChar: 'I') baseline writebarrier_1 16.78 16.18 -0.59 -3.55
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'i', newChar: 'I') baseline writebarrier_3 16.78 16.9 0.13 0.76
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi792e5gkuuzevo5qm8qt83edag7zovoe686gmtw730kms2i5xgji4xcp25287q68fvhwszd3mszht2uh7bchlgkj5qnq1x9m4lg7vwn8cq5l756akua6oyx9k71bmxbysnmhvxvlxde4k9maumfgxd8gxhxx4mwpph2ttyox9zilt3ylv1q9s4bopfuoa8qlrzodg2q67sh85wx4slcd6w7ufnendaxai633ove2ktbaxdt2sz6y6mo42473xd274gz833p6hj3mu77c4m4od9e5s8btxleh0efqnu9zj9rwtbk5758lio35b3q426j5fwwq1qyknfedrsmqyfw1m38mkkotdf7n0vr6p3erhy8dkzntr9fwjrslxjgrbegih0n6bpb5bfuy55bu65ce9kejcfifxwpcs05umrsb8kvd64q2iwugbbi7vd35g5ho0rff9rhombgzzaniyq7bbjbqr88jyw4ccgnoyl31of3a5thv0vg08gnrqzxas800hewtw8tnwgw5pav81ntdpdd62689x3iqpc317y82b3e2trbpdzieoxldaz009tz37gqmh4bdp1bv9lnl5s58udb11z0h7i2sdl5nbyhjyfzxwzezmp4qx0i3eyvsd3fg8sryq9jhlvkonnfcvb4snl4mcbimdzg49tzdhqjmfxfcq3p1st6b9x2xyevo17evpqp4yc4f2rm0f26ivr3t2f5m0boc44vituxaovcqy1jrkcs6im2kdu3jvcexx2k76egve63aon5a6nbxss4rcke90npmqp35qluf571ms160y2nhaqef835wah41qru8tauu362v0r8konl8", oldChar: 'b', newChar: '+') baseline segments 117.59 120.41 2.82 2.4
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'a', newChar: 'b') baseline writebarrier_3 3.26 3.36 0.1 2.97
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'a', newChar: 'b') baseline writebarrier_1 3.26 3.36 0.1 3.06
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58l", oldChar: 'b', newChar: '+') baseline segments 19.74 20.69 0.95 4.82
stephentoub commented 2 years ago

Does COMPLUS_GCWriteBarrier=3 help improve perf for this?

Not really, in fact on its own it seems to mostly make it worse. DOTNET_GCName="clrgc.dll" to disable regions consistently improves things:

Method EnvironmentVariables Mean Ratio
WithContent1 COMPLUS_GCWriteBarrier=3 118.50 ns 1.05
WithContent1 DOTNET_GCName=clrgc.dll 104.32 ns 0.94
WithContent1 DOTNET_GCName=clrgc.dll,COMPLUS_GCWriteBarrier=3 103.97 ns 0.93
WithContent1 Empty 111.99 ns 1.00
WithContent2 COMPLUS_GCWriteBarrier=3 119.47 ns 1.04
WithContent2 DOTNET_GCName=clrgc.dll 103.81 ns 0.91
WithContent2 DOTNET_GCName=clrgc.dll,COMPLUS_GCWriteBarrier=3 105.34 ns 0.92
WithContent2 Empty 114.48 ns 1.00
WithoutContent1 COMPLUS_GCWriteBarrier=3 93.04 ns 0.96
WithoutContent1 DOTNET_GCName=clrgc.dll 86.99 ns 0.89
WithoutContent1 DOTNET_GCName=clrgc.dll,COMPLUS_GCWriteBarrier=3 84.69 ns 0.87
WithoutContent1 Empty 97.29 ns 1.00
WithoutContent2 COMPLUS_GCWriteBarrier=3 92.96 ns 1.02
WithoutContent2 DOTNET_GCName=clrgc.dll 84.50 ns 0.92
WithoutContent2 DOTNET_GCName=clrgc.dll,COMPLUS_GCWriteBarrier=3 82.33 ns 0.90
WithoutContent2 Empty 91.44 ns 1.00
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System;

[Config(typeof(ConfigWithCustomEnvVars))]
public partial class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private sealed class ConfigWithCustomEnvVars : ManualConfig
    {
        public ConfigWithCustomEnvVars()
        {
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).AsBaseline());
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithEnvironmentVariables(new EnvironmentVariable("DOTNET_GCName", "clrgc.dll")));
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithEnvironmentVariables(new EnvironmentVariable("COMPLUS_GCWriteBarrier", "3")));
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithEnvironmentVariables(new EnvironmentVariable("DOTNET_GCName", "clrgc.dll"), new EnvironmentVariable("COMPLUS_GCWriteBarrier", "3")));
        }
    }

    const string Input = """
        Whose woods these are I think I know.
        His house is in the village though;
        He will not see me stopping here
        To watch his woods fill up with snow.
        My little horse must think it queer
        To stop without a farmhouse near
        Between the woods and frozen lake
        The darkest evening of the year.
        He gives his harness bells a shake
        To ask if there is some mistake.
        The only other sound’s the sweep
        Of easy wind and downy flake.
        The woods are lovely, dark and deep,
        But I have promises to keep,
        And miles to go before I sleep,
        And miles to go before I sleep.
        """;
    private char[] _chars = Input.ToCharArray();

    [Benchmark]
    public string WithContent1() => new string(_chars);

    [Benchmark]
    public string WithContent2() => string.Create(Input.Length, Input, (dest, state) => state.AsSpan().CopyTo(dest));

    [Benchmark]
    public string WithoutContent1() => string.Create(Input.Length, Input, (dest, state) => { });

    [Benchmark]
    public string WithoutContent2() => new string('\0', Input.Length);
}
AndyAyersMS commented 2 years ago

Seeing possibly similar issues over in #64626 (ignore the PGO aspect; we're regressed even w/o PGO).

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

Issue Details
Moving discussion from the PR https://github.com/dotnet/runtime/pull/67049 @gfoidl, at least on my machine, comparing string.Replace in .NET 6 vs .NET 7, multiple examples I've tried have shown .NET 7 to have regressed, e.g. ```c# const string Input = """ Whose woods these are I think I know. His house is in the village though; He will not see me stopping here To watch his woods fill up with snow. My little horse must think it queer To stop without a farmhouse near Between the woods and frozen lake The darkest evening of the year. He gives his harness bells a shake To ask if there is some mistake. The only other sound’s the sweep Of easy wind and downy flake. The woods are lovely, dark and deep, But I have promises to keep, And miles to go before I sleep, And miles to go before I sleep. """; [Benchmark] public string Replace() => Input.Replace('I', 'U'); ``` Method Runtime Mean Ratio Replace .NET 6.0 108.1 ns 1.00 Replace .NET 7.0 136.0 ns 1.26 Do you see otherwise? @gfoidl [gfoidl](https://github.com/gfoidl) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229429727) Hm, that is not expected... When i duplicate the string.Replace(char, char)-method in order to compare the old and the new implementation both on .NET 7 then I see BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1889 (21H1/May2021Update) Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores .NET SDK=7.0.100-preview.7.22377.5 [Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT DefaultJob : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT Method Mean Error StdDev Median Ratio RatioSD Default 142.0 ns 3.48 ns 9.98 ns 138.6 ns 1.00 0.00 PR 132.9 ns 2.68 ns 3.40 ns 132.8 ns 0.92 0.07 so a result I'd expect, as after the vectorized loop 6 chars are remaining that the old-code processes in the for-loop whilst the new-code does one vectorized pass. I checked the dasm (via DisassemblyDiagnoser of BDN) and that looks OK. Can this be something from different machine-code layout (loops), PGO, etc. that causes the difference between .NET 6 and .NET 7? How can I investigate this further -- need some guidance on how to check code-layout please. @stephentoub [stephentoub](https://github.com/stephentoub) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229431842) • Thanks, @gfoidl. Do you see a similar 6 vs 7 difference as I do? (It might not be specific to this PR.) @EgorBo, can you advise? @tannergooding [tannergooding](https://github.com/tannergooding) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229492295) When i duplicate the string.Replace(char, char)-method in order to compare the old and the new implementation both on .NET 7 then I see This could be related to stale PGO data @danmoseley [danmoseley](https://github.com/danmoseley) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229507483) Is there POGO data en-route that has trained with this change in place? I am not sure how to follow it. @danmoseley [danmoseley](https://github.com/danmoseley) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229507793) Also, it wouldn't matter here, but are we consuming POGO data trained on main bits in the release branches? @stephentoub [stephentoub](https://github.com/stephentoub) commented [yesterday](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229508682) • I don't think this particular case is related to stale PGO data. I set COMPlus_JitDisablePGO=1, and I still see an ~20% regression from .NET 6 to .NET 7. @danmoseley [danmoseley](https://github.com/danmoseley) commented [21 hours ago](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229552351) • I ran the example above with var config = DefaultConfig.Instance .AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1")) .AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1")) .AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1")) .AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1")) .AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0")) .AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0")) .AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0").AsBaseline()) .AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0")); BenchmarkRunner.Run(typeof(Program).Assembly, args: args, config: config); and got BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2) Intel Core i7-10510U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores .NET SDK=7.0.100-rc.2.22426.5 [Host] : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2 Job-DGTURM : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2 Job-PYGDYG : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2 Job-ZEPFOF : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2 Job-PSEWWK : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256 Job-WGVIGL : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2 Job-HBSVYM : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2 Job-VWWZUC : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2 Job-LDCOEC : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256 | Method | EnvironmentVariables | Runtime | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio | |-------- |------------------------ |------------------- |---------:|---------:|---------:|---------:|------:|--------:|-------:|----------:|------------:| | Replace | COMPlus_JitDisablePGO=0 | .NET 6.0 | 130.5 ns | 6.76 ns | 18.51 ns | 124.0 ns | 0.92 | 0.17 | 0.3269 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=0 | .NET 7.0 | 144.0 ns | 2.95 ns | 5.54 ns | 142.5 ns | 1.00 | 0.00 | 0.3271 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=0 | .NET Core 3.1 | 822.1 ns | 16.09 ns | 23.07 ns | 814.0 ns | 5.69 | 0.31 | 0.3262 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=0 | .NET Framework 4.8 | 750.2 ns | 28.86 ns | 82.82 ns | 730.3 ns | 4.97 | 0.49 | 0.3262 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1 | .NET 6.0 | 127.1 ns | 2.64 ns | 4.75 ns | 126.4 ns | 0.88 | 0.05 | 0.3269 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1 | .NET 7.0 | 144.5 ns | 2.96 ns | 5.97 ns | 144.1 ns | 1.01 | 0.06 | 0.3271 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1 | .NET Core 3.1 | 936.2 ns | 17.96 ns | 22.06 ns | 931.9 ns | 6.50 | 0.37 | 0.3262 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1 | .NET Framework 4.8 | 673.2 ns | 12.41 ns | 23.91 ns | 670.5 ns | 4.68 | 0.23 | 0.3262 | 1.34 KB | 1.00 | code https://gist.github.com/danmoseley/c31bc023d6ec671efebff7352e3b3251 (should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS ) @danmoseley [danmoseley](https://github.com/danmoseley) commented [21 hours ago](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229555272) and just for interest | Method | EnvironmentVariables | Runtime | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio | |-------- |---------------------------------------------------- |--------- |-----------:|---------:|----------:|-----------:|------:|--------:|-------:|----------:|------------:| | Replace | COMPlus_JitDisablePGO=1 | .NET 6.0 | 127.8 ns | 2.55 ns | 5.91 ns | 125.8 ns | 0.95 | 0.05 | 0.3266 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1 | .NET 7.0 | 141.0 ns | 2.73 ns | 2.42 ns | 141.1 ns | 1.00 | 0.00 | 0.3271 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 | .NET 6.0 | 163.9 ns | 3.35 ns | 4.81 ns | 163.8 ns | 1.15 | 0.05 | 0.3269 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 | .NET 7.0 | 184.9 ns | 3.59 ns | 4.79 ns | 183.7 ns | 1.32 | 0.05 | 0.3271 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 | .NET 6.0 | 176.1 ns | 3.44 ns | 4.09 ns | 175.9 ns | 1.25 | 0.03 | 0.3269 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 | .NET 7.0 | 192.1 ns | 3.81 ns | 4.53 ns | 190.1 ns | 1.37 | 0.05 | 0.3271 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 | .NET 6.0 | 1,057.4 ns | 20.95 ns | 40.86 ns | 1,047.2 ns | 7.65 | 0.35 | 0.3262 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 | .NET 7.0 | 947.1 ns | 13.34 ns | 11.83 ns | 948.3 ns | 6.72 | 0.15 | 0.3262 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 | .NET 6.0 | 496.0 ns | 51.61 ns | 152.17 ns | 463.3 ns | 3.67 | 1.67 | 0.3269 | 1.34 KB | 1.00 | | Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 | .NET 7.0 | 395.3 ns | 14.32 ns | 41.10 ns | 388.4 ns | 2.95 | 0.27 | 0.3271 | 1.34 KB | 1.00 | @gfoidl [gfoidl](https://github.com/gfoidl) commented [9 hours ago](https://github.com/dotnet/runtime/pull/67049#issuecomment-1229954693) Do you see a similar 6 vs 7 difference as I do? Yes (sorry for slow response, was Sunday...). @danmoseley thanks for your numbers. This is the machine code I get (from BDN) when run @danmoseley's benchmark (.NET 7 only). Left there some comments. ```asm ; Program.Replace() mov rcx,1C003C090A0 mov rcx,[rcx] mov edx,49 mov r8d,55 jmp qword ptr [7FFEFA7430C0] ; Total bytes of code 30 ; System.String.Replace(Char, Char) push r15 push r14 push rdi push rsi push rbp push rbx sub rsp,28 vzeroupper mov rsi,rcx mov edi,edx mov ebx,r8d movzx ecx,di movzx r8d,bx cmp ecx,r8d je near ptr M01_L09 lea rcx,[rsi+0C] mov r8d,[rsi+8] movsx rdx,di call qword ptr [7FFEFA7433C0] mov ebp,eax test ebp,ebp jge short M01_L00 mov rax,rsi ; uncommon case, could jump to M01_L09 instead vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret M01_L00: mov ecx,[rsi+8] sub ecx,ebp mov r14d,ecx mov ecx,[rsi+8] call System.String.FastAllocateString(Int32) mov r15,rax test ebp,ebp jg near ptr M01_L10 ; should be common path, I don't expect to jump to the end, then back to here M01_L01: mov eax,ebp lea rax,[rsi+rax*2+0C] cmp [r15],r15b mov edx,ebp lea rdx,[r15+rdx*2+0C] xor ecx,ecx cmp dword ptr [rsi+8],10 jl near ptr M01_L07 movzx r8d,di imul r8d,10001 ; this is tracked in https://github.com/dotnet/runtime/issues/67038, .NET 6 has the same issue, so no difference expected vmovd xmm0,r8d vpbroadcastd ymm0,xmm0 ; should be vpbroadcastb, see comment above movzx r8d,bx imul r8d,10001 vmovd xmm1,r8d vpbroadcastd ymm1,xmm1 ; vpbroadcastb (see above) cmp r14,10 jbe short M01_L03 add r14,0FFFFFFFFFFFFFFF0 M01_L02: lea r8,[rax+rcx*2] vmovupd ymm2,[r8] vpcmpeqw ymm3,ymm2,ymm0 vpand ymm4,ymm3,ymm1 ; the vpand, vpandn, vpor series should be vpblendvb, https://github.com/dotnet/runtime/issues/67039 tracked this vpandn ymm2,ymm3,ymm2 ; the "duplicated code for string.Replace" method emits vpblendvb as expected, but vpor ymm2,ymm4,ymm2 ; if string.Replace from .NET 7.0.0 (7.0.22.42212) (.NET SDK=7.0.100-rc.2.22426.5) is used, then it's this series lea r8,[rdx+rcx*2] vmovupd [r8],ymm2 add rcx,10 cmp rcx,r14 jb short M01_L02 M01_L03: mov ecx,[rsi+8] add ecx,0FFFFFFF0 add rsi,0C lea rsi,[rsi+rcx*2] vmovupd ymm2,[rsi] vpcmpeqw ymm3,ymm2,ymm0 vpand ymm0,ymm3,ymm1 vpandn ymm1,ymm3,ymm2 vpor ymm2,ymm0,ymm1 lea rax,[r15+0C] lea rax,[rax+rcx*2] vmovupd [rax],ymm2 jmp short M01_L08 M01_L04: movzx r8d,word ptr [rax+rcx*2] lea r9,[rdx+rcx*2] movzx r10d,di cmp r8d,r10d je short M01_L05 ; not relevant for .NET 6 -> .NET 7 comparison in this test-case, but jmp short M01_L06 ; one jump could be avoided?! M01_L05: movzx r8d,bx M01_L06: mov [r9],r8w inc rcx M01_L07: cmp rcx,r14 jb short M01_L04 M01_L08: mov rax,r15 vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret M01_L09: ; expect the mov rax,{r15,rsi} the epilogs are the same, can they be collapsed to mov rax,rsi ; get less machine code? vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret M01_L10: ; this block should be common enough, so should be on the jump-root (see comment above) cmp [r15],r15b ; it's the Memmove-call lea rcx,[r15+0C] lea rdx,[rsi+0C] mov r8d,ebp add r8,r8 call qword ptr [7FFEFA7399F0] jmp near ptr M01_L01 ; Total bytes of code 383 ``` So from code-layout one major difference to .NET 6 is that the call to System.Buffer.Memmove is moved out of the hot-path. But I doubt that this allone is the cause for the regression. I also wonder why vpblendvb is gone when using string.Replace in the benchmark from .NET-bits. If I use a string.Replace-duplicated code for the benchmark, then it's emitted which is what I expect as https://github.com/dotnet/runtime/commit/10d8a36ab669ac95f554e5efcc3c8780b5c50f11 got merged on 2022-05-25. But that shouldn't cause the regression either, as for .NET 6 the same series of vector-instruction are emitted. The beginning of the method, right after the prolog, looks different between .NET 6 and .NET 7, although this PR didn't change anything here. I don't expect that this causes the regression, as with the given input the vectorized loop with 33 iterations should be dominant enough (just my feeling, maybe wrong). So far the "static analysis", but I doubt this is enough. With Intel VTune I see some results, but with my interpretation the conclusions are just the same as stated in this comment. I hope some JIT experts can shed some light on this (and give some advices on how to investigate, as I'm eager to learn).
Machine code for .NET 6 (for reference) ```asm ; System.String.Replace(Char, Char) push r15 push r14 push rdi push rsi push rbp push rbx sub rsp,28 vzeroupper mov rsi,rcx movzx edi,dx movzx ebx,r8w cmp edi,ebx jne short M01_L00 mov rax,rsi vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret M01_L00: lea rbp,[rsi+0C] mov rcx,rbp mov r14d,[rsi+8] mov r8d,r14d mov edx,edi call System.SpanHelpers.IndexOf(Char ByRef, Char, Int32) mov r15d,eax test r15d,r15d jge short M01_L01 mov rax,rsi vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret M01_L01: mov esi,r14d sub esi,r15d mov ecx,r14d call System.String.FastAllocateString(Int32) mov r14,rax test r15d,r15d jle short M01_L02 cmp [r14],r14d lea rcx,[r14+0C] mov rdx,rbp mov r8d,r15d add r8,r8 call System.Buffer.Memmove(Byte ByRef, Byte ByRef, UIntPtr) M01_L02: movsxd rax,r15d add rax,rax add rbp,rax cmp [r14],r14d lea rdx,[r14+0C] add rdx,rax cmp esi,10 jl short M01_L04 imul eax,edi,10001 vmovd xmm0,eax vpbroadcastd ymm0,xmm0 imul eax,ebx,10001 vmovd xmm1,eax vpbroadcastd ymm1,xmm1 M01_L03: vmovupd ymm2,[rbp] vpcmpeqw ymm3,ymm2,ymm0 vpand ymm4,ymm1,ymm3 vpandn ymm2,ymm3,ymm2 vpor ymm2,ymm4,ymm2 vmovupd [rdx],ymm2 add rbp,20 add rdx,20 add esi,0FFFFFFF0 cmp esi,10 jge short M01_L03 M01_L04: test esi,esi jle short M01_L08 nop word ptr [rax+rax] M01_L05: movzx eax,word ptr [rbp] mov rcx,rdx cmp eax,edi je short M01_L06 jmp short M01_L07 M01_L06: mov eax,ebx M01_L07: mov [rcx],ax add rbp,2 add rdx,2 dec esi test esi,esi jg short M01_L05 M01_L08: mov rax,r14 vzeroupper add rsp,28 pop rbx pop rbp pop rsi pop rdi pop r14 pop r15 ret ; Total bytes of code 307 ```
@AndyAyersMS [AndyAyersMS](https://github.com/AndyAyersMS) commented [2 hours ago](https://github.com/dotnet/runtime/pull/67049#issuecomment-1230539420) (should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS ) Hard to say without looking deeper -- from the .NET 7 code above I would guess PGO is driving the code layout changes. For the .NET 7 you can use DOTNET_JitDIsasm in BDN to obtain the jit disasm which will tell you if there was PGO found (at least for the root method).
Author: danmoseley
Assignees: -
Labels: `blocking-release`, `tenet-performance`, `area-GC-coreclr`, `regression-from-last-release`
Milestone: 7.0.0
mrsharm commented 2 years ago

Repeated the WithContent experiment in a GC centric way where we hardcoded the iterations, reduced the number of forced induced GC by BDN and prevented the removal of outliers between segments, regions and regions with a more precise write barrier and concluded similar results as above:

  1. Regions and Regions w/ the Write Barrier regressed.
  2. The Write Barrier regressed more than the Regions case.
  3. Page faults are higher in the case of both comparands.

Next steps: We are tracking decommit issues here: https://github.com/dotnet/runtime/issues/73592 and since the diagnosis of this regression is perceivably linked to that issue that @PeterSolMS is currently working on fixing, I'm moving the milestone to 8.0; Please let me know if that's an incorrect follow-up.

CC: @dotnet/gc

Data / Details

BenchmarkDotNet=v0.13.1.1847-nightly, OS=Windows 11 (10.0.22000.856/21H2)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host]     : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
  Job-YUBOFT : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2

Comparisons

Benchmark Name Baseline Comparand Baseline Mean Duration Comparand Mean Duration Δ Mean Duration Δ% Mean Duration Baseline number of iterations Comparand number of iterations Δ number of iterations Δ% number of iterations Baseline gc count Comparand gc count Δ gc count Δ% gc count Baseline median Comparand median Δ median Δ% median Baseline non induced gc count Comparand non induced gc count Δ non induced gc count Δ% non induced gc count Baseline total allocated (mb) Comparand total allocated (mb) Δ total allocated (mb) Δ% total allocated (mb) Baseline total pause time (msec) Comparand total pause time (msec) Δ total pause time (msec) Δ% total pause time (msec) Baseline gc pause time % Comparand gc pause time % Δ gc pause time % Δ% gc pause time % Baseline PageFault Comparand PageFault Δ PageFault Δ% PageFault
System.Tests.Perf_String.WithContent1 baseline writebarrier_3 209.38 675 465.62 222.39 64 64 0 0 4 4 0 0 0 100 100 0 0 0 NaN 1.91 1.92 0 0.08 1.52 1.52 -0 -0 2.04 2.44 0.4 19.42 11 13 2 18.18
System.Tests.Perf_String.WithContent1 baseline regions 209.38 389.06 179.69 85.82 64 64 0 0 4 4 0 0 0 0 0 NaN 0 0 0 NaN 1.91 1.92 0 0.09 1.52 1.52 -0 -0.01 2.04 2.2 0.16 7.86 11 16 5 45.45

Individual Results

Baseline: Segments

OutlierMode=DontRemove EnvironmentVariables=COMPlus_GCName=clrgc.dll PowerPlanMode=00000000-0000-0000-0000-000000000000
Force=False InvocationCount=1 IterationCount=64
IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15
UnrollFactor=1 WarmupCount=1

Method Mean Error StdDev StdErr Median Min Max Q1 Q3 Op/s Allocated
WithContent1 209.4 ns 198.4 ns 459.7 ns 57.46 ns 0.0 ns 0.0 ns 2,200.0 ns 0.0 ns 100.0 ns 4,776,119.4 1.79 KB

Regions

OutlierMode=DontRemove PowerPlanMode=00000000-0000-0000-0000-000000000000 Force=False
InvocationCount=1 IterationCount=64 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 UnrollFactor=1
WarmupCount=1

Method Mean Error StdDev StdErr Median Min Max Q1 Q3 Op/s Allocated
WithContent1 389.1 ns 356.7 ns 826.8 ns 103.3 ns 0.0 ns 0.0 ns 2,500.0 ns 0.0 ns 0.0 ns 2,570,281.1 1.79 KB

Regions w/ A More Precise Write Barrier

OutlierMode=DontRemove EnvironmentVariables=COMPlus_GCWriteBarrier=3,COMPlus_EnableWriteXorExecute=0 PowerPlanMode=00000000-0000-0000-0000-000000000000
Force=False InvocationCount=1 IterationCount=64
IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15
UnrollFactor=1 WarmupCount=1

Method Mean Error StdDev StdErr Median Min Max Q1 Q3 Op/s Allocated
WithContent1 675.0 ns 833.7 ns 1,932.3 ns 241.5 ns 100.0 ns 0.0 ns 14,400.0 ns 100.0 ns 125.0 ns 1,481,481.5 1.79 KB
danmoseley commented 2 years ago

I'm moving the milestone to 8.0; Please let me know if that's an incorrect follow-up.

@mrsharm the latest data comparing 6.0 to 7.0 (with standard settings) seems to show significant regressions. I don't think we want to willingly ship with these regressions. I think we need to keep this in .NET 7 milestone and the GC path, one way or another.

Given we don't want to regress, are these our options for .NET 7?

  1. find a GC fix
  2. revert the original change as a workaround (this assumes the old code isn't equally impacted by GC changes,) and solve in .NET 8
  3. prove that this regression does not show up in real code, only in benchmarks.
stephentoub commented 2 years ago

revert the original change as a workaround

What change? The most recent benchmark examples aren't using Replace.

Maoni0 commented 2 years ago

this should be investigated to understand what exactly is causing the regression before we move the milestone to 8.0. last time I was going to investigate these regressions with Moko we discovered that BDN was doing things very differently between 2 runs so he's been focusing on making the runs repeatable. so I'm going to take another look again with him.

danmoseley commented 2 years ago

What change? The most recent benchmark examples aren't using Replace.

Ah, I didn't see that. Scratch that option then. So we should either confirm it's not something real code would see, or find a fix.

stephentoub commented 2 years ago

Others are noticing this as well, e.g. https://twitter.com/realDotNetDave/status/1569724335220088832

mrsharm commented 2 years ago

We have been working through root-causing and fixing this issue and here are our conclusions:

  1. We root caused this issue to the introduction of a higher number and size of decommits for regions (enabled in 7.0) as opposed to segments (default in 6.0).
  2. We are working on a fix: https://github.com/dotnet/runtime/pull/73620 that improves the numbers for these benchmarks, however, even with this fix we are not at the same level as segments in terms of mean execution time, number and size of virtual commits and decommits as before and this is something we plan to further address.
  3. We think the next steps here are to merge the fix that improves the regression and work on further improvements in 8.0. More specifically, we should merge the fix mentioned above and move the issue to 8.0 (cc: @mangod9)
  4. We found that running the benchmarks above using default parameters results in an inequitable for our GC evaluation comparison purposes where:
    • There are an unequal number of invocations per iteration resulting in a lob sided effect on the mean. Fixed by passing in a derived invocation count by doing a dry run and making use of that invocation count for all our runs.
    • Outliers are removed. Fixed by not removing outliers by setting the following in the command line args to run the microbenchmarks --outliers DontRemove.
    • There are a variable number of induced GCs resulting from the unequal number of invocations and evaluation overhead runs. Fixed by removing the evaluation overhead stage.
  5. Using a more equitable comparison repeated a number of times, we consistently observed the following:
    • The regressions for the tests for regions were never above 10%.
    • More consistent results that were less volatile.

Any thoughts / concerns here?

Results

Benchmark Baseline Comparand ΔMean ΔMean %
System.Tests.Perf_String.Replace_Char_Custom segments regions 8.04 9.28
System.Tests.Perf_String.Replace_Char_Custom segments decommit_fix 4.84 5.72
Benchmark Baseline Comparand ΔMean ΔMean %
System.Tests.Perf_String.WithContent1 segments regions 3.21 4.44
System.Tests.Perf_String.WithContent1 segments decommit_fix 1.84 2.55
Benchmark Baseline Comparand ΔMean ΔMean %
System.Tests.Perf_String.WithoutContent2 segments regions 2.78 5.26
System.Tests.Perf_String.WithoutContent2 segments decommit_fix 1.43 2.7

Details

Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host]     : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
  Job-OERVVK : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Full comparison table with the virtual commit and decommit data: Compare.xlsx

CC: @dotnet/gc

mangod9 commented 2 years ago

Thanks for the analysis @mrsharm. If the overall regression < 5% with the decommit fix we should be able to make further improvements in 8. @stephentoub would you agree?

danmoseley commented 2 years ago

@mrsharm

  1. is it possible to characterize which types of scenarios are affected by this regression? What do they have in common?
  2. Will customers experience this full regression or do we have reason to believe it's worse in benchmarking situations?

10% can be a lot, so it seems important to know what customers will see and where.

mrsharm commented 2 years ago

is it possible to characterize which types of scenarios are affected by this regression? What do they have in common?

The common symptoms are higher virtual commits and decommits - both in terms of bytes and the number of calls. We are tracking all other regressed benchmarks with the same root cause here: https://github.com/dotnet/runtime/issues/73592

2. Will customers experience this full regression or do we have reason to believe it's worse in benchmarking situations?

Other members of @dotnet/gc can chime in here for more details however, we do believe that we are doing better for actual customer scenarios based on our perf testing. Additionally, we do believe that prospective updates will improve GC pause times even further.

stephentoub commented 2 years ago

@mrsharm, I'm a little confused. https://github.com/dotnet/runtime/pull/73620 is about workstation GC, right? Does it also apply to server GC? All the regressions I've measured have been with server GC.

mangod9 commented 2 years ago

correct, the current fix is for wks, but @PeterSolMS @Maoni0 are working on a change for svr. @stephentoub, assume you observe regression for wks as well, or is it with svr only?

stephentoub commented 2 years ago

assume you observe regression for wks as well, or is it with svr only?

I'd only measured with server.

Maoni0 commented 2 years ago

we didn't try the server GC fix on these since we didn't know you were running this with Server GC. how many heaps are you running this with? would it be possible that you could take a GCCollectOnly trace when you running it?

perfview /nogui /accepteula /GCCollectOnly collect test.etl

stephentoub commented 2 years ago

we didn't try the server GC fix on these since we didn't know you were running this with Server GC. how many heaps are you running this with? would it be possible that you could take a GCCollectOnly trace when you running it?

This is on a 12 logical core, so I assume 12 (I've not overridden any defaults, other than having <ServerGarbageCollection>true</ServerGarbageCollection> in my csproj). These are all trivial microbenchmarks, with the effect visible in or out of benchmarkdotnet. For example, this:

using System.Diagnostics;
using System.Runtime;

internal class Program
{
    static void Main()
    {
        Console.WriteLine($"IsServerGC: {GCSettings.IsServerGC}");
        var sw = new Stopwatch();
        for (int trials = 0; trials < 10; trials++)
        {
            sw.Restart();
            for (int i = 0; i < 100_000_000; i++)
            {
                _ = new string('\0', 256);
            }
            sw.Stop();
            Console.WriteLine(sw.Elapsed.TotalMilliseconds);
        }
    }
}

outputs the following on .NET 6:

IsServerGC: True
3541.6588
3853.6238
3792.7101
3909.1076
3854.9526
3445.2005
3344.1826
3343.1901
3331.0722
3410.051

and the following on .NET 7:

IsServerGC: True
4627.28
4488.4879
4512.589
4499.0618
5184.8121
4649.4892
4735.7333
4693.8634
4869.3155
4920.06

(obviously times vary a bit from run to run, but the relative size of the gap remains).

I can send you a GCCollectOnly trace if you can't repro it. I assumed from all of the comments from @mrsharm earlier that it was easily reproed.

Maoni0 commented 2 years ago

I assumed from all of the comments from @mrsharm earlier that it was easily reproed.

he never attempted to repro it with server GC since none of us knew you ran this with server GC :) and this is a very unusual scenario for running server GC with (we don't run microbenchmarks with server gc because they are in general tiny things). but yes, we could repro it after we tried it so we don't need a trace.

stephentoub commented 2 years ago

he never attempted to repro it with server GC since none of us knew you ran this with server GC :) and this is a very unusual scenario for running server GC with (we don't run microbenchmarks with server gc because they are in general tiny things)

Except for specific scenarios that demand otherwise, I always benchmark with server GC, since that's the default ASP.NET configuration.

Maoni0 commented 2 years ago

keep in mind that asp.net does not do all these induced GCs - these induced GCs that BDN does can dramatically change the memory perf behavior.

stephentoub commented 2 years ago

It's why I also replicate findings without benchmarkdotnet, e.g. in the simple console app shown earlier where there aren't any GC.Collect calls.

Maoni0 commented 2 years ago

so I took a look at the test @stephentoub had running in BDN and standalone. the reasons for the regression is very different. for the BDN scenario we are simply doing a ton of decommitting with regions because BDN induced a lot of gen2 blocking GCs which meant we often got a new gen1 region that's fully committed so we ended up always decommitting the end of it. so it's more an artifact of running BDN. for the standalone case we do a bit of decommitting, the major regression comes from memset costing more. Peter attempted a fix for it but we don't understand the microarchitectural effect what we are seeing there for me to be comfortable to merge it. and the fix makes the segment case improve just as much so regions is not better off :D I'm also seeing higher GC cost but this could totally be due to the fact it's a microbenchmark. I'm not so worried about this because we already validated the GC part a bunch.

the summary is we are not trying to get anything in for this this week. we might try to get something in next week if we understand the microarchitectural issue but the likelihood is we'll make some fixes with the 1st servicing release.

mangod9 commented 2 years ago

removing release-blocking tag, we will possibly consider for servicing

mrsharm commented 1 year ago

We have improved the microbenchmark numbers for a number of benchmarks in NET 8 and are tracking: https://github.com/dotnet/runtime/issues/73592

Closing this issue as we are tracking this with the aforementioned issue.