[Perf] Linux/x64: 4 Regressions on 2/25/2024 4:37:10 PM

performanceautofiler[bot] commented 7 months ago

Run Information

Name	Value
Architecture	x64
OS	ubuntu 22.04
Queue	TigerUbuntu
Baseline	f32c428c86b4cc41e88e2e5a750c37dfb354e33a
Compare	5ef47c852ffd51aaeb52e34391fe4fed261c9f26
Diff	Diff
Configs	CompilationMode:tiered, RunKind:micro

Regressions in System.Text.Tests.Perf_StringBuilder

Benchmark	Baseline	Test	Test/Base	Test Quality	Edge Detector	Baseline IR	Compare IR	IR Ratio
[ToString_MultipleSegments - Duration of single invocation](<https://pvscmdupload.z22.web.core.windows.net/reports/allTestHistory/refs/heads/main_x64_ubuntu 22.04/System.Text.Tests.Perf_StringBuilder.ToString_MultipleSegments(length%3a%20100000).html>) 📝 - Benchmark Source 📈 - ADX Test Multi Config Graph	87.46 μs	101.66 μs	1.16	0.00	False


Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'System.Text.Tests.Perf_StringBuilder*'

### Payloads

[Baseline]() [Compare]()

### System.Text.Tests.Perf_StringBuilder.ToString_MultipleSegments(length: 100000)

#### ETL Files

#### Histogram

#### JIT Disasms

### Docs

[Profiling workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/profiling-workflow-dotnet-runtime.md)
[Benchmarking workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md)

DrewScoggins commented 7 months ago

From https://github.com/dotnet/runtime/pull/98623

Improvements: https://github.com/dotnet/perf-autofiling-issues/issues/29825 https://github.com/dotnet/perf-autofiling-issues/issues/29808

mangod9 commented 3 months ago

@EgorBo is this by design, so it could be closed?

EgorBo commented 3 months ago

@EgorBot -amd64 -intel -perf -commit 79dd9bae9bb881eb716b608577c4cedc6c9cba72 vs fab69efde7d2458dfb23e01b686842cf6d7f576d --disasm

```cs
using BenchmarkDotNet.Attributes;
using System.Text;

public class MyProgram
{
    const int LOHAllocatedStringSize = 100_000;
    private string _stringLOH;
    private string _string100;
    private StringBuilder _builderSingleSegment100;
    private StringBuilder _builderSingleSegmentLOH;
    private StringBuilder _builderMultipleSegments100;
    private StringBuilder _builderMultipleSegmentsLOH;

    [GlobalSetup(Target = nameof(ToString_MultipleSegments))]
    public void Setup_ToString_MultipleSegments()
    {
        _builderMultipleSegments100 = Append_Char(100); // 16 + 32 + 48 + 96 char segments
        _builderMultipleSegmentsLOH = Append_Char(LOHAllocatedStringSize);
    }

    public StringBuilder Append_Char(int length)
    {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < length; i++)
            builder.Append('a');
        return builder;
    }

    [Benchmark]
    [Arguments(LOHAllocatedStringSize)]
    public string ToString_MultipleSegments(int length) => (length == 100 ? _builderMultipleSegments100 : _builderMultipleSegmentsLOH).ToString();
}
```

EgorBot commented 3 months ago

Benchmark results on Intel

```
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-JXBPJF : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-BMGTJF : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
```

| Method                    | Toolchain | length | Mean      | Error    | Ratio | Code Size |
|-------------------------- |---------- |------- |----------:|---------:|------:|----------:|
| ToString_MultipleSegments | Main      | 100000 |  72.87 μs | 1.077 μs |  1.00 |     262 B |
| ToString_MultipleSegments | PR        | 100000 | 110.64 μs | 1.693 μs |  1.52 |     262 B |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_8c5470c6.zip)

Flame graphs: [Main](https://telegafiles.blob.core.windows.net/telega/base_flamegraph_8c5470c6.svg) vs [PR](https://telegafiles.blob.core.windows.net/telega/diff_flamegraph_8c5470c6.svg) 🔥

Hot asm: [Main](https://gist.github.com/EgorBot/758f50b6ad0a619a3543cd147b1c36f2) vs [PR](https://gist.github.com/EgorBot/10da3a0af43f0cc9d1e4b0ea7e9c9aa1)

Hot functions: [Main](https://gist.github.com/EgorBot/5ae4bb35e29faabeac28dd2cd8579223) vs [PR](https://gist.github.com/EgorBot/3f163523a5e941e99316e36ebacd3ea0)

_For clean `perf` results, make sure you have just one `[Benchmark]` in your app._

EgorBot commented 3 months ago

Benchmark results on Amd

```
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 16 logical and 8 physical cores
  Job-CNQFTW : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-FMFWZL : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
```

| Method                    | Toolchain | length | Mean     | Error   | Ratio | Code Size |
|-------------------------- |---------- |------- |---------:|--------:|------:|----------:|
| ToString_MultipleSegments | Main      | 100000 | 110.0 μs | 0.37 μs |  1.00 |     262 B |
| ToString_MultipleSegments | PR        | 100000 | 145.6 μs | 0.31 μs |  1.32 |     262 B |

[BDN_Artifacts.zip](https://telegafiles.blob.core.windows.net/telega/BDN_Artifacts_e88c1ecc.zip)

Flame graphs: [Main](https://telegafiles.blob.core.windows.net/telega/base_flamegraph_e88c1ecc.svg) vs [PR](https://telegafiles.blob.core.windows.net/telega/diff_flamegraph_e88c1ecc.svg) 🔥

Hot asm: [Main](https://gist.github.com/EgorBot/4889a04455c2880b685ba127b727e458) vs [PR](file.notfound)

Hot functions: [Main](https://gist.github.com/EgorBot/bc86950e5aecb8add275286ea22984e7) vs [PR](https://gist.github.com/EgorBot/8d00152a352a2131c6e01429e06e2474)

_For clean `perf` results, make sure you have just one `[Benchmark]` in your app._

mangod9 commented 3 months ago

@EgorBo so the regression still exists?

EgorBo commented 3 months ago

> @EgorBo so the regression still exists?

It does, but there is not much we can do here. Previously we called native memmove directly (in coop mode). Now we call the managed SpanHelpers.Memmove instead, which forwards back to native memmove (with a GC transition) for large sizes, so we effectively pay for that wrapper. The extra cost only shows up for large sizes.

Given that there are more improvements (11) than regressions (4), and that the PR:

1) Fixed a bug on NAOT: https://github.com/dotnet/runtime/issues/95517
2) Made Memmove more suspension friendly: https://github.com/dotnet/runtime/issues/98620

we can close it.
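
For readers following the explanation above, here is a minimal sketch of the size-based dispatch it describes: small copies stay in managed code, while large copies are forwarded through a wrapper. This is not the runtime's `SpanHelpers.Memmove`. The class name `MemmoveSketch`, the `Copy` method, and the `LargeCopyThreshold` value are all hypothetical, and `Span.CopyTo` merely stands in for the native `memmove` call.

```csharp
using System;

// Illustrative only: NOT the runtime's SpanHelpers.Memmove.
public static class MemmoveSketch
{
    // Hypothetical cutoff chosen for this example; the thread does not state the real one.
    public const int LargeCopyThreshold = 2048;

    // Returns the name of the path that handled the copy so the dispatch is observable.
    public static string Copy(ReadOnlySpan<char> source, Span<char> destination)
    {
        if (source.Length > destination.Length)
            throw new ArgumentException("Destination is too short.", nameof(destination));

        if (source.Length <= LargeCopyThreshold)
        {
            // Small copy: handled entirely in managed code, no transition cost.
            for (int i = 0; i < source.Length; i++)
                destination[i] = source[i];
            return "managed";
        }

        // Large copy: stand-in for forwarding to native memmove.
        // In the scenario described in the thread, this hop adds a GC transition
        // on top of the copy itself, which is the wrapper cost being paid.
        source.CopyTo(destination);
        return "forwarded";
    }
}
```

Under these assumptions, `MemmoveSketch.Copy(new char[100], new char[100])` takes the managed path, while a 100,000-char copy (the size used by the benchmark) takes the forwarded path, which is where the thread says the regression appears.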
