Closed performanceautofiler[bot] closed 9 months ago
Name | Value |
---|---|
Architecture | x64 |
OS | ubuntu 22.04 |
Queue | TigerUbuntu |
Baseline | da4e544809b3b10b7db8dc170331988d817240d7 |
Compare | f6c592995a4f8526508da33761230a3850d942ff |
Diff | Diff |
Configs | CompilationMode:tiered, LLVM:false, MonoAOT:true, MonoInterpreter:false, RunKind:micro_mono |
| Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio |
|---|---|---|---|---|---|---|---|---|
| | 2.51 μs | 2.91 μs | 1.16 | 0.01 | True | | | |
| | 2.13 μs | 2.53 μs | 1.19 | 0.01 | True | | | |
| | 6.36 μs | 7.08 μs | 1.11 | 0.03 | True | | | |
| | 2.12 μs | 2.55 μs | 1.20 | 0.01 | True | | | |
| | 2.51 μs | 2.92 μs | 1.17 | 0.01 | True | | | |
| | 2.51 μs | 2.92 μs | 1.16 | 0.01 | True | | | |
| | 2.12 μs | 2.55 μs | 1.20 | 0.01 | True | | | |
| | 6.36 μs | 7.37 μs | 1.16 | 0.04 | True | | | |
| | 6.47 μs | 7.15 μs | 1.11 | 0.03 | True | | | |
| | 2.13 μs | 2.53 μs | 1.19 | 0.01 | True | | | |
| | 2.14 μs | 2.54 μs | 1.19 | 0.01 | True | | | |
| | 2.12 μs | 2.54 μs | 1.20 | 0.01 | True | | | |
| | 2.51 μs | 2.92 μs | 1.17 | 0.01 | True | | | |
| | 1.18 μs | 1.34 μs | 1.14 | 0.02 | True | | | |
| | 6.50 μs | 7.09 μs | 1.09 | 0.02 | True | | | |
General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md
Name | Value |
---|---|
Architecture | x64 |
OS | ubuntu 22.04 |
Queue | TigerUbuntu |
Baseline | 9d08b24d743d0c57203a55c3d0c6dd1bc472e57e |
Compare | f6c592995a4f8526508da33761230a3850d942ff |
Diff | Diff |
Configs | CompilationMode:tiered, LLVM:false, MonoAOT:true, MonoInterpreter:false, RunKind:micro_mono |
| Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio |
|---|---|---|---|---|---|---|---|---|
| | 77.58 ns | 111.93 ns | 1.44 | 0.15 | False | | | |
| | 8.39 μs | 12.52 μs | 1.49 | 0.23 | False | | | |
Name | Value |
---|---|
Architecture | x64 |
OS | ubuntu 22.04 |
Queue | TigerUbuntu |
Baseline | da4e544809b3b10b7db8dc170331988d817240d7 |
Compare | f6c592995a4f8526508da33761230a3850d942ff |
Diff | Diff |
Configs | CompilationMode:tiered, LLVM:false, MonoAOT:true, MonoInterpreter:false, RunKind:micro_mono |
| Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio |
|---|---|---|---|---|---|---|---|---|
| | 7.34 μs | 7.71 μs | 1.05 | 0.00 | True | | | |
Regressions are likely caused by https://github.com/dotnet/runtime/pull/91884. Details are in https://github.com/dotnet/runtime/pull/91884#issuecomment-1718416993.
/cc: @MihaZupan
> CompilationMode:tiered, LLVM:false, MonoAOT:true, MonoInterpreter:false, RunKind:micro_mono
I take it this is a configuration without hardware intrinsics support? In that case, yes, it's likely https://github.com/dotnet/runtime/pull/91884
Thanks, I think SIMD intrinsics should be enabled on Mono AOT.
/cc: @fanyang-mono @matouskozak
I think that by default SIMD should be enabled but I'm unsure of the exact config that is running in the lab.
I took a brief look at the PR (https://github.com/dotnet/runtime/pull/91884). The hardware intrinsics required to keep the performance were `Sse41` and `AdvSimd`. The regressions in this issue are on x64, so only `Sse41` is relevant here.
First of all, for Mono, `Sse41` is currently only supported by the LLVM codegen engine.
Secondly, the microbenchmark tests were set up to run on `LLVM AOT with JIT fall back`. If the methods were AOT'ed with LLVM, then yes, the hardware intrinsics would have been emitted.
However, the Mono AOT compiler in normal mode doesn't compile everything, and generics are one such case. `System.Collections.Perf_Frozen<ReferenceType>` clearly uses generics. Thus, it fell back to the JIT, where LLVM is not supported. As a result, the expected `Sse41` intrinsics weren't emitted.
I believe the rest of the microbenchmarks fall into the same category, where they have code patterns that the Mono AOT compiler doesn't compile in normal mode.
I anticipate a similar issue will be reported on `arm64`, where `AdvSimd` is only supported for `Full AOT with LLVM` (not the mode that the microbenchmarks run with).
Because of this current limitation of Mono, I suggest that https://github.com/dotnet/runtime/pull/91884 either be reverted or that the previous code path be added back for Mono only.
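The fallback chain described above can be written down as a toy decision model. This is only a sketch of the reasoning in this comment, not of how the Mono AOT compiler actually decides what to compile: the `Method` type, the `uses_generics` flag, and the engine names are illustrative assumptions, and "uses generics" stands in for any pattern that normal-mode AOT skips.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    uses_generics: bool  # assumption: stand-in for "skipped by normal-mode AOT"


def codegen_engine(method: Method) -> str:
    """Toy rule: methods skipped by LLVM AOT are compiled by the JIT at run time."""
    return "mono-jit" if method.uses_generics else "llvm-aot"


def sse41_emitted(method: Method) -> bool:
    # Per the comment above, Sse41 is only supported by Mono's LLVM codegen engine.
    return codegen_engine(method) == "llvm-aot"


if __name__ == "__main__":
    frozen = Method("System.Collections.Perf_Frozen<ReferenceType>", uses_generics=True)
    plain = Method("HypotheticalNonGenericBenchmark", uses_generics=False)
    print(codegen_engine(frozen), sse41_emitted(frozen))
    print(codegen_engine(plain), sse41_emitted(plain))
```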
Comparing before and after https://github.com/dotnet/runtime/pull/91884, the change alters which scalar loop we're using.
The loops are almost the same; the only difference is in the `Contains` method we call for each position.
In my testing with RyuJIT on x64, the throughput of the two was almost identical.
However, looking at a benchmark like `System.Memory.ReadOnlySpan.IndexOfString(input: "string1", value: "string2", comparisonType: InvariantCulture)` in this issue, the difference reported here is 7x.
Is Mono running into some horrible codegen issue for the after case? I could expect some difference, but something is likely going wrong to get to 7x.
/cc @jeffhandley
@MihaZupan @kotlarmilos @fanyang-mono - The comments in the PR itself are a little confusing. Can you confirm whether this and the other Mono config regressions are caused by https://github.com/dotnet/runtime/pull/91884 or by https://github.com/dotnet/runtime/pull/91887?
We need to decide immediately whether to revert or to special-case the regressed Mono codegen path. If any of the PRs were backported, we need to make sure that is also addressed.
The regressions are about https://github.com/dotnet/runtime/pull/91884. https://github.com/dotnet/runtime/pull/91887 is unrelated. Nothing was backported; we're talking about 9.0 `main` only.
It would be good to understand why this is happening - https://github.com/dotnet/perf-autofiling-issues/issues/21818#issuecomment-1728345461.
I will gather the generated code and report back later.
Impact of this regression was across the Mono engines:
@kotlarmilos helped me confirm that it was https://github.com/dotnet/runtime/pull/91884 which caused this regression.
According to @MihaZupan's analysis of the library code change, it boiled down to the difference between `System.Buffers.BitVector256:Contains256 (char)` and ``System.Buffers.ProbabilisticMap:Contains (System.ReadOnlySpan`1<char>,char)``.
The IL for the two methods is:
```text
method to IR System.Buffers.BitVector256:Contains256 (char)
converting (in B3: stack: 0) IL_0000: ldarg.1
converting (in B3: stack: 1) IL_0001: ldc.i4 256
converting (in B3: stack: 2) IL_0006: bge.s IL_0011
converting (in B5: stack: 0) IL_0008: ldarg.0
converting (in B5: stack: 1) IL_0009: ldarg.1
converting (in B5: stack: 2) IL_000a: call 0x0600370f
cmethod = bool System.Buffers.BitVector256:ContainsUnchecked (int)
converting (in B5: stack: 1) IL_000f: br.s IL_0012
```

```text
method to IR System.Buffers.ProbabilisticMap:Contains (System.ReadOnlySpan`1<char>,char)
converting (in B3: stack: 0) IL_0000: ldarg.0
converting (in B3: stack: 1) IL_0001: call 0x2b00015e
cmethod = char& System.Runtime.InteropServices.MemoryMarshal:GetReference<char> (System.ReadOnlySpan`1<char>)
converting (in B3: stack: 1) IL_0006: call 0x2b000183
cmethod = int16& System.Runtime.CompilerServices.Unsafe:As<char, int16> (char&)
converting (in B3: stack: 1) IL_000b: ldarg.1
converting (in B3: stack: 2) IL_000c: conv.i2
converting (in B3: stack: 2) IL_000d: ldarga.s 0
converting (in B3: stack: 3) IL_000f: call 0x0a000007
cmethod = int System.ReadOnlySpan`1<char>:get_Length ()
converting (in B3: stack: 3) IL_0014: call 0x2b0007d1
cmethod = bool System.SpanHelpers:NonPackedContainsValueType<int16> (int16&,int16,int)
converting (in B3: stack: 1) IL_0019: ret
```
Then I dived into the calls these two methods make and found that `System.SpanHelpers:NonPackedContainsValueType<int16> (int16&,int16,int)` (https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.T.cs#L1318) is a very expensive call, which eventually caused the regression.
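To make the contrast between the two `Contains` shapes concrete, here is a minimal sketch. It is an illustration under assumptions, not the CoreLib code: the eight-word layout of the bit vector and the scalar scan standing in for `NonPackedContainsValueType` are simplifications, and all Python names are hypothetical.

```python
class BitVector256:
    """256-bit membership set stored as eight 32-bit words (illustrative layout)."""

    def __init__(self, values: str) -> None:
        self._words = [0] * 8
        for ch in values:
            code = ord(ch)
            if code < 256:
                self._words[code >> 5] |= 1 << (code & 31)

    def contains256(self, ch: str) -> bool:
        code = ord(ch)
        # Same shape as the IL trace: compare against 256, then an unchecked bit test.
        return code < 256 and self._contains_unchecked(code)

    def _contains_unchecked(self, code: int) -> bool:
        return ((self._words[code >> 5] >> (code & 31)) & 1) == 1


def scan_contains(values: str, ch: str) -> bool:
    """Scalar stand-in for a Contains that searches the whole span of values."""
    for candidate in values:
        if candidate == ch:
            return True
    return False


if __name__ == "__main__":
    values = "abc"
    print(BitVector256(values).contains256("b"), scan_contains(values, "b"))
```

The bit-vector lookup costs the same regardless of how many values are in the set, while the scan grows with the number of values, so the per-call cost of the second shape depends heavily on how well the search itself is compiled.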
The next piece of the puzzle is why the code Mono generates for that method is so slow. I noticed that the method contains vectorization code for `Vector128`, for which Mono has intrinsics support across all codegen engines.
I will report back when I have more information about that.
Correction to my description in https://github.com/dotnet/perf-autofiling-issues/issues/21818#issuecomment-1728345461:
I was describing how the change in https://github.com/dotnet/runtime/pull/91884 would impact `IndexOfAny` paths (given that the regressed benchmark is `IndexOfString`). But the code that actually ends up being used here is `IndexOfAnyExcept`, which uses a meaningfully different code path for the probabilistic case.
Before the change, it would be using `BitVector256.Contains256(char)` as described above, but after the change, it would be using `ProbabilisticMap.IndexOfAnySimpleLoop`. This path is different as it effectively does an `O(n * m)` search, which is bound to be noticeably slower if there are many values present, as is the case here.
The performance here hasn't really been a concern before, as we would normally pick the really fast ASCII implementation that currently works when X86/Arm64/Wasm intrinsics are available (but not "plain `Vector128`").
With that, even if Mono's execution of `NonPackedContainsValueType` was as optimal as possible, it would still see a noticeable regression here.
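The `O(n * m)` behaviour mentioned above can be seen by counting comparisons in a simple nested loop. This is a hedged sketch of the general technique only; the function name and the comparison counter are illustrative and do not reproduce `ProbabilisticMap.IndexOfAnySimpleLoop`.

```python
def index_of_any_except_simple_loop(text: str, values: str) -> tuple[int, int]:
    """Return (index of the first char not in values, or -1; comparisons made)."""
    comparisons = 0
    for i, ch in enumerate(text):
        found = False
        for candidate in values:  # up to m comparisons for each of the n positions
            comparisons += 1
            if candidate == ch:
                found = True
                break
        if not found:
            return i, comparisons
    return -1, comparisons


if __name__ == "__main__":
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    # Every 'z' is matched only by the last of the 26 values.
    print(index_of_any_except_simple_loop("z" * 100, alphabet))  # (-1, 2600)
```

With a lookup that costs the same per character (such as a bit-vector test), the same input would need 100 membership checks rather than 2600 comparisons, which is why more values make the simple loop proportionally slower.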
Given the above, I think we should make sure `IndexOfAnyAsciiSearcher` is supported everywhere we care about perf. This should be the real long-term solution.
We could revert https://github.com/dotnet/runtime/pull/91884 until we can do so, as it may not be trivial.
To make it clear, Mono would need to make `IndexOfAnyAsciiSearcher.IsVectorizationSupported` return true across all the codegen engines to avoid this regression without reverting https://github.com/dotnet/runtime/pull/91884.
The definition of `IndexOfAnyAsciiSearcher.IsVectorizationSupported` is at
https://github.com/dotnet/runtime/blob/aec06846449c331532022b9c01874c80e5a35fc6/src/libraries/System.Private.CoreLib/src/System/SearchValues/IndexOfAnyAsciiSearcher.cs#L35
As I mentioned in my original comment, the current support for `Ssse3` and `AdvSimd.Arm64` is limited.
To clarify, supporting `IndexOfAnyAsciiSearcher` doesn't necessarily mean that `Ssse3`/`AdvSimd.Arm64` should always be supported.
`IsVectorizationSupported` just reflects what the current implementation supports. If Mono always supports `Vector128`, we could make sure that the implementation has paths for `Vector128` in core places like here and here, then flip `IsVectorizationSupported` to be just `Vector128.IsHardwareAccelerated`.
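The proposed flip can be modelled as two predicates over a set of capability flags. This is a sketch under assumptions: the `Capabilities` type and both function names are hypothetical, and the "current" predicate lists only the two instruction sets named in this thread, while the real definition linked above may check more (for example the Wasm intrinsics).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    ssse3: bool = False
    advsimd_arm64: bool = False
    vector128_accelerated: bool = False


def is_vectorization_supported_current(caps: Capabilities) -> bool:
    # Assumption: only the platform-specific intrinsics named in the thread are checked.
    return caps.ssse3 or caps.advsimd_arm64


def is_vectorization_supported_proposed(caps: Capabilities) -> bool:
    # Proposal from the comment above: depend only on Vector128 acceleration.
    return caps.vector128_accelerated


if __name__ == "__main__":
    # Hypothetical configuration: Vector128 is accelerated, Ssse3 is not available.
    caps = Capabilities(vector128_accelerated=True)
    print(is_vectorization_supported_current(caps))   # False
    print(is_vectorization_supported_proposed(caps))  # True
```

Under the proposed predicate, a configuration like this one would take the vectorized ASCII searcher instead of the scalar fallback, provided the implementation really has `Vector128` paths everywhere it needs them.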
Mono indeed supports `Vector128` across all codegen engines. Could you implement that to see if it eliminates the regression here?
@Miha and I discussed offline. We've agreed that his original PR (https://github.com/dotnet/runtime/pull/91884) will be reverted now, and that he will work on adding the `Vector128` code path and make sure it doesn't regress Mono.
I'll run benchmarks on https://github.com/dotnet/runtime/pull/92680, but the performance will depend on Mono's implementation of `Vector128` support. I'll ping you when I have some numbers.
After the revert (https://github.com/dotnet/runtime/pull/92726), we can see that the regressions disappeared for AOT: https://github.com/dotnet/perf-autofiling-issues/issues/22587.
Run Information
Regressions in System.Memory.ReadOnlySpan
Test Report
Repro
Repro Steps
#### Prerequisites (Files either built locally (with build.(sh/cmd)) or downloaded from payload above (if same system setup) (in this order))

- Libraries build extracted to `runtime/artifacts` or build instructions: [Libraries README](https://github.com/dotnet/runtime/blob/main/docs/workflow/building/libraries/README.md) args: `-subset libs+libs.tests -rc release -configuration Release -arch $RunArch -framework net8.0`
- CoreCLR product build extracted to `runtime/artifacts/bin/coreclr/$RunOS.$RunArch.Release`, build instructions: [CoreCLR README](https://github.com/dotnet/runtime/blob/main/docs/workflow/building/coreclr/README.md) args: `-subset clr+libs -rc release -configuration Release -arch $RunArch -framework net8.0`
- AOT MONO build extracted to `runtime/artifacts/bin/mono/$RunOS.$RunArch.Release`, build instructions: [MONO README](https://github.com/dotnet/runtime/blob/main/docs/workflow/building/mono/README.md) args: `-arch $RunArch -os $RunOS -s mono+libs+host+packs -c Release /p:CrossBuild=false /p:MonoLLVMUseCxx11Abi=false`
- Dotnet SDK installed for dotnet commands
- Running commands from the runtime folder

Linux

```bash
# Set $RunDir to the runtime directory
RunDir=`pwd`

# Set the OS, arch, and OSId
RunOS='linux'
RunOSId='linux'
RunArch='x64'

# Create aot directory
mkdir -p $RunDir/artifacts/bin/aot/sgen
mkdir -p $RunDir/artifacts/bin/aot/pack
cp -r $RunDir/artifacts/obj/mono/$RunOS.$RunArch.Release/mono/* $RunDir/artifacts/bin/aot/sgen
cp -r $RunDir/artifacts/bin/microsoft.netcore.app.runtime.$RunOS-$RunArch/Release/* $RunDir/artifacts/bin/aot/pack

# Create Core Root
$RunDir/src/tests/build.sh release $RunArch generatelayoutonly /p:LibrariesConfiguration=Release

# Clone performance
git clone --branch main --depth 1 --quiet https://github.com/dotnet/performance.git $RunDir/performance

# One line run:
python3 $RunDir/performance/scripts/benchmarks_ci.py --csproj $RunDir/performance/src/benchmarks/micro/MicroBenchmarks.csproj --incremental no --architecture $RunArch -f net8.0 --filter 'System.Memory.ReadOnlySpan*' --bdn-artifacts $RunDir/artifacts/BenchmarkDotNet.Artifacts --bdn-arguments="--anyCategories Libraries Runtime --category-exclusion-filter NoAOT NoWASM --runtimes monoaotllvm --aotcompilerpath $RunDir/artifacts/bin/aot/sgen/mini/mono-sgen --customruntimepack $RunDir/artifacts/bin/aot/pack --aotcompilermode llvm --logBuildOutput --generateBinLog"

# Individual Commands:
# Restore
dotnet restore $RunDir/performance/src/benchmarks/micro/MicroBenchmarks.csproj --packages $RunDir/performance/artifacts/packages /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1

# Build
dotnet build $RunDir/performance/src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework net8.0 --no-restore /p:NuGetPackageRoot=$RunDir/performance/artifacts/packages /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1

# Run (the stray quotes around the middle arguments are removed so each flag is passed separately)
dotnet run --project $RunDir/performance/src/benchmarks/micro/MicroBenchmarks.csproj --configuration Release --framework net8.0 --no-restore --no-build -- --filter 'System.Memory.ReadOnlySpan*' --anyCategories Libraries Runtime --category-exclusion-filter NoAOT NoWASM --runtimes monoaotllvm --aotcompilerpath $RunDir/artifacts/bin/aot/sgen/mini/mono-sgen --customruntimepack $RunDir/artifacts/bin/aot/pack --aotcompilermode llvm --logBuildOutput --generateBinLog --artifacts $RunDir/artifacts/BenchmarkDotNet.Artifacts --packages $RunDir/performance/artifacts/packages --buildTimeout 1200
```

Windows

```powershell
# Set $RunDir to the runtime directory
$RunDir="FullPathHere"

# Set the OS, arch, and OSId
$RunOS='windows'
$RunOSId='win'
$RunArch='x64'

# Create aot directory
mkdir $RunDir\artifacts\bin\aot\sgen
mkdir $RunDir\artifacts\bin\aot\pack
xcopy $RunDir\artifacts\obj\mono\$RunOS.$RunArch.Release\mono $RunDir\artifacts\bin\aot\sgen\ /e /y
xcopy $RunDir\artifacts\bin\microsoft.netcore.app.runtime.$RunOSId-$RunArch\Release $RunDir\artifacts\bin\aot\pack\ /e /y

# Create Core Root
& "$RunDir\src\tests\build.cmd" release $RunArch generatelayoutonly /p:LibrariesConfiguration=Release

# Clone performance
git clone --branch main --depth 1 --quiet https://github.com/dotnet/performance.git $RunDir\performance

# One line run:
python3 $RunDir\performance\scripts\benchmarks_ci.py --csproj $RunDir\performance\src\benchmarks\micro\MicroBenchmarks.csproj --incremental no --architecture $RunArch -f net8.0 --filter 'System.Memory.ReadOnlySpan*' --bdn-artifacts $RunDir\artifacts\BenchmarkDotNet.Artifacts --bdn-arguments="--anyCategories Libraries Runtime --category-exclusion-filter NoAOT NoWASM --runtimes monoaotllvm --aotcompilerpath $RunDir\artifacts\bin\aot\sgen\mini\mono-sgen.exe --customruntimepack $RunDir\artifacts\bin\aot\pack --aotcompilermode llvm --logBuildOutput --generateBinLog"

# Individual Commands:
# Restore
dotnet restore $RunDir\performance\src\benchmarks\micro\MicroBenchmarks.csproj --packages $RunDir\performance\artifacts\packages /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1

# Build
dotnet build $RunDir\performance\src\benchmarks\micro\MicroBenchmarks.csproj --configuration Release --framework net8.0 --no-restore /p:NuGetPackageRoot=$RunDir\performance\artifacts\packages /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1

# Run (the stray quotes around the middle arguments are removed so each flag is passed separately)
dotnet run --project $RunDir\performance\src\benchmarks\micro\MicroBenchmarks.csproj --configuration Release --framework net8.0 --no-restore --no-build -- --filter 'System.Memory.ReadOnlySpan*' --anyCategories Libraries Runtime --category-exclusion-filter NoAOT NoWASM --runtimes monoaotllvm --aotcompilerpath $RunDir\artifacts\bin\aot\sgen\mini\mono-sgen.exe --customruntimepack $RunDir\artifacts\bin\aot\pack --aotcompilermode llvm --logBuildOutput --generateBinLog --artifacts $RunDir\artifacts\BenchmarkDotNet.Artifacts --packages $RunDir\performance\artifacts\packages --buildTimeout 1200
```