Open performanceautofiler[bot] opened 1 year ago
Architecture | x64 |
---|---|
OS | ubuntu 18.04 |
Baseline | a8549002b2df6bf64c7c882d8f117630beba762d |
Compare | 8a2b72e3140b14a2ec915345b2747eecdecff943 |
Diff | Diff |
Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
---|---|---|---|---|---|---|---|---|---|---|
[GetHashCodeBenchmark - Duration of single invocation](<https://pvscmdupload.z22.web.core.windows.net/reports/allTestHistory/refs/heads/main_x64_ubuntu 18.04/System.Numerics.Tests.Perf_VectorOf(Int32).GetHashCodeBenchmark.html>) | 20.03 ns | 23.32 ns | 1.16 | 0.02 | False |
git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Numerics.Tests.Perf_VectorOf<Int32>*'
Architecture | x64 |
---|---|
OS | ubuntu 18.04 |
Baseline | a8549002b2df6bf64c7c882d8f117630beba762d |
Compare | 8a2b72e3140b14a2ec915345b2747eecdecff943 |
Diff | Diff |
Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
---|---|---|---|---|---|---|---|---|---|---|
[GetHashCodeBenchmark - Duration of single invocation](<https://pvscmdupload.z22.web.core.windows.net/reports/allTestHistory/refs/heads/main_x64_ubuntu 18.04/System.Runtime.Intrinsics.Tests.Perf_Vector128Of(SByte).GetHashCodeBenchmark.html>) | 38.11 ns | 54.62 ns | 1.43 | 0.01 | False |
git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>*'
Architecture | x64 |
---|---|
OS | ubuntu 18.04 |
Baseline | a8549002b2df6bf64c7c882d8f117630beba762d |
Compare | 8a2b72e3140b14a2ec915345b2747eecdecff943 |
Diff | Diff |
Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
---|---|---|---|---|---|---|---|---|---|---|
[GetHashCodeBenchmark - Duration of single invocation](<https://pvscmdupload.z22.web.core.windows.net/reports/allTestHistory/refs/heads/main_x64_ubuntu 18.04/System.Numerics.Tests.Perf_VectorOf(SByte).GetHashCodeBenchmark.html>) | 79.26 ns | 105.48 ns | 1.33 | 0.06 | False |
git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Numerics.Tests.Perf_VectorOf<SByte>*'
Architecture | x64 |
---|---|
OS | ubuntu 18.04 |
Baseline | a8549002b2df6bf64c7c882d8f117630beba762d |
Compare | 8a2b72e3140b14a2ec915345b2747eecdecff943 |
Diff | Diff |
Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
---|---|---|---|---|---|---|---|---|---|---|
[GetHashCodeBenchmark - Duration of single invocation](<https://pvscmdupload.z22.web.core.windows.net/reports/allTestHistory/refs/heads/main_x64_ubuntu 18.04/System.Runtime.Intrinsics.Tests.Perf_Vector128Of(UInt64).GetHashCodeBenchmark.html>) | 5.77 ns | 7.44 ns | 1.29 | 0.16 | False |
git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>*'
Architecture | x64 |
---|---|
OS | ubuntu 18.04 |
Baseline | a8549002b2df6bf64c7c882d8f117630beba762d |
Compare | 8a2b72e3140b14a2ec915345b2747eecdecff943 |
Diff | Diff |
Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
---|---|---|---|---|---|---|---|---|---|---|
[GetHashCodeBenchmark - Duration of single invocation](<https://pvscmdupload.z22.web.core.windows.net/reports/allTestHistory/refs/heads/main_x64_ubuntu 18.04/System.Runtime.Intrinsics.Tests.Perf_Vector128Of(Int16).GetHashCodeBenchmark.html>) | 19.68 ns | 24.73 ns | 1.26 | 0.02 | False |
git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>*'
Architecture | x64 |
---|---|
OS | ubuntu 18.04 |
Baseline | a8549002b2df6bf64c7c882d8f117630beba762d |
Compare | 8a2b72e3140b14a2ec915345b2747eecdecff943 |
Diff | Diff |
Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio | Baseline ETL | Compare ETL |
---|---|---|---|---|---|---|---|---|---|---|
[GetHashCodeBenchmark - Duration of single invocation](<https://pvscmdupload.z22.web.core.windows.net/reports/allTestHistory/refs/heads/main_x64_ubuntu 18.04/System.Runtime.Intrinsics.Tests.Perf_Vector128Of(Int32).GetHashCodeBenchmark.html>) | 9.41 ns | 13.56 ns | 1.44 | 0.12 | False |
git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f net8.0 --filter 'System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>*'
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
IL diff for HashCode.Combine
:
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak See info in area-owners.md if you want to be subscribed.
Author: | performanceautofiler[bot] |
---|---|
Assignees: | EgorBo |
Labels: | `os-linux`, `tenet-performance`, `tenet-performance-benchmarks`, `arch-x64`, `area-CodeGen-coreclr` |
Milestone: | - |
So it is no longer inlined:
multiplier in methods of struct increased to 3.
Inline candidate is generic and caller is not. Multiplier increased to 5.
Inline has 2 foldable BOX ops. Multiplier increased to 8.
Inline candidate callsite is in a loop. Multiplier increased to 11.
calleeNativeSizeEstimate=1251
callsiteNativeSizeEstimate=85
benefit multiplier=11
threshold=935
Native estimate for function size exceeds threshold for inlining 125.1 > 93.5 (multiplier = 11)
So I assume Roslyn has fixed some IL bug or something like that (e.g. https://github.com/dotnet/roslyn/pull/64789) so this is not actionable
The codegen difference in the screenshot above looks like the result of https://github.com/dotnet/roslyn/pull/65642, related issue in dotnet/runtime is #73606.
Performance impact of that change, FYI @jaredpar @AlekseyTs
Is there a suggestion on how we can emit this code better? The runtime team asked us to fix a bug which to our knowledge requires the new codegen. Until we properly spill these values codegen can't fix their bug.
Is there a suggestion on how we can emit this code better? The runtime team asked us to fix a bug which to our knowledge requires the new codegen. Until we properly spill these values codegen can't fix their bug.
I just wanted to let you know that we also try to track perf issues after Roslyn updates 🙂 - these regressions don't look bad to me at all honestly, the HashCode.Combine is perf critical, but the big performance difference is purely because of inlining. So I assume we can close this. Unless somebody thinks it should be improved.
Is there a suggestion on how we can emit this code better? The runtime team asked us to fix a bug which to our knowledge requires the new codegen. Until we properly spill these values codegen can't fix their bug.
A copy is only needed when the receiver could actually be changed between loading its address and the constrained call. It is not the case in the above screenshot (and probably in most cases). However, it would of course require some analysis to determine whether it was safe to omit the copy.
A copy is only needed when the receiver could actually be changed between loading its address and the constrained call. It is not the case in the above screenshot (and probably in most cases). However, it would of course require some analysis to determine whether it was safe to omit the copy.
We do detect some arguments "safe" for reordering, but it was not our goal to be able to accurately detect "safety" for non-trivial scenarios. I assume that the argument in the scenario is not trivial, perhaps another method call or something similar.
@AlekseyTs maybe I am missing something, but there are no args in the above screenshot -- the IL emitted before is
ldarga.s value1
constrained. !!T1
callvirt instance int32 System.Object::GetHashCode()
maybe I am missing something, but there are no args in the above screenshot
What C# code looks like? Perhaps the IL is different due to a different reason. Due to https://github.com/dotnet/roslyn/pull/66250, for example.
What C# code looks like? Perhaps the IL is different due to a different reason
So actually https://github.com/dotnet/roslyn/pull/66250 might be the cause after all?
I think so.
We can change the runtime implementation here to avoid the conditional access to fix this regression. However, it kind of puts us down a path of never using conditional access on generically typed values that may be structs.
@jakobbotsch what about putting some level of “trust” in the C# language features such as “readonly members” as a way to determine “this method mutates vs not”?
That is, rather than doing the complex analysis on the JIT side to determine if a non inlined method does mutate x, and would therefore require a copy to be preserved, we can check for the IsReadOnlyAttribute and use that as a much cheaper way of determining the same thing.
There are of course unsafe scenarios where that may change the observed output, but it will be correct in the vast majority of cases and match how C# will itself emit code in other scenarios involving the same (minus generics because no “readonly” concept exists for reference types, hence the general regression here)
That should allow us to much more trivially fix this case and improve codegen in a number of other places while also taking advantage of C# metadata that already tracks the info we care about
That should allow us to much more trivially fix this case and improve codegen in a number of other places while also taking advantage of C# metadata that already tracks the info we care about
We likely have all information we need to eliminate this copy in the vast majority of cases even today (e.g. in the above case we can see that the copy is the last use, so nothing fundamentally prevents us from removing it). But actually doing so requires #8887.
Once we have #8887 I think it makes good sense to experiment with using IsReadOnlyAttribute
too as an input to whether reverse copy propagation can be done safely.
Run Information
Regressions in System.Numerics.Tests.Perf_VectorOf<Byte>
Test Report
Repro
Run Information
Regressions in System.Numerics.Tests.Perf_VectorOf<UInt16>
Test Report
Repro
Run Information
Regressions in System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>
Test Report
Repro
Run Information
Regressions in System.Numerics.Tests.Perf_VectorOf<Double>
Test Report
Repro
Run Information
Regressions in System.Tests.Perf_HashCode
Test Report
Repro
Run Information
Regressions in System.Numerics.Tests.Perf_VectorOf<Int64>
Test Report
Repro
Run Information
Regressions in System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>
Test Report
Repro
Run Information
Regressions in System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>
Test Report
Repro
Run Information
Regressions in System.Numerics.Tests.Perf_VectorOf<UInt32>
Test Report
Repro
Run Information
Regressions in System.Numerics.Tests.Perf_VectorOf<UInt64>
Test Report
Repro
Run Information
Regressions in System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>
Test Report
Repro
Run Information
Regressions in System.Numerics.Tests.Perf_VectorOf<Int16>
Test Report
Repro