dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

.NET 8.0.10 vs 9.0.0 RC2 GC Server Performance Regression in Sep (CSV Parser) Benchmark (due to DATAS default) #109047

Open nietras opened 1 day ago

nietras commented 1 day ago

In https://github.com/nietras/Sep (a fast, highly optimized CSV parser) I have been comparing performance with comparison-bench.ps1 between .NET 8 and .NET 9 RC2 and have observed what appears to be a consistent and significant performance regression when using ServerGarbageCollection (true). The benchmark in question is also discussed in https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers

Benchmarks can be run by cloning the Sep repo, checking out branch net9.0, and running the command in comparison-bench.ps1, perhaps adding --filter *GcServer*Sep* or similar. Details for the benchmark and machine are given below via BenchmarkDotNet.

As can be seen, this shows a regression in a scenario of many medium-size object allocations, ranging from 500.3 ms / 429.7 ms ≈ 1.16x (single-threaded) to 174.8 ms / 103.0 ms ≈ 1.70x (multi-threaded).

I know there have been changes to the GC, so my question is whether this regression is expected. I also just wanted to flag it in case it is of interest.

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19044.3086/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  Job-YVJTZC : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  Job-ZDJCYM : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2

Server=True  InvocationCount=Default  IterationTime=350ms  
MaxIterationCount=15  MinIterationCount=5  WarmupCount=6  
Quotes=False  Reader=String

| Method    | Runtime  | Scope | Rows    | Mean       | Ratio | MB  | MB/s   | ns/row | Allocated   | Alloc Ratio |
|---------- |--------- |------ |-------- |-----------:|------:|----:|-------:|-------:|------------:|------------:|
| Sep__     | .NET 8.0 | Asset | 50000   |  21.402 ms |  1.00 |  29 | 1363.5 |  428.0 |  14133102 B |        1.00 |
| Sep_MT___ | .NET 8.0 | Asset | 50000   |   5.576 ms |  0.26 |  29 | 5233.7 |  111.5 |  14308501 B |        1.01 |
| Sep__     | .NET 9.0 | Asset | 50000   |  24.444 ms |  1.14 |  29 | 1193.8 |  488.9 |  14133077 B |        1.00 |
| Sep_MT___ | .NET 9.0 | Asset | 50000   |   8.965 ms |  0.42 |  29 | 3255.0 |  179.3 |  14310332 B |        1.01 |
| Sep__     | .NET 8.0 | Asset | 1000000 | 429.654 ms |  1.00 | 583 | 1358.7 |  429.7 | 273063216 B |        1.00 |
| Sep_MT___ | .NET 8.0 | Asset | 1000000 | 102.979 ms |  0.24 | 583 | 5668.9 |  103.0 | 274049328 B |        1.00 |
| Sep__     | .NET 9.0 | Asset | 1000000 | 500.250 ms |  1.16 | 583 | 1167.0 |  500.3 | 273062592 B |        1.00 |
| Sep_MT___ | .NET 9.0 | Asset | 1000000 | 174.802 ms |  0.41 | 583 | 3339.7 |  174.8 | 273973628 B |        1.00 |

EgorBo commented 1 day ago

Try with DATAS disabled e.g. <GarbageCollectionAdaptationMode>0</GarbageCollectionAdaptationMode>
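
For context, here is a minimal sketch of where that property would sit in a project file. The surrounding `PropertyGroup` and the `ServerGarbageCollector` line are assumptions added for illustration (Server GC matches the `Server=True` setting in the benchmark output); they are not taken from the Sep repository.

```xml
<!-- Hypothetical excerpt from a .csproj; only the two GC properties matter here. -->
<PropertyGroup>
  <!-- Use Server GC, matching Server=True in the benchmark runs. -->
  <ServerGarbageCollector>true</ServerGarbageCollector>
  <!-- 0 disables DATAS, 1 enables it. -->
  <GarbageCollectionAdaptationMode>0</GarbageCollectionAdaptationMode>
</PropertyGroup>
```

A project-file property applies to the application built from that project. A benchmark harness that builds and launches its own processes may not pick it up, so the effective setting is worth confirming in the reported job configuration.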

stephentoub commented 1 day ago

cc: @mangod9, @Maoni0

nietras commented 1 day ago

The command I run from branch net9.0:

dotnet run -c Release -f net8.0 --project src/Sep.ComparisonBenchmarks/Sep.ComparisonBenchmarks.csproj -- -m --warmupCount 6 --minIterationCount 5 --maxIterationCount 15 --runtimes net80 net90 --iterationTime 350 --hide Type Quotes Reader RatioSD Gen0 Gen1 Gen2 Error Median StdDev --filter *GcServerLongAsset*Sep*

No change with <GarbageCollectionAdaptationMode>0</GarbageCollectionAdaptationMode>, but I can't remember whether BDN actually forwards this to sub-processes. Is there a flag to tell BDN to use this, like Server=True?

Server=True  InvocationCount=Default  IterationTime=350ms
MaxIterationCount=15  MinIterationCount=5  WarmupCount=6
Quotes=False  Reader=String

| Method    | Runtime  | Scope | Rows    | Mean     | Ratio | MB  | MB/s   | ns/row | Allocated | Alloc Ratio |
|---------- |--------- |------ |-------- |---------:|------:|----:|-------:|-------:|----------:|------------:|
| Sep______ | .NET 8.0 | Asset | 1000000 | 431.7 ms |  1.00 | 583 | 1352.1 |  431.7 | 260.41 MB |        1.00 |
| Sep_MT___ | .NET 8.0 | Asset | 1000000 | 111.1 ms |  0.26 | 583 | 5252.6 |  111.1 |  261.2 MB |        1.00 |
| Sep______ | .NET 9.0 | Asset | 1000000 | 500.7 ms |  1.16 | 583 | 1165.9 |  500.7 | 260.42 MB |        1.00 |
| Sep_MT___ | .NET 9.0 | Asset | 1000000 | 178.4 ms |  0.41 | 583 | 3272.0 |  178.4 | 261.32 MB |        1.00 |

nietras commented 1 day ago

Yes, it's DATAS. I tried setting it with an environment variable, e.g. for BDN with --envVars DOTNET_GCDynamicAdaptationMode:0, and ran with both 0 and 1, as can be seen below. This means the "regression" is solely due to DATAS being the default; otherwise there is no difference.

NO DATAS

dotnet run -c Release -f net8.0 --project src/Sep.ComparisonBenchmarks/Sep.ComparisonBenchmarks.csproj -- -m --warmupCount 6 --minIterationCount 5 --maxIterationCount 15 --runtimes net80 net90 --iterationTime 350 --hide Type Quotes Reader RatioSD Gen0 Gen1 Gen2 Error Median StdDev --filter *GcServerLongAsset*Sep* --envVars DOTNET_GCDynamicAdaptationMode:0
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19044.3086/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  Job-KKDGWQ : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  Job-HUTQEJ : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=0  Server=True  InvocationCount=Default
IterationTime=350ms  MaxIterationCount=15  MinIterationCount=5
WarmupCount=6  Quotes=False  Reader=String

| Method    | Runtime  | Scope | Rows    | Mean     | Ratio | MB  | MB/s   | ns/row | Allocated | Alloc Ratio |
|---------- |--------- |------ |-------- |---------:|------:|----:|-------:|-------:|----------:|------------:|
| Sep______ | .NET 8.0 | Asset | 1000000 | 452.7 ms |  1.00 | 583 | 1289.6 |  452.7 | 260.41 MB |        1.00 |
| Sep_MT___ | .NET 8.0 | Asset | 1000000 | 112.4 ms |  0.25 | 583 | 5195.4 |  112.4 | 261.51 MB |        1.00 |
| Sep______ | .NET 9.0 | Asset | 1000000 | 445.3 ms |  0.98 | 583 | 1310.9 |  445.3 | 260.41 MB |        1.00 |
| Sep_MT___ | .NET 9.0 | Asset | 1000000 | 117.8 ms |  0.26 | 583 | 4954.0 |  117.8 | 261.38 MB |        1.00 |

DATAS

dotnet run -c Release -f net8.0 --project src/Sep.ComparisonBenchmarks/Sep.ComparisonBenchmarks.csproj -- -m --warmupCount 6 --minIterationCount 5 --maxIterationCount 15 --runtimes net80 net90 --iterationTime 350 --hide Type Quotes Reader RatioSD Gen0 Gen1 Gen2 Error Median StdDev --filter *GcServerLongAsset*Sep* --envVars DOTNET_GCDynamicAdaptationMode:1
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19044.3086/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  Job-ZORNME : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  Job-BHTHZN : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_GCDynamicAdaptationMode=1  Server=True  InvocationCount=Default
IterationTime=350ms  MaxIterationCount=15  MinIterationCount=5
WarmupCount=6  Quotes=False  Reader=String

| Method    | Runtime  | Scope | Rows    | Mean     | Ratio | MB  | MB/s   | ns/row | Allocated | Alloc Ratio |
|---------- |--------- |------ |-------- |---------:|------:|----:|-------:|-------:|----------:|------------:|
| Sep______ | .NET 8.0 | Asset | 1000000 | 527.5 ms |  1.00 | 583 | 1106.6 |  527.5 | 260.41 MB |        1.00 |
| Sep_MT___ | .NET 8.0 | Asset | 1000000 | 170.0 ms |  0.32 | 583 | 3433.5 |  170.0 | 261.41 MB |        1.00 |
| Sep______ | .NET 9.0 | Asset | 1000000 | 528.2 ms |  1.00 | 583 | 1105.2 |  528.2 | 260.41 MB |        1.00 |
| Sep_MT___ | .NET 9.0 | Asset | 1000000 | 182.9 ms |  0.35 | 583 | 3192.2 |  182.9 | 261.17 MB |        1.00 |

mangod9 commented 1 day ago

Yeah, a throughput regression for certain microbenchmark scenarios is expected with DATAS. I assume the benchmark shows improved working set utilization?

hez2010 commented 1 day ago

It is expected in .NET 9.

In general, DATAS should benefit real-world applications a lot, as it can significantly reduce the working set and also improve GC latency, though it comes with a minor throughput penalty.

In another similar issue (#101006) I ran a binary-tree allocation benchmark and got the following result on .NET 9 RC2:

image-6.png

Considering the large improvements to latency and working set, I would accept the minor throughput regression.