Open ghost opened 3 years ago
Hi, please check out https://medium.com/swlh/astonishing-performance-of-net-5-more-data-5cdc8d821e8c
Overall, .NET 5 didn't improve much in this sense, though I guess the issue with this test is that it doesn't involve pinned objects, and these objects were the main target of .NET 5 GC perf. improvements.
They added the bitonic sort in .net afaik.
Have you reached out to maoni or the .net pms?
No, I have not. I'll try to.
Any plans to run this test on .NET 6? @alexyakunin
Any plans to run this test on .NET 7? @alexyakunin
@InCerryGit @PeterXiao As far as I know, there have not been many GC improvements from .NET 5 to .NET 9. Unless the .NET team does a major overhaul, the results won't improve significantly.
Actually, there have also been some performance optimizations and changes, such as the move from segments to regions, DATAS, etc.
I would predict that their impact will not exceed 50%, while this test needs an improvement of dozens or hundreds of times. Otherwise, the latency gap between .NET and Go will not narrow significantly.
In .NET 8, RAM=16 GB, max STW duration: 443.255 ms
Launch parameters:
Software:
Runtime: .NET Core
Version: .NET 8.0.5
GC mode: Workstation GC, Latency mode: Interactive, LOH compaction mode: Default, Large pages: disabled, Generations: 0..2
OS: Microsoft Windows 10.0.19045 (X64)
Hardware:
CPU: AMD Ryzen 5 5600X 6-Core Processor
CPU core count: 12
RAM size: 32 GB
--- Caching / compute server (static set = 50% RAM) ---
Test settings:
Duration: 10 s
Thread count: 12
Static set:
Total size: 16 GB
Object count: 186.218 M
Actual duration: 10.465 s
Allocation speed:
Operations per second: 42.114 M/s
Bytes per second: 3.618 GB/s
Allocation stats:
Size:
Min .. Max:
Min: 32 B
Avg: 92.25 B
Max: 131072 B
Percentiles:
50%: 40 B
90%: 120 B
95%: 144 B
99%: 376 B
99.9%: 5144 B
99.99%: 19456 B
Hold duration:
Min .. Max:
Min: 0 ms
Avg: 11.586 ms
Max: 40000 ms
Percentiles:
50%: 0 ms
90%: 0 ms
95%: 0.1 ms
99%: 100 ms
99.9%: 200 ms
99.99%: 20000 ms
GC stats:
RAM used: 16.03 -> 18.152 GB
GC rate:
Gen0, # per second: 1466.439 /s
Gen1, # per second: 380.298 /s
Gen2, # per second: 13.091 /s
Thread pauses:
% of time frozen: 47.137 %
# per second:
Min .. Max:
Min: 371 /s
Avg: 784.962 /s
Max: 1157 /s
Percentiles:
50%: 785 /s
90%: 1032 /s
95%: 1075 /s
99%: 1147 /s
99.9%: 1149 /s
99.99%: 1149 /s
Global pauses:
% of time frozen: 44.096 %
# per second: 658.928 /s
Pause duration:
Min .. Max:
Min: 0 ms
Avg: 0.669 ms
Max: 443.255 ms
Percentiles:
50%: 0.579 ms
90%: 0.958 ms
95%: 1.012 ms
99%: 1.278 ms
99.9%: 3.704 ms
99.99%: 7.301 ms
@sgf
Thank you very much for your test results. Due to different machine configurations, it might be necessary to run the .NET 5 program on the same machine to compare results. Could you try running it with ServerGC and ServerGC+Datas separately?
set DOTNET_gcServer=1
set DOTNET_GCDynamicAdaptationMode=1
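For anyone reproducing the two requested runs, they could be scripted roughly as below. This is only a sketch: the two environment variables are the ones quoted above, while the `dotnet GCBurn.dll` command line, `CONFIGS`, `build_env`, and `run_all` are hypothetical names chosen for illustration and should be adapted to wherever the benchmark was built.

```python
import os
import subprocess

# Environment overrides for the two runs requested above.
CONFIGS = {
    "ServerGC": {"DOTNET_gcServer": "1"},
    "ServerGC+DATAS": {
        "DOTNET_gcServer": "1",
        "DOTNET_GCDynamicAdaptationMode": "1",
    },
}

GC_VARS = ("DOTNET_gcServer", "DOTNET_GCDynamicAdaptationMode")


def build_env(overrides, base=None):
    """Return a copy of the base environment with only the given GC settings."""
    env = dict(os.environ if base is None else base)
    # Drop inherited GC settings so each run starts from a known state.
    for name in GC_VARS:
        env.pop(name, None)
    env.update(overrides)
    return env


def run_all(benchmark_cmd):
    """Run the benchmark once per configuration.

    benchmark_cmd is an assumption, e.g. ["dotnet", "GCBurn.dll"].
    """
    for label, overrides in CONFIGS.items():
        print(f"--- {label} ---")
        subprocess.run(benchmark_cmd, env=build_env(overrides), check=True)


# Example call (hypothetical path, not executed here):
# run_all(["dotnet", "GCBurn.dll"])
```

Clearing both variables before each run keeps a value left over in the shell from leaking into the plain ServerGC run.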
With only 16 GB, the longest pause is close to 0.5 seconds, so I don't think it's necessary to try again. If it were 5 ms, maybe I would consider it. .NET's large STW latency will not change significantly even on a better machine. As for settings such as Server GC, there will be no obvious difference.
.NET has seen little improvement in WorkstationGC because its performance is already sufficient. However, ServerGC is a different story; it is actively being updated. Features like BGC, ConcurrentGC, and Regions are currently only available in ServerGC.
Static set = 20% RAM: RAM=6.427 -> 7.867 GB, max STW=544 ms
Static set = 50% RAM: RAM=16.029 -> 20.386 GB, max STW=45.296 ms
Launch parameters:
Software:
Runtime: .NET Core
Version: .NET 8.0.5
GC mode: Server GC, Latency mode: Interactive, LOH compaction mode: Default, Large pages: disabled, Generations: 0..2
OS: Microsoft Windows 10.0.19045 (X64)
Hardware:
CPU: AMD Ryzen 5 5600X 6-Core Processor
CPU core count: 12
RAM size: 32 GB
--- Worker / typical server (static set = 20% RAM) ---
Test settings:
Duration: 10 s
Thread count: 12
Static set:
Total size: 6.4 GB
Object count: 74.482 M
Actual duration: 10.001 s
Allocation speed:
Operations per second: 90.414 M/s
Bytes per second: 7.768 GB/s
Allocation stats:
Size:
Min .. Max:
Min: 32 B
Avg: 92.25 B
Max: 131072 B
Percentiles:
50%: 40 B
90%: 120 B
95%: 144 B
99%: 376 B
99.9%: 5144 B
99.99%: 19456 B
Hold duration:
Min .. Max:
Min: 0 ms
Avg: 11.586 ms
Max: 40000 ms
Percentiles:
50%: 0 ms
90%: 0 ms
95%: 0.1 ms
99%: 100 ms
99.9%: 200 ms
99.99%: 20000 ms
GC stats:
RAM used: 6.427 -> 7.867 GB
Thread pauses:
% of time frozen: 35.697 %
# per second:
Min .. Max:
Min: 2 /s
Avg: 852.207 /s
Max: 1153 /s
Percentiles:
50%: 866 /s
90%: 1100 /s
95%: 1126 /s
99%: 1140 /s
99.9%: 1144 /s
99.99%: 1144 /s
Global pauses:
% of time frozen: 32.53 %
# per second: 757.986 /s
Pause duration:
Min .. Max:
Min: 0 ms
Avg: 0.429 ms
Max: 544 ms
Percentiles:
50%: 0.212 ms
90%: 0.314 ms
95%: 0.355 ms
99%: 1.187 ms
99.9%: 6.21 ms
99.99%: 508.731 ms
--- Caching / compute server (static set = 50% RAM) ---
Test settings:
Duration: 10 s
Thread count: 12
Static set:
Total size: 16 GB
Object count: 186.218 M
Actual duration: 10.007 s
Allocation speed:
Operations per second: 90.499 M/s
Bytes per second: 7.775 GB/s
Allocation stats:
Size:
Min .. Max:
Min: 32 B
Avg: 92.25 B
Max: 131072 B
Percentiles:
50%: 40 B
90%: 120 B
95%: 144 B
99%: 376 B
99.9%: 5144 B
99.99%: 19456 B
Hold duration:
Min .. Max:
Min: 0 ms
Avg: 11.586 ms
Max: 40000 ms
Percentiles:
50%: 0 ms
90%: 0 ms
95%: 0.1 ms
99%: 100 ms
99.9%: 200 ms
99.99%: 20000 ms
GC stats:
RAM used: 16.029 -> 20.386 GB
Thread pauses:
% of time frozen: 35.812 %
# per second:
Min .. Max:
Min: 1 /s
Avg: 760.402 /s
Max: 839 /s
Percentiles:
50%: 771 /s
90%: 802 /s
95%: 819 /s
99%: 836 /s
99.9%: 837 /s
99.99%: 837 /s
Global pauses:
% of time frozen: 33.965 %
# per second: 735.363 /s
Pause duration:
Min .. Max:
Min: 0 ms
Avg: 0.462 ms
Max: 45.296 ms
Percentiles:
50%: 0.416 ms
90%: 0.517 ms
95%: 0.569 ms
99%: 1.166 ms
99.9%: 7.483 ms
99.99%: 22.55 ms
@sgf Thank you for your benchmark results. ServerGC appears to be a significant improvement over WorkstationGC. The long maximum pause in the 20% RAM run might be due to insufficient memory being allocated at first, resulting in one longer pause. Once the GC adapts to the load, the maximum pause time improves roughly tenfold compared to WorkstationGC (443.255 ms vs. 45.296 ms).
Metric | Workstation GC | Server GC (20% RAM) | Server GC (50% RAM) |
---|---|---|---|
Duration | 10 s | 10 s | 10 s |
Thread count | 12 | 12 | 12 |
Static set | 50% RAM | 20% RAM | 50% RAM |
Static set total size | 16 GB | 6.4 GB | 16 GB |
Static set object count | 186.218 M | 74.482 M | 186.218 M |
Operations per second | 42.114 M/s | 90.414 M/s | 90.499 M/s |
Bytes per second | 3.618 GB/s | 7.768 GB/s | 7.775 GB/s |
Allocation size (Min / Avg / Max) | 32 B / 92.25 B / 131072 B | 32 B / 92.25 B / 131072 B | 32 B / 92.25 B / 131072 B |
Allocation size, 50% percentile | 40 B | 40 B | 40 B |
Allocation size, 90% percentile | 120 B | 120 B | 120 B |
Allocation size, 95% percentile | 144 B | 144 B | 144 B |
Allocation size, 99% percentile | 376 B | 376 B | 376 B |
Allocation size, 99.9% percentile | 5144 B | 5144 B | 5144 B |
Allocation size, 99.99% percentile | 19456 B | 19456 B | 19456 B |
Hold duration (Min / Avg / Max) | 0 ms / 11.586 ms / 40000 ms | 0 ms / 11.586 ms / 40000 ms | 0 ms / 11.586 ms / 40000 ms |
Hold duration, 50% percentile | 0 ms | 0 ms | 0 ms |
Hold duration, 90% percentile | 0 ms | 0 ms | 0 ms |
Hold duration, 95% percentile | 0.1 ms | 0.1 ms | 0.1 ms |
Hold duration, 99% percentile | 100 ms | 100 ms | 100 ms |
Hold duration, 99.9% percentile | 200 ms | 200 ms | 200 ms |
Hold duration, 99.99% percentile | 20000 ms | 20000 ms | 20000 ms |
RAM used (start -> end) | 16.03 -> 18.152 GB | 6.427 -> 7.867 GB | 16.029 -> 20.386 GB |
Gen0, # per second | 1466.439 /s | - | - |
Gen1, # per second | 380.298 /s | - | - |
Gen2, # per second | 13.091 /s | - | - |
Thread pauses | | | |
% of time frozen | 47.137 % | 35.697 % | 35.812 % |
# per second (Min / Avg / Max) | 371 /s / 784.962 /s / 1157 /s | 2 /s / 852.207 /s / 1153 /s | 1 /s / 760.402 /s / 839 /s |
# per second, 50% percentile | 785 /s | 866 /s | 771 /s |
# per second, 90% percentile | 1032 /s | 1100 /s | 802 /s |
# per second, 95% percentile | 1075 /s | 1126 /s | 819 /s |
# per second, 99% percentile | 1147 /s | 1140 /s | 836 /s |
# per second, 99.9% percentile | 1149 /s | 1144 /s | 837 /s |
# per second, 99.99% percentile | 1149 /s | 1144 /s | 837 /s |
Global pauses (stop the world) | | | |
% of time frozen | 44.096 % | 32.53 % | 33.965 % |
# per second | 658.928 /s | 757.986 /s | 735.363 /s |
Pause duration (Min / Avg / Max) | 0 ms / 0.669 ms / 443.255 ms | 0 ms / 0.429 ms / 544 ms | 0 ms / 0.462 ms / 45.296 ms |
Pause duration, 50% percentile | 0.579 ms | 0.212 ms | 0.416 ms |
Pause duration, 90% percentile | 0.958 ms | 0.314 ms | 0.517 ms |
Pause duration, 95% percentile | 1.012 ms | 0.355 ms | 0.569 ms |
Pause duration, 99% percentile | 1.278 ms | 1.187 ms | 1.166 ms |
Pause duration, 99.9% percentile | 3.704 ms | 6.21 ms | 7.483 ms |
Pause duration, 99.99% percentile | 7.301 ms | 508.731 ms | 22.55 ms |
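As a quick sanity check on the table, the reported "% of time frozen" for global pauses should be close to average pause duration times pause rate, and the "tenfold" claim can be read straight off the maximum pause values. The sketch below only reuses figures from the table; the names `RUNS`, `frozen_percent`, and `ratio` are made up for illustration.

```python
# Global-pause figures copied from the table above.
RUNS = {
    "Workstation GC": dict(avg_ms=0.669, max_ms=443.255, per_s=658.928,
                           frozen=44.096, mops=42.114),
    "Server GC (20% RAM)": dict(avg_ms=0.429, max_ms=544.0, per_s=757.986,
                                frozen=32.53, mops=90.414),
    "Server GC (50% RAM)": dict(avg_ms=0.462, max_ms=45.296, per_s=735.363,
                                frozen=33.965, mops=90.499),
}


def frozen_percent(avg_ms, per_s):
    """Estimate % of wall time in global pauses: avg pause (ms) x pauses per second."""
    return avg_ms * per_s / 1000.0 * 100.0


def ratio(a, b):
    """Plain ratio a / b, used for the speedup comparisons below."""
    return a / b


for name, r in RUNS.items():
    est = frozen_percent(r["avg_ms"], r["per_s"])
    print(f"{name}: estimated {est:.2f} % frozen, reported {r['frozen']} %")

ws = RUNS["Workstation GC"]
srv = RUNS["Server GC (50% RAM)"]
print(f"max pause ratio (Workstation / Server 50%): {ratio(ws['max_ms'], srv['max_ms']):.2f}x")
print(f"throughput ratio (Server 50% / Workstation): {ratio(srv['mops'], ws['mops']):.2f}x")
```

Note that the improvement applies to the maximum pause and to throughput; the share of time spent frozen drops much less, from about 44% to about 34%.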
The improvement is not big enough, not radical enough. Ideally, STW pauses should stay below 5 ms, heaps of up to 256 GB should be easy to handle, and the STW time and frequency should not depend on the number of objects.
Since my PC's memory is limited, you would need to use a server for testing to get ideal results. When developing large-scale websites or large-scale game servers, 48 GB, 64 GB, 96 GB, and 128 GB of memory are common.
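To put the 5 ms target in perspective, the gap can be expressed as the speedup each run above would need on its maximum pause. This is a rough sketch using only the maximum pause values reported earlier in the thread; `TARGET_MS`, `MAX_PAUSE_MS`, and `required_speedup` are illustrative names.

```python
TARGET_MS = 5.0  # the ideal STW ceiling stated above

# Maximum global pause per run, as reported earlier in the thread.
MAX_PAUSE_MS = {
    "Workstation GC (50% RAM)": 443.255,
    "Server GC (20% RAM)": 544.0,
    "Server GC (50% RAM)": 45.296,
}


def required_speedup(observed_ms, target_ms=TARGET_MS):
    """How many times shorter the observed pause must get to meet the target."""
    return observed_ms / target_ms


for name, pause in MAX_PAUSE_MS.items():
    print(f"{name}: needs about {required_speedup(pause):.1f}x shorter max pause")
```

Even the best Server GC run would need its worst pause to shrink by roughly an order of magnitude, and the other two by closer to two orders.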
Here are the results for .NET 9 Preview 6 running on a 32-core, 256 GB server with 189 GB of memory:
ServerGC
ServerGC + DATAS
For Go, there is only a result for 126 GB, because it ran out of memory (OOM) at 189 GB. Because Go's GC is neither generational nor compacting, its memory usage is significantly higher: the 126 GB test required 217 GB of memory to complete.
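The Go figures above are consistent with the OOM at 189 GB, if one assumes the memory overhead scales roughly linearly with the test size. That linearity is an assumption for illustration, not something measured in the thread; the constant and function names are likewise made up.

```python
SERVER_RAM_GB = 256.0      # server memory, from the comment above
GO_TEST_SIZE_GB = 126.0    # test size that Go completed
GO_MEMORY_USED_GB = 217.0  # memory Go needed for that test


def overhead_factor(used_gb, test_size_gb):
    """Memory actually needed per GB of test size."""
    return used_gb / test_size_gb


def projected_memory(test_size_gb, factor):
    """Projected memory need, ASSUMING the overhead factor stays constant."""
    return test_size_gb * factor


factor = overhead_factor(GO_MEMORY_USED_GB, GO_TEST_SIZE_GB)
needed = projected_memory(189.0, factor)
print(f"overhead factor: {factor:.2f}x")
print(f"projected need at 189 GB: {needed:.1f} GB (server has {SERVER_RAM_GB:.0f} GB)")
print("fits" if needed <= SERVER_RAM_GB else "does not fit, consistent with the OOM")
```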
Hi,
Awesome research you've done here. Have you run this on .NET 5 to see how things have improved?