alexyakunin / GCBurn

Garbage collection / allocation performance tests for various languages (for now, just C# / .NET and Go)
Apache License 2.0

New results on .NET 5 #1

Open ghost opened 3 years ago

ghost commented 3 years ago

Hi,

Awesome research you've done here. Have you run this on .NET 5 to see how things have improved?

alexyakunin commented 3 years ago

Hi, please check out https://medium.com/swlh/astonishing-performance-of-net-5-more-data-5cdc8d821e8c

Overall, .NET 5 didn't improve much in this sense, though I guess the issue with this test is that it doesn't involve pinned objects, and these objects were the main target of .NET 5 GC perf. improvements.

ghost commented 3 years ago

They added bitonic sort in .NET, AFAIK.

Have you reached out to Maoni or the .NET PMs?

alexyakunin commented 3 years ago

No, I have not. I'll try to.

InCerryGit commented 2 years ago

Any plans to run this test on .NET 6? @alexyakunin

PeterXiao commented 1 year ago

Any plans to run this test on .NET 7? @alexyakunin

sgf commented 4 months ago

@InCerryGit @PeterXiao As far as I know, there haven't been many GC improvements from .NET 5 to .NET 9. Unless the .NET team makes a major overhaul, the results won't improve significantly.

InCerryGit commented 4 months ago

Actually, there have been some performance optimizations and changes, such as the move from segments to regions, DATAS, etc.

sgf commented 4 months ago

I'd predict that the impact of those changes won't exceed 50%, while this test needs an improvement of tens or hundreds of times. Otherwise, the latency gap between .NET and Go won't narrow significantly.

sgf commented 4 months ago

On .NET 8 with a 16 GB static set, the max STW pause duration is 443.255 ms:

Launch parameters:
Software:
  Runtime:            .NET Core
    Version:          .NET 8.0.5
    GC mode:          Workstation GC, Latency mode: Interactive, LOH compaction mode: Default, Large pages: disabled, Generations: 0..2
  OS:                 Microsoft Windows 10.0.19045 (X64)
Hardware:
  CPU:                AMD Ryzen 5 5600X 6-Core Processor
  CPU core count:     12
  RAM size:           32 GB

--- Caching / compute server (static set = 50% RAM) ---

Test settings:
  Duration:           10 s
  Thread count:       12
  Static set:
    Total size:       16 GB
    Object count:     186.218 M

Actual duration:      10.465 s
Allocation speed:
  Operations per second: 42.114 M/s
  Bytes per second:   3.618 GB/s
  Allocation stats:
    Size:
      Min .. Max:
        Min:          32 B
        Avg:          92.25 B
        Max:          131072 B
      Percentiles:
        50%:          40 B
        90%:          120 B
        95%:          144 B
        99%:          376 B
        99.9%:        5144 B
        99.99%:       19456 B
    Hold duration:
      Min .. Max:
        Min:          0 ms
        Avg:          11.586 ms
        Max:          40000 ms
      Percentiles:
        50%:          0 ms
        90%:          0 ms
        95%:          0.1 ms
        99%:          100 ms
        99.9%:        200 ms
        99.99%:       20000 ms
GC stats:
  RAM used:           16.03 -> 18.152 GB
  GC rate:
    Gen0, # per second: 1466.439 /s
    Gen1, # per second: 380.298 /s
    Gen2, # per second: 13.091 /s
  Thread pauses:
    % of time frozen: 47.137 %
    # per second:
      Min .. Max:
        Min:          371 /s
        Avg:          784.962 /s
        Max:          1157 /s
      Percentiles:
        50%:          785 /s
        90%:          1032 /s
        95%:          1075 /s
        99%:          1147 /s
        99.9%:        1149 /s
        99.99%:       1149 /s
  Global pauses:
    % of time frozen: 44.096 %
    # per second:     658.928 /s
    Pause duration:
      Min .. Max:
        Min:          0 ms
        Avg:          0.669 ms
        Max:          443.255 ms
      Percentiles:
        50%:          0.579 ms
        90%:          0.958 ms
        95%:          1.012 ms
        99%:          1.278 ms
        99.9%:        3.704 ms
        99.99%:       7.301 ms

InCerryGit commented 4 months ago

@sgf

Thank you very much for your test results. Due to different machine configurations, it might be necessary to run the .NET 5 program on the same machine to compare results. Could you try running it with ServerGC and ServerGC+Datas separately?

set DOTNET_gcServer=1
set DOTNET_GCDynamicAdaptationMode=1

sgf commented 4 months ago

With a static set of only 16 GB, the longest pause is already close to 0.5 seconds, so I don't think it's necessary to try again. If it were around 5 ms, I might consider it. A better machine won't significantly change .NET's large STW latency, and settings such as Server GC won't make an obvious difference either.

InCerryGit commented 4 months ago

.NET has seen little improvement in WorkstationGC because its performance is already sufficient. However, ServerGC is a different story; it is actively being updated. Features like BGC, ConcurrentGC, and Regions are currently only available in ServerGC.

sgf commented 4 months ago

RAM = 6.427 -> 7.867 GB: max STW = 544 ms
RAM = 16.029 -> 20.386 GB: max STW = 45.296 ms

Launch parameters:
Software:
  Runtime:            .NET Core
    Version:          .NET 8.0.5
    GC mode:          Server GC, Latency mode: Interactive, LOH compaction mode: Default, Large pages: disabled, Generations: 0..2
  OS:                 Microsoft Windows 10.0.19045 (X64)
Hardware:
  CPU:                AMD Ryzen 5 5600X 6-Core Processor
  CPU core count:     12
  RAM size:           32 GB

--- Worker / typical server (static set = 20% RAM) ---

Test settings:
  Duration:           10 s
  Thread count:       12
  Static set:
    Total size:       6.4 GB
    Object count:     74.482 M

Actual duration:      10.001 s
Allocation speed:
  Operations per second: 90.414 M/s
  Bytes per second:   7.768 GB/s
  Allocation stats:
    Size:
      Min .. Max:
        Min:          32 B
        Avg:          92.25 B
        Max:          131072 B
      Percentiles:
        50%:          40 B
        90%:          120 B
        95%:          144 B
        99%:          376 B
        99.9%:        5144 B
        99.99%:       19456 B
    Hold duration:
      Min .. Max:
        Min:          0 ms
        Avg:          11.586 ms
        Max:          40000 ms
      Percentiles:
        50%:          0 ms
        90%:          0 ms
        95%:          0.1 ms
        99%:          100 ms
        99.9%:        200 ms
        99.99%:       20000 ms
GC stats:
  RAM used:           6.427 -> 7.867 GB
  Thread pauses:
    % of time frozen: 35.697 %
    # per second:
      Min .. Max:
        Min:          2 /s
        Avg:          852.207 /s
        Max:          1153 /s
      Percentiles:
        50%:          866 /s
        90%:          1100 /s
        95%:          1126 /s
        99%:          1140 /s
        99.9%:        1144 /s
        99.99%:       1144 /s
  Global pauses:
    % of time frozen: 32.53 %
    # per second:     757.986 /s
    Pause duration:
      Min .. Max:
        Min:          0 ms
        Avg:          0.429 ms
        Max:          544 ms
      Percentiles:
        50%:          0.212 ms
        90%:          0.314 ms
        95%:          0.355 ms
        99%:          1.187 ms
        99.9%:        6.21 ms
        99.99%:       508.731 ms

--- Caching / compute server (static set = 50% RAM) ---

Test settings:
  Duration:           10 s
  Thread count:       12
  Static set:
    Total size:       16 GB
    Object count:     186.218 M

Actual duration:      10.007 s
Allocation speed:
  Operations per second: 90.499 M/s
  Bytes per second:   7.775 GB/s
  Allocation stats:
    Size:
      Min .. Max:
        Min:          32 B
        Avg:          92.25 B
        Max:          131072 B
      Percentiles:
        50%:          40 B
        90%:          120 B
        95%:          144 B
        99%:          376 B
        99.9%:        5144 B
        99.99%:       19456 B
    Hold duration:
      Min .. Max:
        Min:          0 ms
        Avg:          11.586 ms
        Max:          40000 ms
      Percentiles:
        50%:          0 ms
        90%:          0 ms
        95%:          0.1 ms
        99%:          100 ms
        99.9%:        200 ms
        99.99%:       20000 ms
GC stats:
  RAM used:           16.029 -> 20.386 GB
  Thread pauses:
    % of time frozen: 35.812 %
    # per second:
      Min .. Max:
        Min:          1 /s
        Avg:          760.402 /s
        Max:          839 /s
      Percentiles:
        50%:          771 /s
        90%:          802 /s
        95%:          819 /s
        99%:          836 /s
        99.9%:        837 /s
        99.99%:       837 /s
  Global pauses:
    % of time frozen: 33.965 %
    # per second:     735.363 /s
    Pause duration:
      Min .. Max:
        Min:          0 ms
        Avg:          0.462 ms
        Max:          45.296 ms
      Percentiles:
        50%:          0.416 ms
        90%:          0.517 ms
        95%:          0.569 ms
        99%:          1.166 ms
        99.9%:        7.483 ms
        99.99%:       22.55 ms

InCerryGit commented 4 months ago

@sgf Thank you for your benchmark results. It seems that ServerGC has made significant improvements compared to WorkstationGC. The 544 ms max pause in the 20% RAM run might be due to insufficient memory allocation, resulting in a longer pause. Once the GC adapts to the load, the max pause improves roughly tenfold compared to WorkstationGC (45.296 ms vs. 443.255 ms in the 50% RAM runs).

| Name | Workstation GC | Server GC (20% RAM) | Server GC (50% RAM) |
|---|---|---|---|
| Duration | 10 s | 10 s | 10 s |
| Thread count | 12 | 12 | 12 |
| Static set | 50% RAM | 20% RAM | 50% RAM |
| Total size | 16 GB | 6.4 GB | 16 GB |
| Object count | 186.218 M | 74.482 M | 186.218 M |
| Operations per second | 42.114 M/s | 90.414 M/s | 90.499 M/s |
| Bytes per second | 3.618 GB/s | 7.768 GB/s | 7.775 GB/s |
| Size (Min / Avg / Max) | 32 B / 92.25 B / 131072 B | 32 B / 92.25 B / 131072 B | 32 B / 92.25 B / 131072 B |
| 50% Percentile | 40 B | 40 B | 40 B |
| 90% Percentile | 120 B | 120 B | 120 B |
| 95% Percentile | 144 B | 144 B | 144 B |
| 99% Percentile | 376 B | 376 B | 376 B |
| 99.9% Percentile | 5144 B | 5144 B | 5144 B |
| 99.99% Percentile | 19456 B | 19456 B | 19456 B |
| Hold duration (Min / Avg / Max) | 0 ms / 11.586 ms / 40000 ms | 0 ms / 11.586 ms / 40000 ms | 0 ms / 11.586 ms / 40000 ms |
| 50% Percentile | 0 ms | 0 ms | 0 ms |
| 90% Percentile | 0 ms | 0 ms | 0 ms |
| 95% Percentile | 0.1 ms | 0.1 ms | 0.1 ms |
| 99% Percentile | 100 ms | 100 ms | 100 ms |
| 99.9% Percentile | 200 ms | 200 ms | 200 ms |
| 99.99% Percentile | 20000 ms | 20000 ms | 20000 ms |
| RAM used (start -> end) | 16.03 -> 18.152 GB | 6.427 -> 7.867 GB | 16.029 -> 20.386 GB |
| Gen0, # per second | 1466.439 /s | - | - |
| Gen1, # per second | 380.298 /s | - | - |
| Gen2, # per second | 13.091 /s | - | - |
| Thread Pause | | | |
| % of time frozen | 47.137 % | 35.697 % | 35.812 % |
| # per second (Min / Avg / Max) | 371 /s / 784.962 /s / 1157 /s | 2 /s / 852.207 /s / 1153 /s | 1 /s / 760.402 /s / 839 /s |
| 50% Percentile | 785 /s | 866 /s | 771 /s |
| 90% Percentile | 1032 /s | 1100 /s | 802 /s |
| 95% Percentile | 1075 /s | 1126 /s | 819 /s |
| 99% Percentile | 1147 /s | 1140 /s | 836 /s |
| 99.9% Percentile | 1149 /s | 1144 /s | 837 /s |
| 99.99% Percentile | 1149 /s | 1144 /s | 837 /s |
| Stop The World | | | |
| % of time frozen | 44.096 % | 32.53 % | 33.965 % |
| # per second | 658.928 /s | 757.986 /s | 735.363 /s |
| Pause duration (Min / Avg / Max) | 0 ms / 0.669 ms / 443.255 ms | 0 ms / 0.429 ms / 544 ms | 0 ms / 0.462 ms / 45.296 ms |
| 50% Percentile | 0.579 ms | 0.212 ms | 0.416 ms |
| 90% Percentile | 0.958 ms | 0.314 ms | 0.517 ms |
| 95% Percentile | 1.012 ms | 0.355 ms | 0.569 ms |
| 99% Percentile | 1.278 ms | 1.187 ms | 1.166 ms |
| 99.9% Percentile | 3.704 ms | 6.21 ms | 7.483 ms |
| 99.99% Percentile | 7.301 ms | 508.731 ms | 22.55 ms |
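
As a rough sanity check on the "tenfold" figure, here is a minimal Go sketch that only divides the global-pause numbers reported in the runs above. The `ratio` helper and the variable names are made up for illustration; the numbers come from the benchmark output, and the 5 ms value is the target mentioned earlier in the thread.

```go
package main

import "fmt"

// ratio reports how many times larger a is than b.
// Hypothetical helper, used only for this illustration.
func ratio(a, b float64) float64 {
	return a / b
}

func main() {
	// Global (STW) pause figures in milliseconds, copied from the 50% RAM runs.
	const (
		wsMaxMs    = 443.255 // Workstation GC, max pause
		srvMaxMs   = 45.296  // Server GC, max pause
		wsP9999Ms  = 7.301   // Workstation GC, 99.99% percentile
		srvP9999Ms = 22.55   // Server GC, 99.99% percentile
		targetMs   = 5.0     // pause target mentioned earlier in the thread
	)

	fmt.Printf("max pause, Workstation vs Server:     %.2fx\n", ratio(wsMaxMs, srvMaxMs))
	fmt.Printf("99.99%% pause, Server vs Workstation:  %.2fx\n", ratio(srvP9999Ms, wsP9999Ms))
	fmt.Printf("Server max pause vs 5 ms target:      %.2fx\n", ratio(srvMaxMs, targetMs))
}
```

By these numbers the max pause shrinks by a factor of roughly 9.8, while at the 99.99% percentile the Server GC run is about 3 times slower than the Workstation GC run, so the conclusion depends on which statistic you compare.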

sgf commented 4 months ago

The improvement is not big enough, not radical enough. Ideally, STW pauses should stay below 5 ms, heaps of up to 256 GB should be handled easily, and the STW time and frequency should not depend on the number of objects.

Since my PC's memory is limited, you would need to test on a server to get representative results. For large-scale websites or large-scale game servers, 48 GB, 64 GB, 96 GB, and 128 GB of memory are common.

InCerryGit commented 4 months ago

Here are the results of the 189 GB test for .NET 9 Preview 6 running on a 32-core, 256 GB server:

[image: ServerGC results]

[image: ServerGC + DATAS results]

Go only completed the 126 GB test because it ran out of memory (OOM) at 189 GB. Because Go's GC is neither generational nor compacting, its memory usage is significantly higher: the 126 GB test required 217 GB of memory to complete.
