dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Is vxsort with workstation GC a common scenario? #84749

Closed MichalStrehovsky closed 1 year ago

MichalStrehovsky commented 1 year ago

We currently enable VXSORT in the GC under the following ifdefs:

https://github.com/dotnet/runtime/blob/6875ba02dbde6477abe16f0863348e617cc03fcb/src/coreclr/gc/gc.cpp#L21-L25

I experimentally modified the ifdef by adding `&& defined(SERVER_GC)`. With that, the size of a self-contained NativeAOT hello world went down by ~10%. We currently build two flavors of the NativeAOT runtime - one with the server GC and one with the workstation GC. The server GC is opt-in through the publicly documented `ServerGarbageCollection` MSBuild property and is not the default.

|                        | VXSORT  | No VXSORT |
|------------------------|---------|-----------|
| Default                | 1.41 MB | 1.30 MB   |
| InvariantGlobalization | 1.19 MB | 1.08 MB   |

(It is likely the InvariantGlobalization case will be the default for projects created with `dotnet new console -aot`.)

This feels like a pretty good savings. The question is: would this be a compromise? Is a workstation GC with heaps large enough to benefit from VXSORT a common scenario? We want Native AOT to have comparable performance, but size is also one of the things we take seriously.

Cc @dotnet/gc @dotnet/ilc-contrib

ghost commented 1 year ago

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas See info in area-owners.md if you want to be subscribed.

jkotas commented 1 year ago

The GC team has been working on a GC with dynamic heap count (https://github.com/dotnet/runtime/pull/84168). I expect that the GC with dynamic heap count is going to become the universal default at some point (.NET 9+), and the workstation vs. server choice will become irrelevant for the most part.

We may want to tie omitting vxsort to `<OptimizationPreference>Size</OptimizationPreference>`.
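For context, `OptimizationPreference` is an MSBuild property set in the project file. A sketch of how the relevant settings fit together (both properties shown are publicly documented today; the proposed vxsort tie-in does not exist yet):

```xml
<PropertyGroup>
  <PublishAot>true</PublishAot>
  <!-- Under the proposal above, preferring size would also omit vxsort. -->
  <OptimizationPreference>Size</OptimizationPreference>
  <!-- Server GC is opt-in; workstation GC is the NativeAOT default. -->
  <ServerGarbageCollection>false</ServerGarbageCollection>
</PropertyGroup>
```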

Sergio0694 commented 1 year ago

Would it make sense and/or be possible to have that size optimization set a default value for another property that controls whether vxsort is included, while still allowing that property to be set manually? Rationale: if you have a NAOT application where you care about speed but which doesn't deal with a high volume of allocations (the first example that comes to mind is a renderer of some sort, like this ComputeSharp NAOT sample), it could be nice to set the optimization preference to speed while still dropping vxsort, as it wouldn't really be useful anyway 🤔

PeterSolMS commented 1 year ago

The impact on GC pause time of dropping vxsort is likely to be a few percent, depending mostly on the number of objects surviving GC. Sorting is essentially O(n log n), so the relative impact on GC pause time grows with the number of surviving objects.

The vxsort code won't kick in below 8 × 1024 (8,192) objects (because the CPU may reduce clock speed when AVX2 instructions are used), but that threshold is not hard to exceed in practice.

It probably would be useful to take some traces with CPU samples to see what the impact on realistic scenarios actually is.