dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

JIT differences in reporting call sites in GC info #103917

Open jakobbotsch opened 1 week ago

jakobbotsch commented 1 week ago

The JIT has an optimization for MinOpts compilations where we do not bother reporting call sites at all: https://github.com/dotnet/runtime/blob/71ab8f1434b1b472f76afe565f1984f4ae5eb87f/src/coreclr/jit/gcencode.cpp#L4497-L4500

This optimization is conditioned on noTrackedGCSlots: https://github.com/dotnet/runtime/blob/71ab8f1434b1b472f76afe565f1984f4ae5eb87f/src/coreclr/jit/gcencode.cpp#L4068-L4070

Note that JitMinOptsTrackGCrefs is 0 for xarch but 1 everywhere else, so the optimization only kicks in for xarch: https://github.com/dotnet/runtime/blob/b79c57efbfbec28f021453d0ca95ad3769a8336a/src/coreclr/jit/jitconfigvalues.h#L525-L530
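To make the gating concrete, here is a small illustrative sketch (not the actual coreclr source) of how the per-platform config default feeds the `noTrackedGCSlots` flag and, through it, call-site reporting. The function names and structure are simplified assumptions; only the flag names mirror the code linked above.

```cpp
#include <cassert>

// Per-platform default from jitconfigvalues.h: JitMinOptsTrackGCrefs is
// 0 on xarch (x86/x64) and 1 on all other targets.
constexpr bool JitMinOptsTrackGCrefs(bool isXArch)
{
    return !isXArch;
}

// noTrackedGCSlots is set for MinOpts compilations on platforms where
// JitMinOptsTrackGCrefs is 0 -- i.e. only on xarch today.
constexpr bool NoTrackedGCSlots(bool isMinOpts, bool isXArch)
{
    return isMinOpts && !JitMinOptsTrackGCrefs(isXArch);
}

// When noTrackedGCSlots is set, the encoder skips reporting call sites
// entirely; otherwise call sites are reported as usual.
constexpr bool ShouldReportCallSites(bool isMinOpts, bool isXArch)
{
    return !NoTrackedGCSlots(isMinOpts, isXArch);
}
```

So under this model the optimization only kicks in for MinOpts-on-xarch; optimized code and non-xarch MinOpts code still report call sites.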

From my measurements, the optimization saves around 23% of the GC info size reported by MinOpts contexts. As a downside, it does not allow for the suspension at return addresses that @VSadov implemented recently. We should decide whether we want to enable it generally or not.

dotnet-policy-service[bot] commented 1 week ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

jkotas commented 1 week ago

This change came from https://github.com/dotnet/coreclr/pull/9869/files#diff-081cd7404db539556560f8746ba2f1928ec14424fd64192aba02316bedac4183R219 . I do not see a reason for having platform differences in GC encoding like this one. I think we should enable the x64 path everywhere.

I think it is fine to have worse GC suspension behavior with MinOpts. MinOpts means suboptimal performance, and GC suspension tail latency is part of the bundle.

the optimization saves around 23% on GC info size reported by MinOpts contexts.

The primary motivation for the original change was JIT throughput improvement. The GC info size reduction was a nice side effect.

VSadov commented 1 week ago

Also, it is the highly optimized code with tight loops and tiny calls that benefits from more interruption points. MinOpts will likely still be fine.

jakobbotsch commented 1 week ago

The primary motivation for the original change was JIT throughput improvement. The GC info size reduction was a nice side effect.

I imagine the main throughput win within the JIT came from not computing liveness for the GC pointers; as a consequence, all GC pointers had to be reported as untracked (resulting in the size win). That part is enabled everywhere even today (by virtue of liveness analysis not running in MinOpts).
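The tracked-vs-untracked distinction above can be sketched as follows. This is a hypothetical model, not the JIT's actual data structures: a "tracked" slot carries precise live ranges computed by liveness analysis, while an "untracked" slot is assumed live for the whole method; with liveness skipped (MinOpts), every slot must fall into the second bucket.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a GC-typed stack slot.
struct GCSlot
{
    bool hasLiveRanges; // true if liveness analysis produced ranges for it
};

// Count how many slots must be reported as untracked. In MinOpts, liveness
// never runs, so we cannot prove any slot dead and must report them all as
// untracked (live for the whole method) -- the source of the size win above.
std::size_t CountUntracked(const std::vector<GCSlot>& slots, bool minOpts)
{
    std::size_t untracked = 0;
    for (const GCSlot& s : slots)
    {
        if (minOpts || !s.hasLiveRanges)
            ++untracked;
    }
    return untracked;
}
```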

The size optimization here is somewhat separate: it is realizing that if a call site has no interesting GC information associated with it, then we can avoid describing in the GC info that any call site exists there at all. Once we stopped tracking GC-typed locals I suppose this became true in MinOpts, but I think this size optimization would have been equally applicable in optimized code that ended up without any interesting GC information at a specific call site before @VSadov's recent work.
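The call-site omission described above could be sketched like this. Again a hypothetical model rather than the real encoder: each call site carries a set of tracked GC slots live across it, and sites whose live set is empty are simply left out of the encoded GC info.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of a call site as seen by the GC info encoder.
struct CallSite
{
    std::uint32_t offset;       // code offset of the call instruction
    std::uint32_t liveSlotMask; // tracked GC slots live across the call
};

// Keep only call sites that have something interesting to report. Sites
// with an empty live set are omitted entirely -- the size win. The
// trade-off: the runtime no longer knows a safepoint exists at the
// omitted return addresses, so it cannot suspend the thread there.
std::vector<CallSite> EncodeCallSites(const std::vector<CallSite>& sites)
{
    std::vector<CallSite> encoded;
    for (const CallSite& cs : sites)
    {
        if (cs.liveSlotMask != 0)
            encoded.push_back(cs);
    }
    return encoded;
}
```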

VSadov commented 1 week ago

I think this size optimization would have been equally applicable in optimized code that ended up without any interesting GC information at a specific call site before @VSadov's recent work

I think this is correct. We have some code that relies on the presence of call sites/safepoints (like extra interruptibility points, asserts) or on enumerating all safepoints in the GC info (like GC stress), but those are all basically reliability enhancements and not required just to make code run.

There should be a lot more opportunities for this optimization in MinOpts, so we could not pass that up and enabled it. For optimized code, I'd expect the opportunity to be limited, so perhaps it is not worth it, considering that "uninteresting" safepoints are still useful, even if not strictly required.

If it is easy to measure, I wonder what the impact would be if this were enabled in optimized code. At least as a data point.