[NativeAOT][Proposal] GC stress support for NativeAOT

VSadov commented 6 days ago

We need a GC-stress story for Native AOT. We rely on JIT-based GC stress in CoreCLR, but once in a while we hit a stress bug specific to NativeAOT that could have been found a lot earlier if we had GC stress infrastructure that targeted NativeAOT directly.

There are some remnants of GC stress support in the NativeAOT codebase. The code appears to be old, predating the switch to RyuJIT, and targets a different style of suspension (i.e. completely synchronous, with polling and loop hijacking). For the current design that code is not very useful and could mostly be removed.

Experiments with simpler stress approaches, like an extra thread blasting GC.Collect() in a loop, did not yield convincing results. Varying the rate of collections, however smart, makes it either not stressful enough, too expensive, or both. It may be an interesting approach for quick/ad hoc stressing of some scenarios, but as a general-purpose GC stress mechanism it appears to be a dead end. It is better than nothing, but that is not a very inspiring bar to clear.

An approach similar to CoreCLR's, while conceptually more complex, could be more promising. The idea is to instrument safepoints with illegal instructions and then stress and fix them up as the faults are encountered. That is more thorough and also less redundant, since each location is tested roughly once.

Rough sketch of the idea:

There is a moment in time when managed code for a given AOT module is finished loading.
That is when native AOT-compiled code is all present in memory, various rehydrations, OS relocations, linker relaxations are all done and the code is ready to start executing.
This is the moment when the runtime analyzes the loaded module and performs final preparations, such as ensuring that the boundaries of managed code can be recognized and that the code can participate in GC stackwalks. This is also the time when the runtime can instrument the code for GC stress.
First we make a copy of the existing code. Currently the range of managed code for a given module is contiguous, which makes it even easier, as we can simply copy all the managed code in the module at once. It is possible to enumerate method-by-method, but there is no need.
The runtime requires that there is a way to associate an instruction location in managed code with a particular method info. Mechanisms may vary (e.g. a simplified RtlLookupFunctionEntry on Windows), but being able to associate an IP with a method is a requirement, which means native method bodies can be enumerated and associated with method infos and GC infos.
The runtime requires that for a given method all GC-safe points can be enumerated. There are several ways for this information to be encoded/inferred, but ultimately the runtime must be able to answer the question: “is this location GC-safe if a thread is interrupted there?”. That is the same question as “is this a location that we need to stress?” For example, in fully interruptible methods every instruction within the interruptible ranges would be stressable.
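
The safe-point question above can be sketched as a simple lookup. This is only an illustration under stated assumptions: `OffsetRange`, `MethodGcInfoSketch` and `IsStressableOffset` are hypothetical names, not the runtime's actual GC info decoder, and offsets are assumed to fall on instruction boundaries.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one method's GC info (not the real encoding).
struct OffsetRange {
    std::uint32_t start; // inclusive
    std::uint32_t end;   // exclusive
};

struct MethodGcInfoSketch {
    std::vector<std::uint32_t> safePointOffsets;  // explicit safe points, sorted ascending
    std::vector<OffsetRange> interruptibleRanges; // fully interruptible ranges
};

// "Is this location GC-safe if a thread is interrupted there?"
// In this sketch that is the same as "should this location be stressed?".
// Assumes `offset` is the start of an instruction within the method.
inline bool IsStressableOffset(const MethodGcInfoSketch& info, std::uint32_t offset) {
    if (std::binary_search(info.safePointOffsets.begin(),
                           info.safePointOffsets.end(), offset)) {
        return true;
    }
    return std::any_of(info.interruptibleRanges.begin(),
                       info.interruptibleRanges.end(),
                       [offset](const OffsetRange& r) {
                           return offset >= r.start && offset < r.end;
                       });
}
```

Enumerating the stressable locations of a module would then amount to walking the instruction starts of every method and keeping those for which this predicate holds.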

There is a small caveat for ISAs with variable instruction size (e.g. x64). We may need to resort to partial disassembly so that, given a safepoint location, we can figure out the size of the safepoint instruction. That is also the case in the CoreCLR implementation, and the same disassembler support could be used.

Every safe point is patched with a known illegal/privileged instruction (HLT or similar), across the entire module. The cost is proportional to the size of the code, just like loading/relocating, so it should be feasible, and likely negligible next to the cost of the subsequent per-safe-point GC stress. Most platforms require something like FlushInstructionCache after everything is patched to ensure coherence of instruction updates.
Some safe points may not be explicitly known (e.g. MinOpts may optimize away safepoint info in some cases), so we may miss these while enumerating all interruptible locations. That is the same as on CoreCLR. If there is no way to identify a safe point location, it would not be possible to asynchronously interrupt there for GC either, for the same reasons. In a way the stress still covers all interruptible points and scenarios.
We used to have a requirement that method metadata for the callee must be obtainable in order to make safepoints at call sites instrumentable. To emulate a hijack in the callee upon returning, we’d need to know what the callee returns so we could protect the return value with a special frame, if necessary. Figuring out the callees of indirect calls was particularly inconvenient. With recent changes in GC info encoding, the presence of metadata is no longer a requirement. Relaxing that requirement helps the AOT case as well.
Once the code starts executing, we special-case faults due to illegal instructions in a way similar to what CoreCLR does. Every fault results in causing or participating in a GC. After that, while the runtime is still suspended, we copy the original instruction back so as not to stress the same location repeatedly. This may need some disassembling (applied to the backup copy) to figure out the size of the instruction on some platforms. FlushInstructionCache or similar applies here too.
There are scenarios/features like WaitForPendingFinalizers that are incompatible with stress. We will have to add a way to disable stress temporarily (similarly to CoreCLR it can be a recursive counter of suppressions).
There could be some difficulties with managed NoGC code like GC callouts. Possibly all we need is to ignore faults in NoGC mode, restore the original instruction and resume. It may also be possible that the same stress-suppression mechanism mentioned above is sufficient here too.
In theory we can use any build flavor, including release. However, for better object validation it could be preferable to use special build flags or a special flavor for the native runtime (release CorLib would be ok though). It may still be required to do a special build to support GC stress suppression, if that has too much cost to be present in Release. Whether a special runtime build is required is something that needs to be found out and decided. It is not a blocker.
Some platforms may not allow modifying native code after it is loaded (e.g. macOS?). It should be possible to modify it on Windows/Linux, on both x64 and arm64. That might be sufficient for most needs.
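
The patch/fault/restore cycle from the sketch above can be modeled on a plain byte buffer. This is a toy model, not runtime code: `GcStressModelSketch` and its members are hypothetical names, the "GC" is just a counter, and real code would additionally have to change page protections and flush the instruction cache. Restoring the instruction without a GC while stress is suppressed is one possible policy, assumed here.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// x86/x64 HLT opcode: one byte, privileged, so it faults in user mode.
constexpr std::uint8_t kHltOpcode = 0xF4;

// `code_` stands in for the module's managed code range and `backup_` for the
// pristine copy made before patching. All names here are hypothetical.
class GcStressModelSketch {
public:
    explicit GcStressModelSketch(std::vector<std::uint8_t> original)
        : code_(original), backup_(std::move(original)), patched_(code_.size(), false) {}

    // Patch every known safe point (offsets are instruction starts).
    void PatchSafePoints(const std::vector<std::size_t>& offsets) {
        for (std::size_t off : offsets) {
            if (off < code_.size()) {
                code_[off] = kHltOpcode;
                patched_[off] = true;
            }
        }
    }

    // Recursive suppression counter for stress-incompatible regions.
    void SuppressStress() { ++suppressCount_; }
    void ResumeStress() { if (suppressCount_ > 0) --suppressCount_; }

    // Models the illegal-instruction fault handler. Returns false when the
    // fault is not at a patched location. `instrSize` would come from
    // disassembling the backup copy on variable-length ISAs.
    bool OnFault(std::size_t offset, std::size_t instrSize) {
        if (offset >= code_.size() || !patched_[offset]) return false;
        if (suppressCount_ == 0) ++gcCount_; // stand-in for "cause or join a GC"
        // Restore the original bytes so this location is stressed only once.
        for (std::size_t i = 0; i < instrSize && offset + i < code_.size(); ++i) {
            code_[offset + i] = backup_[offset + i];
        }
        patched_[offset] = false;
        return true;
    }

    int GcCount() const { return gcCount_; }
    std::uint8_t ByteAt(std::size_t offset) const { return code_[offset]; }

private:
    std::vector<std::uint8_t> code_;
    std::vector<std::uint8_t> backup_;
    std::vector<bool> patched_;
    int suppressCount_ = 0;
    int gcCount_ = 0;
};
```

Because each patched location is restored on its first fault, the total number of stress GCs in this model is bounded by the number of safe points that are actually reached.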

dotnet-policy-service[bot] commented 6 days ago

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

SingleAccretion commented 6 days ago

Could this stress be driven by the Jit?

The idea would be to insert something like this in emit after all safe points (interruptible instructions).

If feasible, it would have the advantage of needing fewer modifications to the rest of the system (e.g. all the places that store native code pointers in runtime data structures).

filipnavara commented 6 days ago

> Could this stress be driven by the Jit?
>
> The idea would be to insert something like this in emit after all safe points (interruptible instructions).

That may be preferable to support targets like macOS, or even iOS. That said, I kinda like the idea of using an hlt-like instruction, which could be inserted by the JIT too. The reason is that it's actually closer to the signal processing done in an actual GC. Also, any additional complex code emitted by the JIT may break scenarios that we would like to test. For example, frameless methods [on ARM64] cannot make calls, so they would not be testable by the approach in JIT_StressGC, which makes the GC call directly.

VSadov commented 6 days ago

> Could this stress be driven by the Jit?
>
> The idea would be to insert something like this in emit after all safe points (interruptible instructions).
>
> If feasible, it would have the advantage of needing fewer modifications to the rest of the system (e.g. all the places that store native code pointers in runtime data structures).

I think it is a very important advantage of HLT instrumentation that it does not require different emit. We would be testing exactly the same code with the same GC info shape. The HLT only ensures that every reachable location will be tested and generally only once.

JIT-inserted probes could work acceptably when the number of safe points is modest (e.g. call sites, loop back branches), but with fully interruptible code, where every instruction is interruptible, it could get awkward. The runtime will need to remember which locations were covered (a hash table I guess, which can get pretty large). That will prevent retesting, but the probes will still keep firing and checking the table for roughly every instruction.

There are scenarios where the JIT needs to insert NOP/BRK in the instruction stream to make sure the same safepoint cannot be reached with different GC info or be in different EH regions. Will adding a bunch of interleaving probe calls make this scenario easier or harder? Are the probes themselves interruptible? They will have to be in fully interruptible code, so will we have to extend interruptible ranges for the probes at the ends? I'd rather not answer these questions.

Inserting probes at JIT time can be made to work too, but instrumentation after loading feels closer to testing the original code.

Even for macOS, if there is a lot of desire, there could be a way. After all, a debugger can somehow set breakpoints... These HLTs are basically a bunch of single-use breakpoints.

SingleAccretion commented 6 days ago

> I think it is a very important advantage of HLT instrumentation that it does not require different emit. We would be testing exactly the same code with the same GC info shape.

Agreed.

> Runtime will need to remember what locations were covered (a hash table I guess, which can get pretty large). That will prevent retesting, but the probes will still keep firing and check the table for roughly every instruction.

One doesn't need a hash table for this - the Jit can emit an inline check for each probe, or pass the flag address to the helper:

cmp byte ptr [location_probed_flag], 1 # One byte per probe site, generated at compile time
je SKIP
call RhpStressGc
SKIP:

I also agree it is not clear this would be acceptably fast for the fully interruptible case. It is known that it is acceptably fast for the partially interruptible one (we have such a scheme implemented in NativeAOT-LLVM).
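
The per-probe flag scheme above can be modeled as follows. This is a minimal sketch with hypothetical names (`ProbeFlagsSketch`, `Probe`, `StressGc`); it assumes the helper is the one that sets the flag after triggering a GC, which the snippet above does not spell out, and it uses a counter in place of a real collection.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One byte per probe site, as in the inline check; names are hypothetical.
class ProbeFlagsSketch {
public:
    explicit ProbeFlagsSketch(std::size_t siteCount) : probed_(siteCount, 0) {}

    // Equivalent of the inline sequence: cmp [flag], 1 / je SKIP / call helper.
    void Probe(std::size_t site) {
        if (probed_[site] == 1) return; // already stressed: skip the helper call
        StressGc(site);
    }

    int GcCount() const { return gcCount_; }

private:
    // Stand-in for the stress helper; assumed to mark the site as probed.
    void StressGc(std::size_t site) {
        ++gcCount_;
        probed_[site] = 1;
    }

    std::vector<std::uint8_t> probed_;
    int gcCount_ = 0;
};
```

No lookup structure is needed, but the compare and branch still execute on every pass through a site, which is the cost being questioned for fully interruptible code.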

I suppose the code pointers problem is not that hard to solve - you 'just' need to record all of the static relocations (including for code itself, to adjust RIP-relative addresses) and re-apply them as appropriate at runtime when the code is copied.
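
As a rough illustration of re-applying recorded fixups to copied code, the sketch below adjusts 32-bit PC-relative displacements only. Everything in it is an assumption for illustration: `Rel32FixupSketch` and `ReapplyRel32Fixups` are hypothetical, absolute pointers and other relocation kinds are not handled, and the host byte order is assumed to match the target's.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical record of one 32-bit PC-relative fixup inside the code range.
struct Rel32FixupSketch {
    std::size_t fieldOffset; // offset of the 4-byte displacement within the code
    bool targetInsideRange;  // true if the target is copied along with the code
};

// Re-applies fixups to a copy placed `delta` bytes away from the original
// (delta = newBase - oldBase). Returns false if a fixup is out of bounds or
// an adjusted displacement no longer fits in 32 bits.
inline bool ReapplyRel32Fixups(std::vector<std::uint8_t>& copy,
                               const std::vector<Rel32FixupSketch>& fixups,
                               std::int64_t delta) {
    for (const Rel32FixupSketch& f : fixups) {
        if (f.targetInsideRange) continue; // source and target moved together
        if (f.fieldOffset + sizeof(std::int32_t) > copy.size()) return false;
        std::int32_t disp;
        std::memcpy(&disp, &copy[f.fieldOffset], sizeof(disp));
        // The target stays put while the instruction moves by `delta`,
        // so the displacement shrinks by the same amount.
        const std::int64_t adjusted = static_cast<std::int64_t>(disp) - delta;
        if (adjusted < INT32_MIN || adjusted > INT32_MAX) return false;
        disp = static_cast<std::int32_t>(adjusted);
        std::memcpy(&copy[f.fieldOffset], &disp, sizeof(disp));
    }
    return true;
}
```

The range check matters in practice: a copy placed more than 2 GB away from its targets cannot be fixed up this way.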

VSadov commented 6 days ago

CC: @janvorli @mangod9

dotnet / runtime

[NativeAOT][Proposal] GC stress support for NativeAOT #107850