tajila opened 1 year ago
Based on a discussion with @vijaysun-omr, we came up with a few possible ways forward.
This is relatively straightforward to do; in fact, this is what we currently do for -Xtrace/-Xrs. However, the problem is that this does not guarantee that JIT'd code will stop executing; any JIT'd code on the stack will continue to execute until a new invocation, at which point execution will transfer to the interpreter.
This is probably less of an option for the JVM and more for Applications; an application can be configured to handle the failure and instead start a new JVM in default mode. This would not maintain Dev/Prod Parity, but it is a fallback option that would at the very least guarantee functionality from a Java Application User pov.
Generating code as if the JVM is in FSD mode means running in Involuntary OSR Mode. This means any yield point can be a place where the VM triggers the transition of a thread from JIT'd code to the interpreter. The downside of this approach is that FSD compliant JIT code is around 30% slower. However, this may not matter too much for first response; for steady state throughput, these FSD bodies can be generated with GCR trees to force recompilation post restore.
An important subtlety here is that if debug is not enabled post-restore but redefinition is still possible, the code cache will have some method bodies that support involuntary OSR (i.e., those that were generated pre-checkpoint) while the rest support voluntary OSR. As such, the VM will need to check a (yet to exist) flag in the body's metadata to determine what type of OSR was used. When redefinition needs to occur, the VM will need to check, at a yield point, whether the body was compiled to support involuntary OSR, and if so, decompile it regardless of the type of yield point; otherwise, normal Voluntary OSR mechanics apply.
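A minimal sketch of that per-body check, assuming a hypothetical metadata flag (the real OpenJ9 metadata structure is not modelled here):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-body metadata flag; the real OpenJ9 structure differs. */
typedef struct BodyMetadata {
    bool usesInvoluntaryOSR; /* true for bodies compiled pre-checkpoint (FSD-style) */
} BodyMetadata;

/* At a yield point during a redefinition event: pre-checkpoint (FSD-style)
 * bodies must be decompiled regardless of the kind of yield point;
 * voluntary-OSR bodies follow normal guard-patching mechanics instead. */
static bool mustForceDecompile(const BodyMetadata *md)
{
    return md->usesInvoluntaryOSR;
}
```

The point of the sketch is only that the decision is per compiled body, not global, once the two kinds of bodies coexist in the code cache.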
If option 3 is too expensive, another approach is to run in a suboptimal Voluntary OSR Mode. Rather than run the Fear Analysis to minimize the OSR transition points, we force the transition points to be the exact set of yield points that are used to ensure that redefinition occurs; while this set is larger than what would result from an optimal OSR analysis, it is still likely smaller than the set of points in option 3.
However, an important caveat here is that any yield point that is not used to ensure that redefinition occurs must be ignored by the VM for the purpose of checkpointing; the thread should be allowed to continue execution until it hits one of these yield points that is also a transition point (it is guaranteed that the thread will not execute indefinitely before reaching such a point).
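That filtering could look roughly like the following sketch, with an assumed classification of yield points (the actual set, and how the VM represents it, is not defined here):

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed yield-point classification, for illustration only. */
typedef enum YieldKind {
    YIELD_ASYNC_CHECK,
    YIELD_CALL,
    YIELD_MONITOR_ENTER,
    YIELD_OTHER /* a yield point not used to guarantee redefinition */
} YieldKind;

/* Under option 4, a thread may halt for checkpoint only at yield points
 * that are also OSR transition points; at any other yield point it must
 * be allowed to resume the JIT'd code. */
static bool canHaltForCheckpoint(YieldKind kind)
{
    switch (kind) {
    case YIELD_ASYNC_CHECK:
    case YIELD_CALL:
    case YIELD_MONITOR_ENTER:
        return true;  /* also an OSR transition point */
    default:
        return false; /* let the thread keep running compiled code */
    }
}
```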
Another caveat is that we will need to add Voluntary OSR support for AOT (https://github.com/eclipse-openj9/openj9/issues/4849).
I am going to start investigating the perf impact of option 3 first. Specifically, I will generate two builds where:
FYI @gacholio
any yield point that is not used to ensure that redefinition occurs must be ignored by the VM for the purpose of checkpointing; the thread should be allowed to continue execution until it hits one of these yield points that is also a transition point
I'm not too familiar with this detail; how do we differentiate this in the VM? There are two main mechanisms we use, exclusive and safepoint exclusive. @gacholio thoughts?
My impression from discussion with Tobi is that we would just discard all the compiled code if debug was enabled on restore. This avoids any number of difficult issues. The checkpoint code uses safepoint exclusive, so all threads will certainly be at an OSR point.
@gacholio that is captured in Irwin's cases 3 and 4. From my understanding, what Irwin is saying is that the JIT either needs to be in FSD mode (non-default) or Voluntary OSR (default) mode for us to decompile the JIT frames on the stack.
The checkpoint code uses safepoint exclusive, so all threads will certainly be at an OSR point.
To me this sounds like we could then use case 4, which is the cheaper option.
To me this sounds like we could then use case 4, which is the cheaper option.
The OSR I'm talking about is I believe involuntary, in that we force it on all threads (it's not induced by a failed check in the compiled code). Does involuntary require FSD? I didn't think so.
Does involuntary require FSD?
FSD involves involuntary OSR; normal HCR enabled mode uses voluntary OSR.
So either we need to start in involuntary mode always (or at least if we want to support the possibility of debug) or add guards at every OSR point to check for the switch (maybe this can be done via the assumptions mechanism?).
add guards at every OSR point to check for the switch (maybe this can be done via the assumptions mechanism?).
Well, once the guards are patched it will always transition to the VM. As such, once we enter into debug mode, the entire code cache might as well be discarded (same with the AVL trees). However, if we don't enter debug mode, the code quality should be better than with involuntary osr mode.
Also, with this approach, at the time when the VM wants to stop threads to prepare for checkpoint, if the thread hits some other yield point that isn't an OSR transition point, it needs to be allowed to return back to running JIT'd code; it's only in involuntary osr mode that all yield points are OSR transition points. That's why if we can get away with involuntary osr mode pre-checkpoint, that would be the simplest approach to take.
if the thread hits some other yield point that isn't an OSR transition point, it needs to be allowed to return back to running JIT'd code;
This is the part that is challenging. I'm not sure how we detect this.
As such, once we enter into debug mode, the entire code cache might as well be discarded
I believe we will be reinitializing the send targets for all methods when we restore, which has the effect of abandoning all of the compiled code (by which I mean the interpreter will never invoke it again), so normal CCR should be able to discard the old method bodies once every running invocation has OSRed back to the interpreter.
This is the part that is challenging. I'm not sure how we detect this.
Let's not do this - it's essentially another layer of exclusive on top of safepoint, which would be completely unmanageable (I'd already like to see some proof that safepoint is valuable given how many problems it has had).
This is the part that is challenging. I'm not sure how we detect this.
@vijaysun-omr could elaborate more on this perhaps, but he did mention that there are only very specific bytecodes that matter for the purpose of (in a normal run) ensuring that we yield to allow a redefinition event (for example, if we're in a loop with no monents (monitor enters)/invocations, we need to ensure that we don't loop indefinitely).
If there's some way to identify at the yield point / transition point what the bytecode is supposed to do, we would be able to distinguish between normal yield points and OSR transition points. Of course, the critical point here is that the set of OSR transition points must be the set of yield points that are necessary to ensure a redefinition event. It may also be that when we transition via OSR, we end up in a different place than when we yield via a normal yield point, so that too could be a distinguishing factor.
That said, I don't know if what I just described is absolutely accurate, so I'll let Vijay clarify.
I am under the impression that under our present default HCR implementation, the VM only allows actual class redefinition to occur at certain yield points, and my understanding is that those yield points are 1) async checks 2) method calls (probably via stack overflow check) and 3) monitor enter.
If this is not how the VM is doing class redefinition, then please clarify. If this is how the VM is doing class redefinition, then I don't understand what more is needed in order to support option 4 in Irwin's post.
Redefinition can occur at any place that releases VM access. These would include:
With some exceptions, if you call out from compiled code, that's a redef point (some JIT helpers will never release VM access, so we'll need to be very careful in future if we change a helper and the JIT has assumed it will not release VM access).
The only practical solution for compiled code is to discard it entirely on restore (i.e. post decompiles for every compiled frame in every thread). This will naturally result in the debug interpreter being invoked after the decompiles.
Safepoint HCR means that object allocation is not an OSR/decompile point (the checkpoint code gets that kind of access if necessary).
The requirement is that we have an OSR block at all of the possible locations that a method could be paused (by safepoint exclusive). I'd rather not rely on guards to accomplish this since it would be very hard to distinguish which points will rely on the guard fail and which need to be forced into OSR.
When we restore, we will mark all frames in all stacks for decompile, and reset all method send targets back to their default (count and compile in the JIT case). Eventually, the obsolete compiled code will be unreferenced and able to be reclaimed.
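As a toy model of that restore path (all names hypothetical; the real send-target and decompile bookkeeping lives inside the VM):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct FrameModel {
    bool compiled;
    bool markedForDecompile;
} FrameModel;

typedef struct MethodModel {
    bool sendTargetIsCompiledCode;
} MethodModel;

/* On restore: flag every compiled frame for decompile-on-return, and reset
 * every method's send target to its default (count-and-compile), so the
 * obsolete code is never invoked again and can eventually be reclaimed. */
static void onRestore(FrameModel *frames, size_t nFrames,
                      MethodModel *methods, size_t nMethods)
{
    for (size_t i = 0; i < nFrames; i++) {
        if (frames[i].compiled) {
            frames[i].markedForDecompile = true;
        }
    }
    for (size_t i = 0; i < nMethods; i++) {
        methods[i].sendTargetIsCompiledCode = false;
    }
}
```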
That list of program points in compiled code from @gacholio where class redefinition may occur (ignoring FSD for the moment) matches what we used to have, until some more OSR changes were made to the design a few years ago, as I understand it. The basis of this understanding is this code:
The code under the if-condition I pasted only checks for calls, async checks and monitor enters as spots where it needs to arrange for OSR transitions ("post execution OSR" there means it will set up the OSR transition after those operations are done and we return back to the JITed code) https://github.com/eclipse/omr/blob/2d5ac63fbe881f0af035ef2732b22f85eb3893dd/compiler/compile/OMRCompilation.cpp#L637
There is also this comment that alludes to what that code does: https://github.com/eclipse-openj9/openj9/blob/163a51495d5fb2b004ba596846de1738b454bcc2/runtime/compiler/optimizer/OSRGuardInsertion.cpp#L649
There must have been some VM code added to ensure we only redefine at those 3 points since the JIT is not in charge of where class redefinition occurs. The point of debate being this category which the above JIT code does not seem to consider anymore as a place where redefinition is possible:
changes were done to the design a few years ago
You are likely referring to safepoint OSR, which only eliminates object allocation from the list of HCR points:
Looking at the code, in HCR (not FSD) mode, the VM does not force decompile anywhere - it calls jitClassesRedefined
with a list of modified classes/methods so the JIT can patch what it needs to.
So, I suppose it's up to the JIT to determine where HCR checks need to be inserted to ensure correctness.
One thing I think we've all forgotten (and I've just remembered) is that HCR does not affect existing frames on the stack. The requirement is that all new method invocations target the most current version of the method.
This may mean that existing HCR/OSR is not sufficient to accomplish what's needed here as we will be unable to simply discard the code cache like we do for FSD (extended) HCR.
There are two different concepts at play here:
In the case of FSD, i.e. when we use Involuntary OSR, the sets of these points end up being the same from the point of view of the JIT because all those yield points mentioned by Gac are decompilation points.
In the case of default HCR, i.e. when we use Voluntary OSR, from the point of view of the JIT, redefinition and decompilation points are not necessarily the same. In general, a thread yields to the VM to allow a STW redefinition event to occur, and then the thread continues executing until it reaches a decompilation point. The only yield points that could be redefinition points are, as Vijay mentioned, asynccheck, calls, and monents. This can be seen here: the selected if branch above is what runs by default.
What Option 4 in https://github.com/eclipse-openj9/openj9/issues/17642#issuecomment-1602783510 proposes is to essentially make the set of redefinition points (from the JIT's pov in Voluntary OSR mode) also the set of decompilation points. This can be implemented in two ways:
1 is obviously the cleaner approach, but 2 may be more practical in terms of being able to reuse non-FSD infrastructure.
At any rate, the question of what the redefinition points and the decompilation points are is an orthogonal concern to Option 4 above, which banks on the fact that we must already be able to distinguish between the two for HCR.
All that said, if the assumption that redefinition cannot occur outside of asynccheck, calls, and monents is wrong, then HCR has a longstanding bug independent of the CRIU feature. As far as the JIT is concerned right now, it generates code assuming that redefinition can only occur at these three types of yield points. The code comment linked in https://github.com/eclipse-openj9/openj9/issues/17642#issuecomment-1604879758 was first added around May 2016. @gacholio do you know what VM changes were added around that time frame that might explain why that comment exists?
All that said, if the assumption that redefinition cannot occur outside of asynccheck, calls, and monents is wrong, then HCR has a longstanding bug independent of the CRIU feature.
Classically, HCR could occur any time VM access can be released. That includes all of the places (and possibly more) that I detailed above.
The only HCR change I can think of is the safepoint OSR (which I think you refer to as nextGenHCR). This disallows HCR at object allocation points.
When the HCR occurs, the VM does not add any decompilations - it reports the modified classes/methods so the JIT can do the appropriate patching (presumably invalidating calls to any potentially-replaced methods). As stated above, there's no need to decompile when the thread resumes - it's fine to wait until a new method invocation is going to take place (even then, if you know that the invoked method has not been replaced, you can just go ahead and invoke it).
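The reporting flow described above might be sketched like this; the structures and function names are invented for illustration and do not reflect OpenJ9's actual jitClassesRedefined signature:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: the VM hands the JIT the set of redefined methods;
 * the JIT invalidates compiled call sites so the *next* invocation
 * dispatches to the new version. Running frames are left alone. */
typedef struct RedefModel {
    bool redefined;
    bool callSitesInvalidated;
} RedefModel;

static void reportClassesRedefined(RedefModel *methods, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (methods[i].redefined) {
            /* e.g. patch OSR guards / unlink dispatch targets */
            methods[i].callSitesInvalidated = true;
        }
    }
}

/* A new invocation may go straight to compiled code only if the
 * target method was not replaced. */
static bool canInvokeCompiledVersion(const RedefModel *m)
{
    return !m->redefined;
}
```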
It's tempting to use voluntary OSR to let the decompiles trickle in as the compiled code detects the restore, but this won't work properly in the debugger (an obvious example is that the debugger would not be able to query locals in frames that remain compiled without FSD).
I think the only way this will work is to make every escape point (except allocation points in next gen) from the compiled code into an OSR point, and do the force decompile (involuntary) on restore.
It's tempting to use voluntary OSR to let the decompiles trickle in as the compiled code detects the restore, but this won't work properly in the debugger (an obvious example is that the debugger would not be able to query locals in frames that remain compiled without FSD).
@gacholio that sounds a lot like Graeme's @SelectiveDebug technology from many many years ago. Is that a reasonable approach to build off of, where existing frames are marked in some way to indicate they can't be debugged (and use the correct stack mapper) and new invocations are debuggable?
How valid this is depends on the user requirements but it seems like a reasonable position to me.
@SelectiveDebug technology
I don't see the correlation, and I would have to say no to building on top of 20-year-old abandoned tech (I doubt there's even a mention of it left in the codebase). It also does not address my above concern about locals.
We don't want to reuse the @SelectiveDebug tech, but the idea of allowing a mix of debuggable and non-debuggable frames is worth considering. The locals in non-debuggable frames would simply be unavailable - I believe there's an existing JVMTI error (JVMTI_ERROR_OPAQUE_FRAME) to return from the locals-related queries that covers this behaviour.
The only HCR change I can think of is the safepoint OSR (which I think you refer to as nextGenHCR). This disallows HCR at object allocation points.
After talking to @jdmpapin and Vijay, I believe that the three types of yield points I mentioned above do cover most of what is handled by safepoints. However, it may be that the resolve helpers are not handled; we'll have to take a look and see if we handle them in some other way. Either way, we would have to make them an explicit OSR point.
I think the only way this will work is to make every escape point (except allocation points in next gen) from the compiled code into an OSR point, and do the force decompile (involuntary) on restore.
Yeah that sounds right. Actually, additionally we need to make these points also the only points that a thread can yield to allow a checkpoint. Essentially, in Option 4, we need to have the set of Redefinition Points (Escape Points/HCR Points), the set of OSR Points (Involuntary OSR Transition Points), and the set of Checkpoint Points be the same set of points.
Overall though, I do agree that if FSD compliant code pre-checkpoint is sufficient then we should just stick to that.
I launched some perf runs to measure the impact of generating FSD compliant code. I ran the pingperf and restcrud apps; as the names suggest, pingperf is a simple OpenLiberty app that responds to a request with a response, whereas restcrud queries a postgres db and returns the results.
I had 3 builds:
pingperf
Build | Startup Slowdown | First Response Slowdown
---|---|---
FSD Always | 5% | 4%
FSD Pre-checkpoint | 4% | 3%
restcrud
Build | Startup Slowdown | First Response Slowdown
---|---|---
FSD Always | 2.5% | 15%
FSD Pre-checkpoint | 2.5% | 2%
From the looks of things, the FSD approach (i.e. Option 3) looks to be sufficient to enable debug post-restore.
That said, there are some things that we need to address.
the VM will need to check a (yet to be defined) flag in the method's metadata to see if it is a FSD body
They will all be FSD bodies in solution 3, won't they? A combination of two existing facilities (decompile all methods in all stacks and reset method send targets) will allow us to do this.
Does this not effectively make the pre-checkpoint code irrelevant to non-debug restore (other than the speed getting to the checkpoint I suppose). None of the pre-checkpoint code will be run post-restore.
Does this not effectively make the pre-checkpoint code irrelevant to non-debug restore (other than the speed getting to the checkpoint I suppose). None of the pre-checkpoint code will be run post-restore.
That isn't the point of the FSD bodies; if none of the pre-checkpoint code is supposed to run post-restore in a non-debug world, then there wouldn't be any reason to generate compiled code in the first place pre-checkpoint. The code compiled pre-checkpoint contributes to speeding up time to first response (post-restore).
So, given that we need compiled code post-restore, and also that we need to be able to transition to the VM if debug is required post-restore, generating FSD code allows us to address both requirements.
However, the subtlety I was mentioning comes from the fact that if debug is not specified post-restore, we now have method bodies compiled pre-checkpoint that do not have OSR guards; if redefinition occurs because of HCR, then for those bodies the VM will have to trigger involuntary OSR. Methods compiled post-restore will have OSR guards, so their transitions will occur via the voluntary OSR mechanism.
Should be simple enough to make the "decompile all" I mentioned only do it for FSD bodies once the metadata query is made available - for a standard debug run all of the bodies will be FSD, so they will all pass the new test.
If we don't toss the FSD bodies on restore, anything compiled pre-checkpoint will not be recompiled, if I read your comments above correctly. This will likely include a lot of base JCL code which will be used post-restore.
If we don't toss the FSD bodies on restore, anything compiled pre-checkpoint will not be recompiled, if I read your comments above correctly.
Yeah that's with my current prototype, because we've never been in a situation where we've wanted to recompile bodies based on sampling/profiling in FSD mode. However, now I have to extend the control infra to allow recompilation in the specific situation, post-restore, where we have FSD code but we're not in FSD mode.
OK, once you provide a MethodMetaData query for the FSD flag, I'll update the VM code. This should be completely transparent for normal runs. I can see it being interesting to have FSD bodies that don't recompile during the checkpoint run, but do recompile after restore. Maybe just reset the sampling counters on restore?
I opened https://github.com/eclipse-openj9/openj9/pull/17798 for the metadata flag.
I can see it being interesting to have FSD bodies that don't recompile during the checkpoint run, but do recompile after restore. Maybe just reset the sampling counters on restore?
Yeah that's the general approach I'm thinking of taking; will need to find good count values to ensure we recompile methods a bit more aggressively post-restore.
I think we'll want to add a flag to optimize this case - something indicating that we are post-restore with FSD bodies possibly running on stacks while we are now in fast HCR mode. On the first HCR with the flag set, we'll do the full decompile of every stack and clear the flag so that future HCRs can avoid scanning every stack unnecessarily.
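The one-shot flag could look something like this sketch (the flag name is invented; real OpenJ9 runtime flags and their synchronization are not modelled):

```c
#include <assert.h>
#include <stdbool.h>

/* Invented flag: set on restore when FSD bodies may still be on stacks
 * while the VM is back in fast (voluntary-OSR) HCR mode. */
static bool postRestoreFSDOnStacks = true;

/* On each HCR: only the first one after restore pays for a full scan of
 * every stack; the flag is then cleared so later HCRs skip the scan. */
static bool hcrNeedsFullStackScan(void)
{
    if (postRestoreFSDOnStacks) {
        postRestoreFSDOnStacks = false;
        return true;
    }
    return false;
}
```

In a real VM this test-and-clear would need to happen under exclusive access (or atomically), since HCR itself runs with threads halted.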
On the first HCR with the flag set, we'll do the full decompile of every stack and clear the flag so that future HCRs can avoid scanning every stack unnecessarily.
I'm not sure if we support decompiling a method that doesn't use Involuntary OSR if the redefinition doesn't impact that method. That is, with Involuntary OSR, the VM can trigger the transition to the interpreter without the need to patch any guards in compiled code. However, with Voluntary OSR, the only way execution can transition from compiled code to the interpreter is by patching a guard, which is done via the Runtime Assumptions.
When the first HCR occurs, any thread that has an FSD body on the top of the stack can be decompiled, but those threads that don't will need to wait until the next yield point at which an FSD body is at the top of the stack. The only other way to truly decompile when you have Voluntary OSR is to tell the JIT that all classes have been redefined, but I don't know what the consequence of that is in terms of the CH Table and what not.
@vijaysun-omr can confirm if my assessment is accurate or not.
To clear up any possible confusion, note that all decompiles occur upon return, with voluntary or involuntary OSR. The topmost frame is special only in that it has called out to a helper (the same mechanism is used to patch the return address of helper calls and calls to methods).
Though I didn't say it earlier, I assume we'll only be decompiling frames that are running the FSD bodies. Those should be prepared for a decompile at every escape point.
I assume we'll only be decompiling frames that are running the FSD bodies.
Ah I see, I read "we'll do the full decompile of every stack" to mean every thread.
However, with respect to
future HCRs can avoid scanning every stack unnecessarily
I don't think we can get away with that. The reason is that we could have a situation where we do have an FSD body on the stack but it's not the topmost method (which is compiled with Voluntary OSR). In that case, we can't decompile that stack (unless the topmost method was invalidated). This means that we could still have some FSD bodies on some stacks even after the first HCR.
The FSD bodies will not be executed - they will be decompiled as soon as they are returned to. Assuming you'll do the appropriate fixups in any compiled code that was calling to an FSD body (or patching the FSD bodies to immediately recompile), no FSD-compiled code will execute after the first HCR.
The FSD bodies will not be executed - they will be decompiled as soon as they are returned to.
Oh, I didn't understand that's what you meant by "all decompiles occur upon return". How does that work when you have a JIT method returning to another JIT method? I was under the impression that when a thread running FSD compiled code yields and the VM wishes to decompile, it causes the thread to jump to the helper. Is it that the VM updates the return address on the stack to return to the helper instead? If it's the latter, then on the first HCR, the RA of the top-most FSD body on a stack will need to be updated (rather than just the top-most method).
Assuming you'll do the appropriate fixups in any compiled code that was calling to an FSD body (or patching the FSD bodies to immediately recompile)
Yeah the idea is to recompile everything ASAP (perhaps even in the pre-checkpoint hook), but we'll still need to ensure that a recompilation failure is handled for JIT to JIT invocations.
The stack walker knows where the return addresses are stored, so we just patch that slot to return to the decompile helper (the VM maintains a side stack of decompile records that among other things contain a copy of the original return address). This works for all the escapes from the compiled method, be it to another compiled method, an interpreted method or a helper call.
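That patching scheme might be modelled as follows; the structures are deliberately simplified (real decompile records carry much more state) and all names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

typedef void (*CodePtr)(void);

/* Stand-ins for real code addresses. */
static void decompileHelper(void)    { /* would transition to the interpreter */ }
static void someCompiledMethod(void) { /* stands in for the original return site */ }

/* Simplified decompile record kept on a side stack, preserving the
 * original return address that the patched slot used to hold. */
typedef struct DecompileRecord {
    CodePtr originalReturnAddress;
    struct DecompileRecord *next;
} DecompileRecord;

/* Patch the stack slot holding a frame's return address so that returning
 * to this frame first enters the decompile helper; push a record so the
 * helper can later resume at (a decompiled version of) the original site. */
static DecompileRecord *patchReturnAddress(CodePtr *raSlot,
                                           DecompileRecord *rec,
                                           DecompileRecord *sideStack)
{
    rec->originalReturnAddress = *raSlot;
    rec->next = sideStack;
    *raSlot = decompileHelper;
    return rec; /* new head of the side stack */
}
```

As the comment above notes, this works uniformly for escapes to another compiled method, an interpreted method, or a helper call, because the stack walker knows where each return address slot lives.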
I see; then yeah, as long as the RA of the topmost FSD method on the stack is patched (and not any possible compiled methods above it that are not FSD), your proposal would work.
How can I know the status of the issue?
The compiler work is being tracked here https://github.com/eclipse-openj9/openj9/issues/18866 (it does include some VM pre-requisites). For the most part, the compiler functional work is done, but we still need to reduce the footprint gap caused by generating FSD pre-checkpoint.
How can I know if now is a good time to address the VM pre-requisites as it seems like we still need to reduce the footprint gap caused by generating FSD pre-checkpoint?
The footprint gap and the VM prerequisites are independent; the work to reduce the footprint gap is not going to be impacted by the necessary VM changes.
That said you should probably coordinate with @JasonFengJ9 since I believe he's working on the VM side debug on restore work.
@JasonFengJ9 how can I potentially assist with the debug on restore work?
The first openj9 portion of the debug on restore work was
The corresponding extension repo PR (initially opened by Mike Z., now with my changes) is awaiting review
I have a draft PR for the second openj9 PR which is being tuned according to Irwin's perf results, the ETA is next week or so.
There are quite a few other CRIU open issues, please talk to @tajila for a suitable task.
It seems like a suitable task was identified after I talked to @tajila.
How can I contribute to the task?
Background
We currently have 3 interpreters: the normal one, CRIU, and debug. Ideally, we would like to get to a position where we only have two interpreters, normal and debug. The CRIU interpreter was added because capabilities (method enter/exit checks) were missing in the normal interpreter that are needed to support serviceability features like java method tracing dynamically upon restore.
Goal
Detect a request to run with the debug interpreter, then exit the normal interpreter and continue in the debug interpreter. If we can achieve this, then we gain:
Challenges
Places to detect change: