Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Checkpoint - investigate suspending threads before prepare and resuming after restore #13751
Investigate if it is possible to suspend running Liberty threads before doing a prepare operation for checkpoint and then resume the threads after running the restore hooks when restoring the process.
This protects threads from resuming "too early", that is, before the restore hooks have restored their state to an acceptable level.
/**
 * Sets the prepare hook which is called after pausing all application threads and before the process checkpoint
 * is done.
 * <p>Default: null
 *
 * @param prepare a function run after the JVM has paused all application threads and before the JVM checkpoint is performed
 * @return this
 */
public CRIUSupport setPrepare(Callable<Boolean> prepare) {
    ...
}

/**
 * Sets the restore hook which is called before resuming all suspended threads.
 * <p>Default: null
 *
 * @param restore a function run after the JVM has restored but before resuming all suspended threads
 * @return this
 */
public CRIUSupport setRestore(Callable<Boolean> restore) {
    ...
}
On checkpoint:
Call checkpoint API
JVM suspends all other Threads
-------- enter single threaded phase ----------
Run application hooks
Run JVM hooks
Checkpoint JVM
On Restore:
Restore JVM
Run JVM hooks
Run application hooks
-------- exit single threaded phase ----------
Resume all other threads
Return from checkpoint API
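The ordering above can be sketched as a small simulation. This is only an illustration: CheckpointSequenceSketch is a hypothetical stand-in that borrows the setPrepare/setRestore names from the proposed API, records the steps in a log, and does not suspend real threads or invoke CRIU. Treating a hook that returns false (or throws) as a failure is an assumption, not something the proposal specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical stand-in: records the order of steps only. It does not suspend
// real threads or invoke CRIU.
class CheckpointSequenceSketch {
    final List<String> log = new ArrayList<>();
    private Callable<Boolean> prepare; // application prepare hook, default null
    private Callable<Boolean> restore; // application restore hook, default null

    CheckpointSequenceSketch setPrepare(Callable<Boolean> prepare) {
        this.prepare = prepare;
        return this;
    }

    CheckpointSequenceSketch setRestore(Callable<Boolean> restore) {
        this.restore = restore;
        return this;
    }

    // Walks through the proposed ordering. Returns false if a hook reports
    // failure or throws (an assumption; the proposal leaves this open).
    boolean checkpoint() {
        log.add("suspend other threads"); // enter single-threaded phase
        try {
            if (prepare != null && !prepare.call()) {
                return false;
            }
            log.add("JVM hooks (checkpoint)");
            log.add("checkpoint JVM");
            log.add("restore JVM");
            log.add("JVM hooks (restore)");
            return restore == null || restore.call();
        } catch (Exception e) {
            return false;
        } finally {
            log.add("resume other threads"); // exit single-threaded phase
        }
    }

    public static void main(String[] args) {
        CheckpointSequenceSketch sketch = new CheckpointSequenceSketch();
        sketch.setPrepare(() -> sketch.log.add("application prepare hook"));
        sketch.setRestore(() -> sketch.log.add("application restore hook"));
        sketch.checkpoint();
        sketch.log.forEach(System.out::println);
    }
}
```

The point of the sketch is that both application hooks run while every other thread is still suspended, so no thread can observe state that the restore hook has not yet repaired.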
We also need to handle potential deadlock cases where the checkpointing thread has a dependency on a suspended thread.
There are some options here:
Inject a CheckpointInProgressException (CIPE) into the blocked thread and resume it so that it exits the monitor by throwing to the nearest catch point for the CIPE. This has all the same problems as Thread::stop. Mentioned for completeness, but not a good option.
Wake the blocked thread up and allow it to proceed until it releases the monitor the hook needs, then block it again. This may require waking other threads to allow complicated synchronized accesses to proceed (e.g. Thread A holds Lock1 and is waiting on Lock2, which is held by Thread B, etc.).
Have the application implement its own coordination layer to quiesce its threads, akin to the volatile boolean threadSuspended suggested in [1].
Throw a "DeadLockDetectedException" if the application hooks ever attempt to acquire a lock that's already held by another thread. This would prevent the snapshot and report the problem on a thread that may be ready to handle it, e.g. by waiting and retrying the snapshot later.
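As an illustration of the application-level coordination option, here is a minimal sketch in which a worker polls a volatile threadSuspended flag and parks itself at a safe point where it holds no locks the hooks might need. The class QuiescableWorker and its quiesce/release/shutdown methods are hypothetical names for this example, not part of any OpenJ9 or Liberty API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical worker that can be quiesced cooperatively.
class QuiescableWorker implements Runnable {
    private volatile boolean threadSuspended; // flag name taken from the option text
    private volatile boolean running = true;
    private boolean parked;                   // guarded by this
    final AtomicLong iterations = new AtomicLong();

    @Override
    public void run() {
        while (running) {
            iterations.incrementAndGet();     // stands in for real work
            if (threadSuspended) {
                park();                       // safe point: no other locks held here
            }
        }
    }

    private synchronized void park() {
        try {
            while (threadSuspended && running) {
                parked = true;
                notifyAll();                  // tell quiesce() the safe point was reached
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            running = false;
            notifyAll();                      // do not leave quiesce() waiting forever
        } finally {
            parked = false;
        }
    }

    // Blocks until the worker is parked at its safe point (or has stopped).
    synchronized void quiesce() throws InterruptedException {
        threadSuspended = true;
        while (!parked && running) {
            wait();
        }
    }

    // Lets a parked worker continue.
    synchronized void release() {
        threadSuspended = false;
        notifyAll();
    }

    // Stops the worker, whether it is parked or running.
    synchronized void shutdown() {
        running = false;
        threadSuspended = false;
        notifyAll();
    }
}
```

In this scheme quiesce() would have to be called before the checkpoint API suspends the other threads, since a worker that is already suspended can never reach its safe point. release() could then run once the restore hooks have finished.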
OMR functions omrintrospect_threads_* may be helpful. I believe they're currently only used when producing javacore files, but they should be reusable for this.
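Detecting such a dependency on Java monitors would need VM support, but the "DeadLockDetectedException" option can be approximated at the application level for explicit locks. The sketch below assumes the hook uses a java.util.concurrent.locks.ReentrantLock. HookLocks, SnapshotRetrySketch and the exception class itself are hypothetical names for this example. It relies on tryLock(), which fails immediately if another thread owns the lock and succeeds reentrantly if the current thread already holds it.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical exception type named after the option text.
class DeadLockDetectedException extends RuntimeException {
    DeadLockDetectedException(String message) {
        super(message);
    }
}

class HookLocks {
    // Acquire a lock from inside a hook. In the single-threaded phase a lock
    // owned by another thread cannot be released, so fail instead of blocking.
    static void lockOrFail(ReentrantLock lock) {
        if (!lock.tryLock()) {
            throw new DeadLockDetectedException("lock is held by a suspended thread: " + lock);
        }
    }
}

class SnapshotRetrySketch {
    // Runs the hook body under the lock. Returns false when the lock is owned
    // by another thread, so the caller can wait and retry the snapshot later.
    static boolean tryPrepare(ReentrantLock stateLock, Runnable saveState) {
        try {
            HookLocks.lockOrFail(stateLock);
        } catch (DeadLockDetectedException e) {
            return false;
        }
        try {
            saveState.run();
            return true;
        } finally {
            stateLock.unlock();
        }
    }
}
```

This only covers locks that the hook acquires through the helper. Plain synchronized blocks cannot be probed this way from Java code, which is where thread introspection inside the VM would come in.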
related to: https://github.com/OpenLiberty/open-liberty/issues/19040