eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.26k stars 717 forks source link

Compiler Options Processing Post Restore Design #16714

Open dsouzai opened 1 year ago

dsouzai commented 1 year ago

Service Requirements

Policies

There are several ways in which options provided post restore can be implemented. The rest of this section goes over the various policies that can be associated with the options. Note, there’s no reason that only one of these policies needs to hold for all options; different options can have different policies.

1. Options only apply from the restore point onward.

The options do not impact anything that was compiled before the restore; the options only affect the environment/compiles post restore.

2. Options apply in a manner similar to AOT loads to the extent possible

For the most part options do not impact anything that was compiled before restore; however, there are ways to control execution of code that was compiled pre checkpoint, though it isn’t guaranteed that the code can never execute.

3. Run the checkpoint as if the option was set

In the restore if it’s not set, then all future compilations do not need to be restricted; if it is set then existing compiled code does not need to be invalidated. Depending on the option, this policy can either mean the option is in effect, or that the option isn’t in effect but initialization contingent on the option occurs.

4. If an option cannot be applied retroactively, fail the checkpoint and start a new JVM in default mode

To completely guarantee that all options are applied post restore, abandon the restore and start the JVM in normal mode.

5. If an option cannot be applied retroactively, OSR out and recompile

To completely guarantee that all options are applied post restore, OSR out of code that is invalidated by the (post restore) options.

6. Option does not apply

The option does not apply in the restore run. The option can be ignored or the restore can fail.

Post Restore Options Semantics

Option Comment Policy Implemented (PR #)
-Xaot Since this is default, this is something that is only worth considering if the checkpoint run was run with -Xnoaot or -Xint. If -Xnoaot was specified pre checkpoint, there is a bunch of initialization that must occur post restore; therefore, if -Xnoaot is specified pre-checkpoint, -Xaot is ignored post-restore. For -Xint, see -Xjit below. 3 https://github.com/eclipse-openj9/openj9/pull/16716, https://github.com/eclipse-openj9/openj9/pull/16838
-Xnoaot Disable further AOT loads and compilation. 1 https://github.com/eclipse-openj9/openj9/pull/16838
-Xjit Since this is default, this is something that is only worth considering if the checkpoint run was run either with -Xnojit or -Xint. Either way, the checkpoint run should run with the JIT initialized. 3 https://github.com/eclipse-openj9/openj9/pull/16716, https://github.com/eclipse-openj9/openj9/pull/16838
-Xnojit Disable all previously compiled code and disable compiling JIT code. 2 https://github.com/eclipse-openj9/openj9/pull/16838
-XCompilationThreads Check how many are allocated vs activated and allocate more if needed on checkpoint run. 2 https://github.com/eclipse-openj9/openj9/pull/16769
-XlockReservation Theoretically possible to get working, but not for GA. 6 n/a
-Xquickstart Does not make sense to support. 6 n/a
-XsamplingExpirationTime A heuristics flag, does not impact pre checkpoint state. 1 https://github.com/eclipse-openj9/openj9/pull/16838
-Xlp Does not make sense to support. 6 n/a
-Xlp:codecache Does not make sense to support. 6 n/a
-XtlhPrefetch This is the default, and the option to disable this is not externally documented. 6 n/a
-Xcodecache Barely used by service, so does not make sense to support. 6 n/a
-Xcodecachetotal Barely used by service, so does not make sense to support. 6 n/a
-XX:codecachetotal Barely used by service, so does not make sense to support. 6 n/a
-XX:[+\|-]MergeCompilerOptions Only applies for new options. 1 https://github.com/eclipse-openj9/openj9/pull/16716
-XX:[+\|-]RuntimeInstrumentation Revisit this when InstantOn is supported on zLinux 6 n/a
-XX:[+\|-]UseJITServer Run as if this option is set, but suppress the code that tries to establish the connection. 3 https://github.com/eclipse-openj9/openj9/pull/15674
-XX:JITServerAddress Since no connection is established until post restore, this only applies post restore. 1 https://github.com/eclipse-openj9/openj9/pull/16769
-XX:[+\|-]JITServerLocalSyncCompiles Since no connection is established until post restore, this only applies post restore. 1 https://github.com/eclipse-openj9/openj9/pull/16769
-XX:[+\|-]JITServerLogConnections Since no connection is established until post restore, this only applies post restore. 1 https://github.com/eclipse-openj9/openj9/pull/16769
-XX:JITServerPort Since no connection is established until post restore, this only applies post restore. 1 https://github.com/eclipse-openj9/openj9/pull/16769
-XX:JITServerSSLRootCerts Since no connection is established until post restore, this only applies post restore. 1 https://github.com/eclipse-openj9/openj9/pull/16769
-XX:JITServerTimeout Since no connection is established until post restore, this only applies post restore. 1 https://github.com/eclipse-openj9/openj9/pull/16769
-Xjit / -Xaot count options Only apply for new loaded methods; applying this for existing interpreted J9Methods from the checkpoint is a nice to have for GA. 1 https://github.com/eclipse-openj9/openj9/pull/16769
-Xjit / -Xaot exclude & limit options For serviceability, this needs to apply to methods from the checkpoint as well. However, as stated in the Service Requirements section, it is ok that this can’t be completely guaranteed, specifically when a method to be filtered is on the stack at the point the checkpoint occurs. 2 https://github.com/eclipse-openj9/openj9/pull/16716
-Xjit:verbose= / -Xjit:vlog= / -Xjit:rtLog= Open files post restore, closing old ones if needed. 2 https://github.com/eclipse-openj9/openj9/pull/16838
-Xjit:disableAsyncCompilation May be possible if one can generate the pre-prologue to always have space for the helper call; however, this is out of scope for GA 6 https://github.com/eclipse-openj9/openj9/pull/16838
Remaining -Xjit / -Xaot These only apply to new compilations. 1 https://github.com/eclipse-openj9/openj9/pull/16716, https://github.com/eclipse-openj9/openj9/pull/16838
-Xrs & -Xtrace https://github.com/eclipse-openj9/openj9/issues/16839 1 https://github.com/eclipse-openj9/openj9/pull/16852
-Xaggressive A heuristics flag, does not impact pre checkpoint state 1
-Xtune:virtualized A heuristics flag, does not impact pre checkpoint state. 1
-Xshareclasses If changes are allowed, we need to reinitialize relevant infrastructure 1 or 6
-Xjit:enableGPU This is not portable, therefore this only applies post restore. 1
-XX:[+\|-]PerfTool / -Xjit:perfTool Having this set at JVM start will result in disabling code cache reclamation. 1
-Xjit:optLevel= Given that this option does not impact AOT loads, a similar stance can be taken for methods pre checkpoint. 1
Disable Recompilation optLevel=, inhibit recomp, etc. 1

Related Considerations

Implementation

Open Questions

Notes

Relevant PRs

OMR

dsouzai commented 1 year ago

@vijaysun-omr @ymanton @mpirvu @tajila

vijaysun-omr commented 1 year ago

You may want to check with @mpirvu or @klangman that the following is okay with them (maybe the answer is different for the initial GA related code completion date in a month vs longer term than that).

jit / -Xaot count options | Only apply for new loaded methods; applying this for existing interpreted J9Methods from the checkpoint is a nice to have for GA. -- | --
vijaysun-omr commented 1 year ago

For disableAsyncCompilation are you planning to solve using 1 or 6 ? The table lists it could be either.

vijaysun-omr commented 1 year ago

Have you finalized what will be done for -Xshareclasses ? My question is first to see if we can write down what exactly we are planning to do/support clearly, rather than necessarily push for any change in that plan for the initial GA.

dsouzai commented 1 year ago

@tajila what is the -Xshareclasses plan in terms of specifying it post-restore, at least for 0.38? Are we going to support modifying it it any way (e.g., allowing a user to specify -Xshareclasses:none post restore)?

Similar question for -Xint; I think if a user specifies -Xint pre-checkpoint, then it's probably reasonable to not bother trying to configure the JIT post restore, but what about the situation where a user specifies -Xint post-restore?

dsouzai commented 1 year ago

For disableAsyncCompilation are you planning to solve using 1 or 6 ? The table lists it could be either.

Spoke to @vijaysun-omr offline, but for posterity:

The issue comes from the fact that the preprologue shape is different based on sync or async compilation. A lot of code just checks the global state for the type of compilation that generated the body. If post restore, the mode changes, then in order to not have crashes, we at the very least need to have some information per body to inform the patching code.

It may be possible to have the shape of the preprologue always be prepared to switch modes, but this is not something we can achieve for 0.38, so for now we're just going to ignore the option.

tajila commented 1 year ago

Similar question for -Xint; I think if a user specifies -Xint pre-checkpoint, then it's probably reasonable to not bother trying to configure the JIT post restore, but what about the situation where a user specifies -Xint post-restore?

Im not sure we can fully support this in time for GA, but its certainly doable. One approach is to add an option -XintOnRestore that essentially prevents any further JIT transistions, but this would simply be a stop gap until -Xint can be supported. So its not a usefull feature longterm.

@tajila what is the -Xshareclasses plan in terms of specifying it post-restore, at least for 0.38? Are we going to support modifying it it any way (e.g., allowing a user to specify -Xshareclasses:none post restore)?

I feel like this is really an AOT question. From a VM perspective, what comes out of SCC will be identical to loading from jar. so long as things are in sync, which we have checks for.

dsouzai commented 1 year ago

Im not sure we can fully support this in time for GA, but its certainly doable. One approach is to add an option -XintOnRestore that essentially prevents any further JIT transistions, but this would simply be a stop gap until -Xint can be supported. So its not a usefull feature longterm.

Yeah I agree. So I suppose for 0.38 then, if someone specifies -Xint post restore, it just gets ignored

From a VM perspective, what comes out of SCC will be identical to loading from jar. so long as things are in sync, which we have checks for.

Well I suppose what we need to do for AOT depends on what is allowed to change. For example, if the user changes the SCC, I will need to double check but I don't think it affects how compilation occurs because we always access the SCC via the SCC APIs. However, if the user specifies -Xshareclasses:none then we may need to reflect that so that we don't try to use the SCC APIs.

vijaysun-omr commented 1 year ago

Using -Xint and -Xshareclasses:none after restore both fall in the category of someone wanting to work around a problem (with the JIT or SCC respectively). In theory we could get them close to what they would have gotten with -XintOnRestore to prevent interpreter to JIT transitions post restore with -Xjit:exclude={*} and that would mean we don't JIT anything after restore at least. While this exclude option won't prevent interpreter to JIT transitions post restore, at the end of the day -XintOnRestore also won't do a "perfect" job of never running JITed code after restore. For that we will do the "proper" -Xint support after the initial release, including OSR out of JITed code and preventing future transitions to JITed code. Maybe this is the long form argument for "doing nothing" about -Xint in this release, do you agree Irwin and Tobi ? I'll add a different comment about -Xshareclasses:none shortly.

dsouzai commented 1 year ago

Maybe this is the long form argument for "doing nothing" about -Xint in this release, do you agree Irwin and Tobi ?

Yes, given that -Xint has very specific behaviour that we can't achieve post-restore in 0.38, and given that there are ways to minimize the time spent in JIT'd code with things like -Xnojit -Xnoaot (yet to be implemented at the time of this comment) or -Xjit:exclude={*} -Xaot:loadExclude={*} (~implemented at the time of this comment~ not working but I have a fix), having another option like -XintOnRestore as a stopgap would essentially just be an alias of these existing options, and therefore would be unnecessary work both wrt code and documentation.

vijaysun-omr commented 1 year ago

-Xshareclasses:none after restore may be useful in case there is some customer issue with SCC (as has happened anecdotally and this option was provided as a workaround). In terms of loading classes after the restore from jar file as opposed to SCC, this is still a change in logic that would need to be implemented, right ? i.e. someone would need to check after restore that we have switched the SCC off in the VM and decide to go to the jar file for future loads instead. So, in that sense, it isn't zero work ?

From an AOT perspective, I am thinking we will just support this to the extent of a) not loading AOT code and b) not doing any AOT compilations after the restore in 0.38. I believe we won't use SCC apis to do any compiles, unless an AOT compilation has been queued (which is what we need to prevent from happening at the control layer). Again, this is not zero work but maybe this is what you are thinking of supporting, Irwin.

Apart from -Xshareclasses:none there is the question of other -Xshareclasses options. I assume we don't plan on supporting any of those other options for 0.38, but in theory, a) changing the SCC location and b) changing the SCC and/or AOT code size could be the options I can see some questions about in the future. It would be good to get a brief description of our long term position on this, but confirming what we are going to support in 0.38 is the primary goal that I am after.

tajila commented 1 year ago

Maybe this is the long form argument for "doing nothing" about -Xint in this release, do you agree Irwin and Tobi ?

I agree, I think we can leave out -Xint for the initial release.

-Xshareclasses:none after restore may be useful in case there is some customer issue with SCC (as has happened anecdotally and this option was provided as a workaround). In terms of loading classes after the restore from jar file as opposed to SCC, this is still a change in logic that would need to be implemented, right ? i.e. someone would need to check after restore that we have switched the SCC off in the VM and decide to go to the jar file for future loads instead. So, in that sense, it isn't zero work ?

There are two ways of approaching this. The brute force way is to NULL the javaVM->sharedClassConfig, this will result in no ROM classes, AOT, SCC related MXBEAN metrics and SCC API functioning after restore. Its probably as close to -Xshareclasses:none as we will get on restore. The downside is that we will need to update the JVM in all the places that may expect the config to exist if it did on startup. Also if a user had frames on the stack that were using the SCC API, there may be unexpected behaviour.

The second approach is to just not return anything from the SCC. So on restore, during classload the VM would just not query the SCC, the JIT would have to do the same when looking for AOT code. SCC metrics and SCC API would continue to work as normal though. I would argue this is more like -Xshareclasses:disableLoadingOnRestore as it doesn't quite behave the same as -Xshareclasses:none

The second approach seems more feasible, and something we can likely get done within the 0.38 timeframe.

tajila commented 1 year ago

I assume we don't plan on supporting any of those other options for 0.38, but in theory, a) changing the SCC location and b) changing the SCC and/or AOT code size could be the options I can see some questions about in the future. It would be good to get a brief description of our long term position on this, but confirming what we are going to support in 0.38 is the primary goal that I am after.

I think it is possible support some version of -Xshareclasses:none for 0.38.

a) changing the SCC location

After speaking with @hangshao0 he feels this wont be feasible on restore

b) changing the SCC and/or AOT code size could be the options

This seems possible, but not a small task

vijaysun-omr commented 1 year ago

Thanks Tobi, I agree with the "second approach" that you mentioned for -Xshareclasses:none for this (0.38) release since it does feel cleaner.

hangshao0 commented 1 year ago

The second approach is to just not return anything from the SCC. So on restore, during classload the VM would just not query the SCC, the JIT would have to do the same when looking for AOT code. SCC metrics and SCC API would continue to work as normal though. I would argue this is more like -Xshareclasses:disableLoadingOnRestore as it doesn't quite behave the same as -Xshareclasses:none

I assume some new -XX option(s) need to be added to tell the JVM not to load things like classes/AOT from SCC post restore ? I believe we will disable storing these new data into SCC as well.

hangshao0 commented 1 year ago

@ehrenjulzert, can you look at implementing the second approach mentioned here https://github.com/eclipse-openj9/openj9/issues/16714#issuecomment-1452248303 ?

tajila commented 1 year ago

I assume some new -XX option(s) need to be added to tell the JVM not to load things like classes/AOT from SCC post restore ? I believe we will disable storing these new data into SCC as well.

Im in favour of using a new name, since this behaviour is not identical to -Xshareclasses:none thoughts @vijaysun-omr @dsouzai

We have a mechanism for supplying options on restore, so its just a matter of doing something like FIND_ARG_IN_ARGS(vm->checkpointState.restoreArgsList, ...) to find the option restore.

dsouzai commented 1 year ago

I'm ok with using a new name as I've already set up some infra to get the post-restore options, i.e.: https://github.com/eclipse-openj9/openj9/blob/58fb3c25e7c02616411b2bb6e9d18be2af3b519a/runtime/compiler/control/OptionsPostRestore.cpp#L43-L44

That said, in normal startup we don't check the options directly, but rather: https://github.com/eclipse-openj9/openj9/blob/58fb3c25e7c02616411b2bb6e9d18be2af3b519a/runtime/compiler/control/DLLMain.cpp#L504-L525

Is that something that we can do again, or is it that because of the J9Hook mechanism it is possible that the J9HOOK_VM_PREPARING_FOR_RESTORE JIT hook gets called before the SCC hook?

Also, to @hangshao0's point, will there also be a -Xshareclasses:disableStoringOnRestore or do we just want to support something like -Xshareclasses:disableOnRestore?

hangshao0 commented 1 year ago

I prefer a single option that disables both SCC finding/storing on restore, which is closer to the behaviour of -Xshareclasses:none.

tajila commented 1 year ago

Also, to @hangshao0's point, will there also be a -Xshareclasses:disableStoringOnRestore or do we just want to support something like -Xshareclasses:disableOnRestore?

I prefer -Xshareclasses:disableOnRestore which does everything.

hangshao0 commented 1 year ago

What is the expected behaviour of using the new option where CRIU is not supported. If we silently ignore it, then it looks like it should be a -XX option (there are existing SCC -XX options like -XX:+PortableSharedCache, -XX:SharedCacheHardLimit=, etc). Currently for all the -Xshareclasses: sub-options, we exit the JVM with an error message if an unsupported option is found.

Also we don't consume multiple -Xshareclasses: options. The last one wins. All previous ones are ignored. The only exception is -Xshareclases:none, which is never ignored.

hangshao0 commented 1 year ago

Is that something that we can do again, or is it that because of the J9Hook mechanism it is possible that the J9HOOK_VM_PREPARING_FOR_RESTORE JIT hook gets called before the SCC hook?

The struct sharedClassConfig is prepared and set to vm->sharedClassConfig during phase: JIT_INITIALIZED. After this phase, all the fields and exposed APIs in sharedClassConfig can be checked and used by other components. After the new option to disable SCC is parsed, we will turn off the exposed APIs (related to finding/storing operations) to return directly. vm->sharedClassConfig won't be set to NULL and J9SHR_RUNTIMEFLAG_CACHE_INITIALIZATION_COMPLETE won't be removed, But before restore options are consumed, I am not sure if there could be a timing window these APIs can be used. I am not familiar with the restore process.

The VM will parse the new option. We can expose a flag/API about whether the new option is found or not to JIT, if JIT does not parse the option itself.

dsouzai commented 1 year ago

The VM will parse the new option. We can expose a flag/API about whether the new option is found or not to JIT, if JIT does not parse the option itself.

Yeah I guess it depends on the order of operations during restore; if the JIT hook gets invoked after the flag has been set then we don't have to parse it, otherwise we will. I'm ok either way, but it's worth knowing for sure.

Also on a related note, I don't think (for 0.38 anyway) that I can guarantee that some SCC API won't get invoked by a non-compilation compiler thread. @hangshao0, what are the consequences of something like that? For example, if -Xshareclasses:disableOnRestore is specified, post restore but some thread calls vm->sharedClassConfig->findSharedData.

hangshao0 commented 1 year ago

what are the consequences of something like that? For example, if -Xshareclasses:disableOnRestore is specified, post restore but some thread calls vm->sharedClassConfig->findShared

It will behave as if the data is not in the SCC. findShared() returns 0 (0 data element found in the SCC) and nothing will be returned via J9SharedDataDescriptor.

hangshao0 commented 1 year ago

Yeah I guess it depends on the order of operations during restore; if the JIT hook gets invoked after the flag has been set then we don't have to parse it, otherwise we will. I'm ok either way, but it's worth knowing for sure.

Talked to @JasonFengJ9, the restore options is parsed here: https://github.com/eclipse-openj9/openj9/blob/5a4ab53779ef5aa6e623d0ef4f34bcec4530683e/runtime/criusupport/criusupport.cpp#L696-L707

which happens before jvmRestoreHooks() that triggers J9HOOK_VM_PREPARING_FOR_RESTORE

https://github.com/eclipse-openj9/openj9/blob/5a4ab53779ef5aa6e623d0ef4f34bcec4530683e/runtime/criusupport/criusupport.cpp#L728

tajila commented 1 year ago

Also we don't consume multiple -Xshareclasses: options. The last one wins. All previous ones are ignored. The only exception is -Xshareclases:none, which is never ignored.

if -Xshareclasses:disableOnRestore is specified on startup, the JVM should fail to start with:

JVMJ9VM007E Command-line option unrecognised: -Xshareclasses:disableOnRestore
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

This option should be limited to restore

dsouzai commented 1 year ago

@hangshao0 To check if the SCC is disabled post-restore, do I just check if J9SHR_RUNTIMEFLAG_CACHE_INITIALIZATION_COMPLETE is not set in vm->sharedClassConfig->runtimeFlags? Or do you recommend I just search for the -Xshareclasses:disableOnRestore option?

hangshao0 commented 1 year ago

do I just check if J9SHR_RUNTIMEFLAG_CACHE_INITIALIZATION_COMPLETE is not set in vm->sharedClassConfig->runtimeFlags? Or do you recommend I just search for the -Xshareclasses:disableOnRestore option?

J9SHR_RUNTIMEFLAG_CACHE_INITIALIZATION_COMPLETE will always be set, so checking this flag won't work. @ehrenjulzert is adding an variable to indicate if -Xshareclasses:disableOnRestore presents, you can check that.

dsouzai commented 1 year ago

Hm, well because we need to get the code in by Friday and the JIT changes will have to wait until the SCC changes get merged in, I'll just explicitly search the option for now, and add an item in https://github.com/eclipse-openj9/openj9/issues/16853 to use the variable you're mentioning.

0xdaryl commented 1 year ago

@dsouzai : this is targeted to 0.40. Is there outstanding work to be delivered in the next few days or should this move out?

dsouzai commented 1 year ago

No we should move this out; nothing on this front is going to get delivered to 0.40.

0xdaryl commented 1 year ago

OK, moving to 0.41.