Reduce HCR impact on the effectiveness of escape analysis

vijaysun-omr commented 3 years ago

HCR (Hot Code Replace) is enabled by default in OpenJ9 and makes the escape analysis optimization in the JIT compiler more conservative. The affected part of escape analysis is primarily the method peeking logic, that is essentially disabled at levels below the top level call. This is related to two things

1) Peeking of methods is fundamentally limited in a run where HCR is enabled because the result of peeking cannot be used by the analysis without some means of compensating for the case when the peeked method gets changed due to a class/method redefinition in the future 2) The ability to compensate for peeked methods changing is limited to the top level call Ii.e. a call that is in the IL trees when escape analysis is run) since that is the only call for which we are generating code in the compiled method. The IL for peeked methods at lower levels of the call graph are only examined during the course of the method peeking and are discarded after the method peek is over (thereby also losing any ability to compensate for any changes at those lower levels due to redefinition). At the top level call, a change was made last year that essentially guards a peeked method with an HCR guard and if we redefined the method being called, then the guard would be patched to take a slow path that performs a heapification before calling the redefined method (following a similar scheme to what we do for guarded inlined calls (on a cold path) and cold block heapification in escape analysis already).

Given this inability to peek methods at lower levels of the call graph when HCR is enabled, escape analysis is still not at anywhere close to full effectiveness by default (since HCR is enabled by default). This issue aims to address this and offer a way for escape analysis to become as effective as before HCR was enabled by default in the expected common case when no classes are actually being redefined by the user.

The proposal is to allow escape analysis to peek methods at levels lower in the call graph, and define a scheme whereby we track which compiled methods have had objects stack allocated based on method peeked at lower levels of the call graph.

In the unlikely event that any class gets redefined (we can optionally also track the list of classes that are relevant to escape analysis' peeking if we wanted to be more accurate), the VM would need to do a few things at the class redefinition event (before "activating" the redefined class)

Invalidate all the compiled bodies that relied on peeking to stack allocate objects
OSR out of all such compiled bodies and transition to the interpreter
Heapify all stack allocated objects dynamically by allocating new heap objects and copying over the contents from the corresponding stack allocated objects. Update all registers and stack slots that pointed at a stack allocated object to instead point at the corresponding heap allocated object.

This kind of a scheme would allow escape analysis to become more aggressive in the presence of HCR and still maintain functional correctness.

@tajila and @0xdaryl fyi

vijaysun-omr commented 3 years ago

@Swapnilr1 your work in the dynamic heapification area could be of use here.

Swapnilr1 commented 3 years ago

@vijaysun-omr @0xdaryl I have created the pull request #13326 - could you or someone from the VM team have a look and help me with the questions I have (added as comments in the PR itself) ? Thanks.

gacholio commented 3 years ago

@vijaysun-omr Quite some time ago, you asserted that EA/stack allocation was essentially worthless outside of synthetic benchmarks. Has this changed?

vijaysun-omr commented 3 years ago

I think that the limitations imposed due to HCR were affecting what we stack allocated quite significantly in some evaluations the CAS team were trying to do with the DaCapo benchmark suite, i.e. my memory was that -XX:-EnableHCR was stack allocating several times more objects compared with the default when HCR was enabled. @Swapnilr1 and/or @manasthakur do you have that data ?

vijaysun-omr commented 3 years ago

I saw some of the data from a few months back for HCR enabled vs HCR disabled and we stack allocated approximately 9-10X more in most runs with HCR disabled vs when HCR was enabled. This was true both from a static (compile time) count as well as a dynamic (run time) count of stack allocations done in the two modes.

However, this result was only on one of the Dacapo benchmarks and run with count=0 to provide some degree of stability in the results from run to run.

I don't expect the basic conclusion to change though if we did runs with other applications, i.e. that there is significantly more stack allocation presently with HCR disabled than with it enabled (regardless of the magnitude in question, be it 3X, 5X or 10X).

gacholio commented 3 years ago

First off, we'll need a new MMD query to see if a compiled method needs to be force OSRed (for now, I'll just pass the metadata - in the future we can get fancier).

Note that OSR does not occur during the HCR - the frame is marked for decompilation which takes place when the frame is returned to. The heapification will need to occur here as we can't allocate heap objects during HCR (exclusive VM access is held).

For consistency, this should really be the responsibility of the OSR block - if the allocation fails, the OOM needs to be thrown before the stack changes. Consider this:

public void whatever() {
   Thing t = new Thing();                <- stack allocated
   try {
      ...code...                         <- HCR occurs here
   } catch (OutOfMemoryError e) {
      use t                              <- t is invalid in the interpreter
   }

If we use dynamic heapfication, the OSR block may not need to handle these cases, though this is a bit different than the intended use (as I understand it).

gacholio commented 3 years ago

@vijaysun-omr Does OSR do any heapification today, or are stack allocated objects only allowed to exist between OSR points?

vijaysun-omr commented 3 years ago

No heapification is done by OSR code today.

gacholio commented 3 years ago

Then I think dynamic is the only practical way to go (which will introduce some overhead in the interpreter, but probably not much). In the case above, each use of t will attempt the heapficiation until one succeeds or the OOM propagates out.

If this proves problematic, we may be able to use some trickery - acquire exclusive, mark all mutator threads as "halted for heapification", release exclusive and then perform the heapification of the appropriate frames. I'll think some more on this.

Note that the current dynamic implementation does not allow SA->SA references.

vijaysun-omr commented 3 years ago

Yes I think the plan was to look at SA->SA heapification later (since this would require nested object heapification support). It is possible that @Swapnilr1 has this implemented but plans to pull request it later, or that some other student who will continue the work that he started under @manasthakur will add support for that kind of heapification as part of the project around more optimistic stack allocation.

With the existing escape analysis implementation I doubt that disallowing the SA->SA case (due to HCR effects) will have a significant impact on the amount of stack allocation that occurs.

gacholio commented 3 years ago

As briefly discussed today, this approach will not be able to support scalarized objects (object fields simply assigned to locals without obeying the object layout) as there is no sensible way to reconstruct the object. This should only apply to methods which did the peeking.

I assume that if the SA object is passed to another method, it cannot be scalarized at all.

Another problem I see is this:

void a(SA o) { <--- this method does peek
   ...HCR decompiles this meaning o will be eventually dynamically heapified...
}
void b() { <--- this method does not peek
   SA o = new SA(); <--- stack allocated
   ...assign fields of o...
   a(o);
}

o will need to be contiguous in order to be passed as an arg, but presumably the JIT would like to access the fields of the SA object directly, not via the object pointer. This fails when o is heapfied - the code will replace the stack reference to the object with the heap version, at which point if the code in b accesses the stack fields directly, it's incorrect.

I think this means that every frame that has a stack allocated object will need to be decompiled when the HCR takes place to prepare for the heapification. Maybe this can be improved by expanding the new "must be force OSRed" flag to include methods which stack allocate and pass that object as a method parameter.

gacholio commented 3 years ago

Is the case above already considered "peeking"?

vijaysun-omr commented 3 years ago

As briefly discussed today, this approach will not be able to support scalarized objects (object fields simply assigned to locals without obeying the object layout) as there is no sensible way to reconstruct the object. This should only apply to methods which did the peeking.

I think this is fine. The escape analysis infrastructure allows us to track which objects were stack allocated reliant on peeking some call, and disallow those objects from being considered for scalarization (i.e. discontiguous stack allocation).

I assume that if the SA object is passed to another method, it cannot be scalarized at all.

Correct.

Another problem I see is this:
void a(SA o) { <--- this method does peek
   ...HCR decompiles this meaning o will be eventually dynamically heapified...
}
void b() { <--- this method does not peek
   SA o = new SA(); <--- stack allocated
   ...assign fields of o...
   a(o);
}
o will need to be contiguous in order to be passed as an arg, but presumably the JIT would like to access the fields of the SA object directly, not via the object pointer. This fails when o is heapfied - the code will replace the stack reference to the object with the heap version, at which point if the code in b accesses the stack fields directly, it's incorrect.

I believe this was a problem that @Swapnilr1 encountered in his implementation and we may have disabled the code in the JIT to access the stack allocated object fields directly using the stack pointer in those cases. It would be good for @Swapnilr1 or @0xdaryl to comment on this since I believe this may have been done in the code generator and may be code in the JIT that needs to be pull requested in order to make the dynamic heapification support more robust towards the problem Graham mentioned. The end result was that the stack allocated object reference was evaluated into a register and that register was used to do the field accesses (as opposed to using the stack pointer directly) making it possible for the dynamic heapification to work for those accesses too. I don't believe it matters which real register gets used to access fields in stack allocated objects from a performance viewpoint, i.e. it was just a minor optimization anyway that we can give up on.

vijaysun-omr commented 3 years ago

I'll define what we mean on the JIT side when we refer to "peeking". Peeking refers to the act of looking into the bytecodes of some call that was still present (i.e. not inlined) in the compiled method in order to learn something relevant for the analysis in question, e.g. in the case of escape analysis it involves tracking what happens to the argument values passed in.

Peeking is done recursively, i.e. we can peek methods that get called from a method that we are peeking already up to a maximum call depth and total peeked bytecode size threshold that is defined in the analysis doing the peeking, as an effort to keep compile time under control in case we are descending down a large call graph of methods. We are also more imprecise in tracking how arguments are used in peeked methods (as opposed to the compiled method itself) since we do not have the compile time budget to build use def and value numbering information for those peeked methods. So while we can track our way through some basic uses of the arguments, it is not that hard to defeat the peeking logic in which case it will give up and be conservative, i.e. not stack allocate the object that was passed as an argument.

vijaysun-omr commented 3 years ago

I believe there is also the effect that the OSR helper call themselves have on escape analysis' effectiveness since they have all the locals/parms used by the method as arguments in the IL trees. There are many "voluntary" OSR transition points chosen by the JIT's analyses for inserting OSR transition points to account for the possibility of HCR happening, and these points all appear to be escapes from a stack allocation perspective because we do not want to enter the interpreter with any stack allocated objects (in the case when HCR occurred and required us to activate some OSR transition point in the method). Dynamic heapification would solve this problem since the objects would already have been heapified before we activated the OSR transition point.

gacholio commented 3 years ago

OK, I assumed you would need to look at the called method to determine if the SA object could be safely passed as an argument - this restriction goes away with dynamic heapification.

With this strategy, every GC point (other than allocations) in every compiled method that peeks must be an OSR point, since we can't control when HCR occurs. Allocation is excluded due to the OSR safe point work we did recently.

vijaysun-omr commented 3 years ago

OK, I assumed you would need to look at the called method to determine if the SA object could be safely passed as an argument - this restriction goes away with dynamic heapification.

There are three separate cases in which dynamic heapification can help with wrt peeking limitations.

We peeked a method but it changed at run time due to HCR

We reached some threshold that did not allow us to complete peeking all the bytecodes in the call graph

We peeked the method implementations that we saw at compile time but a new implementation was loaded due to dynamic class loading afterwards

1 depends on HCR and would be handled by the feature we are discussing in this issue.

2 and 3 do not have any direct relationship to HCR and so the functionality of dynamic heapification triggered when an HCR occurs will not help improve matters in those cases (we are conservative today in those cases too by default). For overcoming 2 and 3 you need a mechanism where you trigger dynamic heapification when an object escapes via a write barrier and this is what @Swapnilr1 research project was about. With that work, we would be able to optimistically stack allocate in more cases (significantly more we hope) than was even the case before HCR was enabled by default.

gacholio commented 3 years ago

Assuming that any compilation that allows an SA object to escape via argument passing will maintain a strong link to that object (i.e. SA object pointer stored in a live object local in the caller), I don't think we need any special tagging of those frames.

When dynamic heapification occurs for a given SA object, the stack will be walked to:

update all live references to the SA object with the heap object (all frames)
force OSR any JIT frame that was updated in the above test

We will only need to tag metadata with the "I peeked" flag. During HCR, force OSR any running frames with the metadata tag.

Does this sounds right?

vijaysun-omr commented 2 years ago

I agree that we would need to force OSR any JIT frame that had stack allocated an object in its frame with the "I peeked" flag.

I am wondering if we do in fact need to additionally force OSR any JIT frame that was updated in the dynamic heapification process, i.e. it is not immediately apparent what we are safeguarding against by forcing OSR of callees where the object was passed as an argument. This is because a callee method cannot unconditionally have taken advantage of the fact that an argument was stack allocated in the caller (i.e. it has no visibility into the caller(s)).

On a separate point, can you also spell out the sequence of steps that we will follow wrt an impending HCR for a class, any force OSRs and when they will execute, any dynamic heapifications and when they will execute all wrt the impending HCR ? i.e. I wanted us to be completely clear in that regard too.

gacholio commented 2 years ago

For now (since we don't have fine-grained peeking data), I would force the OSRs before doing the HCR to better be able to handle out of memory situations. This works because OSRing a frame is never "incorrect" - it's just a waste of performance if the HCR fails.

So, I would:

acquire exclusive VM access
walk all stacks in all threads marking any peeking frames for OSR
perform the normal HCR work
release exclusive VM access

No heapification will be done here - it will all occur dynamically as the SA objects are detected in the interpreter. Note that the actual OSR/decompile occurs when a marked frame is re-entered (returned to or exception catch).

If we assume that all SA objects are accessed via a base object pointer, I don't believe we will need to OSR for heapification - replacing the object pointers in the affected frames should be sufficient.

I think there may be an issue with the way SA objects are described in the metadata. It looks like there's a bitmap describing which object slots point to SA objects rather than heap ones, which will not work if we replace SA with heap. Perhaps we can do away with the bitfield and determine SA some other way (header bits or stack range check).

gacholio commented 2 years ago

When replacing SA objects from dynamic heapification, we will need to be careful to also update any OSR buffers that contain the SA object as well. I believe this will occur naturally using the O-slot walk in the stack walker, but will need to make sure.

To be clear, the OSR block runs and fills in the buffer at the point where the decompilation tag is added, not when the decompile actually occurs, which is why we may have OSR buffers hanging off stack frames that need to be updated.

gacholio commented 2 years ago

Is peeking only done for stack allocation? Seems to me a compilation that's made any assumptions about methods not in its inline map needs to be decompiled on HCR, regardless of whether it did any stack allocation.

gacholio commented 2 years ago

I think there may be an issue with the way SA objects are described in the metadata. It looks like there's a bitmap describing which object slots point to SA objects rather than heap ones, which will not work if we replace SA with heap. Perhaps we can do away with the bitfield and determine SA some other way (header bits or stack range check).

Actually, I don't think this will be an issue, assuming that SA objects are exactly the same shape as heap objects (which is a hard requirement for allowing them to escape). Bascially, the stack walker will report each O slot in the object to the GC, rather than reporting the object itself. This is possibly trivially less performant, but will be correct.

vijaysun-omr commented 2 years ago

There is a separate "stackAllocMap" in the GC stack atlas that should be describing the stack slots that mark the "beginning" of stack allocated objects (i.e. where the header class pointer slot is). These are not slots corresponding to variables pointing at stack allocated objects, but the stack allocated objects themselves.

As far as I know, there is no notion of stack slots pointing at stack allocated objects being marked out in any way in the GC stack maps, i.e. we have a reference slot and typically we don't know (and do not try to find out) if it is pointing to a stack allocated object or a heap allocated one. In either case we would just mark it as a collected slot and it works since the VM stack walker/GC are able to deal with it.

vijaysun-omr commented 2 years ago

Is peeking only done for stack allocation? Seems to me a compilation that's made any assumptions about methods not in its inline map needs to be decompiled on HCR, regardless of whether it did any stack allocation.

The routine that does the IL gen for peeking in the JIT is designed to report a failure when HCR is enabled. This in turn makes all optimizations that want to peek get conservative since they detect that there was some problem in generating IL that was essential to some analysis that they might otherwise have done.

The dynamic heapification addresses a specific issue that would be caused by a specific optimization, i.e. escape analysis would have done in the presence of HCR. So, we would only be changing the "fail IL gen when HCR is enabled" behavior from escape analysis once the work you are doing for this feature is completed.

gacholio commented 2 years ago

There is a separate "stackAllocMap" in the GC stack atlas that should be describing the stack slots that mark the "beginning" of stack allocated objects (i.e. where the header class pointer slot is). These are not slots corresponding to variables pointing at stack allocated objects, but the stack allocated objects themselves.

If we are adding the restriction that all SA objects must have a slot in the stack frame pointing to them, I think we can get rid of stackAllocMap completely (at the very least you'll have to set it to all 0s for this feature).

vijaysun-omr commented 2 years ago

I don't know that we are adding such a restriction as part of the work underway for the heapification feature specifically, i.e. we are not affecting what other stack slots or registers point at the stack allocated object I believe.

You may be right that we don't need to scan the stackAllocMap but this may be a statement that is true independent of the heapification feature. It depends on what all you are using the stackAllocMap for on the stack walker/VM side. It was added in when we did compressed references many years ago. If you remember from before that was done, the stack slots for the reference type fields of stack allocated objects were just described in our stack maps as any other O-slot. This needed to change at that time since the reference type fields of stack allocated objects were in compressed form (like any reference type fields of heap allocated objects) whereas reference type locals/parms appeared in uncompressed form.

I'd prefer to avoid tinkering with stackAllocMap as part of this heapification feature (just to reduce the moving parts and risk) but I can always be convinced by @gacholio to do almost anything :)

gacholio commented 2 years ago

Scanning the stackAllocMap after heapification will keep objects alive that are potentially dead (the stack slots are no long relevant once we've replaced the object pointer). This is non-fatal but could lead to mysterious leaks that might be difficult to track down.

The only way heapfication can work is if all SA objects are referred to by object pointer - we won't be heapifying objects during the decompile, so the only way the interpreted frame can see the object is via pointer.

vijaysun-omr commented 2 years ago

Scanning the stackAllocMap after heapification will keep objects alive that are potentially dead (the stack slots are no long relevant once we've replaced the object pointer). This is non-fatal but could lead to mysterious leaks that might be difficult to track down.

Is one option to null out any non-null reference fields that point to other objects in the stack allocated object after heapification is done and all referring slots have been updated to point at the heapified object ?

gacholio commented 2 years ago

Scanning the stackAllocMap after heapification will keep objects alive that are potentially dead (the stack slots are no long relevant once we've replaced the object pointer). This is non-fatal but could lead to mysterious leaks that might be difficult to track down.

Is one option to null out any non-null reference fields that point to other objects in the stack allocated object after heapification is done and all referring slots have been updated to point at the heapified object ?

Yes, we could do that, though it seems like unnecessary work (I would just memset the whole SA region to 0). The stackMapTable will be completely pointless when SA objects are referred to by pointer, which I believe is absolutely required for this feature.

gacholio commented 2 years ago

@vijaysun-omr Working on the detailed overview for the work.

Given that SA objects can point to other SA objects, does this only occur when all the SA objects in question are allocated in a single compiled body (or a caller) so you have control over the lifetime of the objects? For example:

class Pair { Thing first; Thing second; }

void m(Pair p) {
   Thing t = new Thing();
   p.first = t;
}

Does the assignment to field of p (whose provenance is unknown) prevent t from being stack allocated?

vijaysun-omr commented 2 years ago

"Yes" to both your questions.

gacholio commented 2 years ago

I’ve marked a few important items with Question and Note below.

Overview

The invariant that must be maintained here is that SA objects must never escape the thread in which they are allocated.

SA objects may point to other SA objects - SA object o allocated in method m may point to any SA object allocated by m or any caller of m (transitively).

SA object pointers are ignored by the GC - the way SA objects are walked is with some metadata called stackAllocMap, which describes the starting stack slot for SA objects. The stack walker manually walks the SA object and presents each object field to the GC individually.

HCR

The metadata will be enhanced with a flag/query indicating that the body must be invalidated and any frames running it decompiled if an HCR occurs (this is the simplest solution - it could later be enhanced to be more granular based on what’s getting redefined).

HCR will be modified so that after acquiring exclusive VM access all java stacks are walked and each frame queried. If the metadata says so, the JIT will be called to invalidate the body and a decompilation will be added to the frame. Note that the decompilation requires native memory, so there’s a possibility of graceful failure at this stage.

Question: Will it be easy for the JIT to detect bodies that are already invalidated? This would be preferable to the HCR code uniquifying the bodies before calling the JIT.

It’s convenient to do the invalidation first, since the operations do not affect correctness, so if the HCR later fails there’s no need to try and undo things. If the invalidation were to be done after the HCR, there’s a possibility it could fail, which would leave the VM in inconsistent state (reverting an otherwise successful HCR operation would be complex, error-prone work).

Two things to note here:

The OSR block is executed when the decompilation is created, but the stack rebuild does not occur until the affected frame is returned to by normal or exceptional means.
No heapification occurs during any part of the decompilation (it’s done while executing in the interpreter)

When to heapfify in the interpreter

SA objects can appear in interpreter frames under two conditions:

compiled method calls interpreter with SA object as an argument
compiled method with SA object is decompiled

So, the interpreter can not know implicitly whether an object on the java stack is SA or not.

There are two approaches to heapification:

heapify when SA object is read from the stack

This would eagerly heapify any object read from the stack, even if it’s just used to fetch a field or be passed to another method. It also means checking for SA on every object stack read, which is a great many places.

heapify when SA object is stored outside of the current thread

This would be my preferred solution, as there are far fewer escape opportunities than there are reads from the stack.

Assuming we go with the lazy solution, the following escape opportunities need to be guarded:

monitor enter on object
object stored into an instance field
object stored into a static field
object wrapped in a JNI reference
object stored in any other VM structure that's not strictly thread-local
object passed to any C API expecting j9object_t
others I may have missed

The places that create JNI refs are:

object argument passed to JNI natives
JVMTI stack queries
probably some INL natives
no doubt forgotten some

The heapification process

This is independent from the HCR work (it’s a prerequisite).

Note: It will now be possible for SA object pointers to appear in interpreted frames. It would be undesirable to decompile the frame which allocated the SA object when heapification occurs, so this adds a restriction to the compiled code: SA objects must only be referenced via an object pointer, never directly via the stack slots of the SA object. This is because heapification may replace the SA object pointer with a heap object pointer, so the compiled code must be tolerant of both, which means indirect access only.

In simplest terms, heapifying an SA object means:

allocate a new object on the heap
copy the contents of the SA object to the heap object
null the contents of the SA object
replace all references to the SA object in the current thread stack with the heap reference

Note that the only place an SA object can appear is on the stack.

Question: Can SA objects be synchronized upon? This involves extra complexity when heapifying (updating monitor table, computing the entry count, etc). I would prefer the first pass disallow stack allocation for any object that may be locked.

Because SA objects can point to other SA objects, the heapification process needs to do a "deep copy", heapifying along the way. Rougly:

j9object_t heapify(j9object_t obj)
{
   if (obj is not stack allocated) {
      return obj;
   }
   ht = new hashTable to map SA->heap;
   if (NULL == ht) {
      throw OOM;
   }
   obj = heapifyInternal(obj, ht);
   if (NULL == obj) {
      free ht;
      thow OOM;
   }
   /* No failure possible beyond this point */
   walk current thread stack replacing SA object pointers using the hash table;
   NULL all of the SA objects in the hash table;
   free ht;
   return obj;
}

j9object_t heapifyInternal(j9object_t sa, HashTable ht)
{
   if (ht contains sa) {
      return matching heap obj;
   }
   // Need to save/restore all existing mapped objects here
   // Also note stack could grow if heap events hooked - maybe disable SA if that happens
   heapObj = new heap object;
   if (failed to alloc) {
      return NULL;
   }
   add sa->heapObj mapping to ht;
   if (add failed) {
      return NULL;
   }
   clone sa into heapObj;   <— could be combined with fixup below
   for (each object field in heapObj) {
      field = heapifyInternal(field, ht);
      if (NULL == field) {
         return NULL;
      }
   }
   return heapObj;
}

gacholio commented 2 years ago

We could probably lift the restriction of no direct stack access for SA if we want to decompile the frame that allocated the slots.

vijaysun-omr commented 2 years ago

Thanks for the detailed description @gacholio

I think "SA objects may point to other SA objects - SA object o allocated in method m may point to any SA object allocated by m or any caller of m (transitively)." is really more accurately stated as "SA objects may point to other SA objects - SA object o allocated in method m may point to any SA object allocated by m or any callee of m (transitively) that gets inlined into m." In general, beyond the scope of a single compiled method (body), there is no attempt made to have one SA object point at another, i.e. there is no connection/awareness between a stack allocation that may have occurred in a caller and what stack allocation may be done in the compiled method that is the callee.

We may need to discuss "The OSR block is executed when the decompilation is created".

What I was picturing was the following way in which this would be handled: 1) During the compilation of method foo escape analysis peeks inside a method call to bar (that itself may call other methods, e.g. goo) and relies on the result of this peeking to stack allocate some object O. In the course of peeking into bar, escape analysis also keeps track of all the other callees it is transitively peeking, e.g. goo 2) Since it stack allocated O based on the peeking into the call to bar (and it knows which methods it has peeked transitively), escape analysis adds a NOPed guard right after the call to bar with the assumptions underlying the NOPed guard being that : if any method that was involved in the transitive peek, e.g. goo gets redefined then the NOPed guard gets patched at run time and the back up (slow) path is executed that has a call to induce OSR, i.e. call the relevant OSR blocks.

That is, we may not need a) to plan for an involuntary OSR to be done b) regardless of where we are in compiled method foo that is on the stack. i.e. in the above picture, we do a voluntary OSR (induced only if something relevant changed) and this only happens after we return from the top level peeked call in foo, i.e. bar in the above example.

The reason for why I was thinking of it this way: adding more OSR points adds more overhead in the JIT, both with respect to compile time and with respect to the optimization challenge that is posed by the bookkeeping to manage additional OSR points, i.e. the generated code is often worse even on the fast (not doing OSR transition) code path. The places where OSR gets used, the tradeoff is that the expected benefit of doing the optimization relying on OSR (e.g. static final field folding OR just allowing a user to change classes dynamically) outweighs the cost of all the OSR bookkeeping at compile/run time. So. by restricting the OSR induction to precisely those top level method calls (e.g. bar) that were peeked, we increase the likelihood of the stack allocation being a net win from a performance sense and I believe (I think we should try and make sure of this) there are no functional issues with doing so.

gacholio commented 2 years ago

is really more accurately stated as "SA objects may point to other SA objects - SA object o allocated in method m may point to any SA object allocated by m or any callee of m (transitively) that gets inlined into m."

I meant the opposite - it's OK for an SA object to point to an SA object that was allocated further down the stack (callers of m), but not further up the stack (callees of m). With inlining, you can blur this distinction a bit, but still need to ensure that no object can point to another object that goes out of scope first.

gacholio commented 2 years ago

We may need to discuss "The OSR block is executed when the decompilation is created".

This has always been true, and should not matter - the whole point of the decompilation is that the compiled code is not allowed to run again, so it shouldn't matter whether the OSR is performed at the time of request (on the requesting thread) or the time of the decompilation (on the target thread).

I have no particular objection to moving the OSR, but I'm not sure what the justification would be.

vijaysun-omr commented 2 years ago

I was mentioning the reason for allowing the JITed code to do a voluntary OSR (in my prior comment) after the peeked call. Does that sound okay to you ?

gacholio commented 2 years ago

@vijaysun-omr If you believe the voluntary OSR solution will perform better, then that sounds like a lot less work for the VM (most of the work will be the heapification portion). Am I correct that the VM will essentially need to do nothing new for HCR? We already report the set of changed classes/methods to the JIT, which would seem to be the place to do the invalidation of the guards. Which methods are currently running on a stack is no longer relevant as you will need to invalidate all guards associated with the changes made by HCR.

Do the guards not create optimization barriers that would not appear if the OSR points were expanded?

vijaysun-omr commented 2 years ago

@vijaysun-omr If you believe the voluntary OSR solution will perform better, then that sounds like a lot less work for the VM (most of the work will be the heapification portion). Am I correct that the VM will essentially need to do nothing new for HCR? We already report the set of changed classes/methods to the JIT, which would seem to be the place to do the invalidation of the guards. Which methods are currently running on a stack is no longer relevant as you will need to invalidate all guards associated with the changes made by HCR.

When an HCR occurs, you would need to do one additional task that is not being done today, namely walk all the thread stacks and heapify stack allocated objects allocated in frames marked with the "I peeked and stack allocated an object" flag in the MMD. Of course these heapifications could require updating of stack slots pointing at this stack allocated object from other frames on the same thread since no stack slot in any frame can point to a given stack allocated object once it has been heapified. Otherwise, I think I agree that you do not need to do anything over and above what you do currently since we can handle the invalidation of the relevant compiled bodies (for future invocations) as well as the invalidation of the relevant guards to induce OSR from bodies that are affected by the HCR in question (for invocations in progress) in the hooks called when HCR occurs.

Do the guards not create optimization barriers that would not appear if the OSR points were expanded?

If you mean "if the OSR points were expanded to the set where involuntary OSR could occur" when you say "if the OSR points were expanded", then my answer is that while the guards would be optimization barriers in a sense, this is still better than the alternative since 1) the set of points where involuntary OSR may occur is a large set, certainly larger than the set of points where we would choose to do voluntary (induced) OSR, and 2) those involuntary OSR points get represented via exception edges to OSR catch blocks that are just as much a barrier to optimization (if not more so) as the explicit control flow to induce OSR is.

gacholio commented 2 years ago

Heapification occurs in the interpreter after decompilation, not during HCR itself.

vijaysun-omr commented 2 years ago

I don't think heapification in the interpreter would work in the following example (in a solution where we also don't want involuntary OSR) :

void foo() {

A obj = new A(); // this object gets stack allocated based on peeking of bar
...
bar(obj);
...
other code that uses obj;
...
}

void bar(A a) {
   ...
   goo1(); // causes HCR of method goo2
   goo2(a);
   ...
}

void goo2(Object a) {} // Original implementation that was safe (did not cause the argument to escape) when we peeked at compile time
void goo2(Object a) { some_static_field = a; } // New implementation after HCR that is unsafe (causes the argument to escape)

If we did not heapify before we allowed the new goo2 to execute then some_static_field would point at a stack allocated object. Nothing that is done in the interpreter alone is sufficient to safeguard against this since goo2 may get compiled at some point.

vijaysun-omr commented 2 years ago

In the solution I was thinking, the object would get heapified before goo2 implementation is replaced. This would also be the point where foo is invalidated and a guard that we would add after the call to bar is patched leading to an OSR being induced when we get back to foo after we return from bar. The patching of the guard makes it possible to optimize the uses of obj in foo after the call to bar since we know that we will OSR out of the compiled body for foo immediately after the call to bar returns if there was HCR in the meantime.

gacholio commented 2 years ago

Heapifying beforehand will involve halting all of the threads in a new exclusive VM access section due to the fact that object allocation can not take place under exclusive.

I think we'll need to do something like:

allocate object map
acquire exclusive
for all threads
   mark thread halted for HCR
   for all frames
      if method peeked, add all SA objects in frame to object map
release exclusive
heapify all objects in map using previously described deep copy algorithm
acquire exclusive
for all threads
   unmark thread halted for HCR
   for all frames
      replace any SA object pointers in object map with heap copy
free object map
existing HCR operations
release exclusive

Is this what you had in mind?

In the above code, I'm adding all SA objects discovered in an O slot in a peeking frame. I think it would be easy enough to narrow that check so that only objects allocated in the frame itself are noted.

vijaysun-omr commented 2 years ago

Yes that is what I was thinking where existing HCR operations gets augmented on the JIT side to do those invalidations I mentioned in my prior comment. But your flow above captured what I had in mind from the VM perspective.

Attn : @hzongaro and @0xdaryl Please think through this design and the examples to make sure we are not missing anything (since we appear to be converging in this discussion).

vijaysun-omr commented 2 years ago

In the above code, I'm adding all SA objects discovered in an O slot in a peeking frame. I think it would be easy enough to narrow that check so that only objects allocated in the frame itself are noted.

Yes, I think it may be preferable to narrow that check so that only objects allocated in the frame itself are noted.

In theory it is also possible to distinguish between different stack allocated objects if they were stack allocated based on peeking different calls, i.e. redefining a particular peeked method need not result in heapification of all objects in a given frame that did the peeking. But we do not need to support this to begin with, i.e. keep it simple and only add that complexity if we feel we need it (may never happen).

gacholio commented 2 years ago

Doing the fancier version would be significantly more complex and would provide no benefit. This work will allow more stack allocation when HCR is a possibility - if HCR is used, there's no real savings for not heapifying objects.

gacholio commented 2 years ago

If there are no worries about this design, I'll start working on the VM side.

@vijaysun-omr To test my work, I will need the JIT to obey the assertion that SA objects are only accessed via pointer with no direct access to the individual object slots. Can you either assert that this is already true, or give me a patch to make it so?

vijaysun-omr commented 2 years ago

I believe @0xdaryl had a patch at one point to prevent direct access to individual slots within the object. So he will try to locate that.

@hzongaro and @0xdaryl may not have internalized the discussion but I expect they will, likely by later this week. I doubt we would change course dramatically though, so I personally feel you can start on the VM side but keep an eye on the design for comments from them this week.

gacholio commented 2 years ago

Work is taking place here: https://github.com/gacholio/openj9/tree/sa

eclipse-openj9 / openj9

Reduce HCR impact on the effectiveness of escape analysis #13192