dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.85k stars 4.62k forks source link

[arm/arm64] Leaf frames, saving LR, and return address hijacking #35274

Open BruceForstall opened 4 years ago

BruceForstall commented 4 years ago

During the original development of the arm32 product, it was decided that the lr register would always be stored to the stack in the prolog to support return address hijacking for GC suspension. There is this comment in the JIT (CodeGen::genPushCalleeSavedRegisters()):

// It may be possible to skip pushing/popping lr for leaf methods. However, such optimization would require
// changes in GC suspension architecture.
//
// We would need to guarantee that a tight loop calling a virtual leaf method can be suspended for GC. Today, we
// generate partially interruptible code for both the method that contains the tight loop with the call and the leaf
// method. GC suspension depends on return address hijacking in this case. Return address hijacking depends
// on the return address to be saved on the stack. If we skipped pushing/popping lr, the return address would never
// be saved on the stack and the GC suspension would time out.
//
// So if we wanted to skip pushing pushing/popping lr for leaf frames, we would also need to do one of
// the following to make GC suspension work in the above scenario:
// - Make return address hijacking work even when lr is not saved on the stack.
// - Generate fully interruptible code for loops that contains calls
// - Generate fully interruptible code for leaf methods
//
// Given the limited benefit from this optimization (<10k for mscorlib NGen image), the extra complexity
// is not worth it.

This decision was maintained when arm64 support was added.

Should this decision be reconsidered?

For arm64, empty methods have this minimum code size:

stp     fp, lr, [sp,#-16]!
mov     fp, sp
ldp     fp, lr, [sp],#16
ret     lr

Question: if function A was a loop with a lot of expensive computation (say, 1000 divides) and a single call to a trivial function (that is fully interruptible?), then the expensive loop is partially interruptible due to the call. But there will be only one instruction in the call that is a GC safe point (or maybe two?). Isn't GC starvation likely in this scenario?

Even simple leaf methods require this prolog/epilog, e.g.:

G_M1350_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp

G_M1350_IG02:
        F9401400          ldr     x0, [x0,#40]

G_M1350_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr

One argument to keep this is that simple leaf methods are likely to be inlined into their callers and therefore this prolog/epilog overhead isn't encountered in real programs.

The overhead measurement (<10k in the above comment for mscorlib => System.Private.CoreLib) could be recomputed for arm64 and the current situation, and include measurement of other libraries as well.

There might also be implications to not saving and establishing fp on debugging, stack walking, etc.

Comments? @jkotas @AndyAyersMS @kunalspathak

category:question theme:prolog-epilog skill-level:expert cost:large

jkotas commented 4 years ago

Even simple leaf methods require this prolog/epilog,

Does the simple method in your example really need to establish fp? fp should not be required for this (one less instruction).

if function A was a loop with a lot of expensive computation (say, 1000 divides) and a single call to a trivial function (that is fully interruptible?), then the expensive loop is partially interruptible due to the call. But there will be only one instruction in the call that is a GC safe point (or maybe two?). Isn't GC starvation likely in this scenario?

Yes, the current JIT heuristics around GC suspension really only guarantee that the GC suspension has a chance of succeeding. They do not guarantee that the GC suspension happens in timely manner. It is not hard to construct real-world looking code that hits issues like this. It is something we should be working towards fixing, but I am not sure we would want to fix this by generating fully interruptible code for all loops. It needs data about the tradeoffs.

Should this decision be reconsidered?

Sounds fine to me. It needs data about the tradeoffs.

jkotas commented 4 years ago

E.g. interesting data point would be: If we stopped generating the frame for small functions (ie smaller code) and started generating fully interruptible code for all loops (ie bigger GCInfo), is the CoreLib native image going to be smaller or bigger overall?

BruceForstall commented 4 years ago

Does the simple method in your example really need to establish fp?

It seems like it doesn't need to: I think we establish it for Windows ETW and stack walking (maybe just gdb debugging) on Linux. I'm not sure if it's required for correctness.

E.g. interesting data point would be...

I think fully-versus-partially interruptible was on a function basis; we don't do it just for particular loops or code regions. The question is whether the GC info supports it on a region basis.

Assuming it is supported (or could be engineered), then gathering that data would be interesting.

Question: as I understand it, GC suspension is, very roughly:

while (true) {
  stop all threads
  if (all threads are at a GC safe point)
    break;
  if (we're tried this multiple times and haven't succeeded) // or maybe immediately?
    set up return address hijacks
  restart all threads
  wait some amount of time
}

How many "tries" (or loops) normally happen before we succeed? (And how costly is it?) Would changing our fully interruptible / call / loop strategy make GC suspension faster and more reliable?

kunalspathak commented 4 years ago

Just to share one data point - In the R2R generated code of framework libraries for x64, 23% of methods don't have prolog/epilog. These methods can be optimized for arm64 as well.

jkotas commented 4 years ago

The question is whether the GC info supports it on a region basis.

I believe that the GC Info encoding (the non-x86 version at least) does support it on region basis. Look for DefineInterruptibleRange. It was used by JIT64.

Assuming it is supported (or could be engineered), then gathering that data would be interesting.

You can gather the data even for setting on the whole method: It will give us upper bound.

GC suspension is

Here is a more accurate version:

do {
  retry = false;
  foreach thread {
      if (thread in preemptive mode or stopped at GC safe point) continue;
      stop the thread
      if (in fully interruptible region) continue; // at GC safe point
      try to setup return address hijack
      resume thread
      retry = true;
  }
} while (!retry);

How many "tries" (or loops) normally happen before we succeed?

The typical case is that we go through one iteration of the loop, and by the time we get to the second one iteration all threads hit return address hijack or some other GC safe spot. The typical methods are small and so they will hit return address hijack quickly.

The fully interruptible is important for the long running loops. It is important for predictable performance - the 98+% latencies, etc.

VSadov commented 2 months ago

One interesting thought by @filipnavara is that partially interruptible leaf methods may not need to support external unhijacking as, once hijacked in such method, a thread will have to inevitably unhijack itself - either by hitting the hijack or exiting the frame exceptionally (i.e. deref a NULL case)

There is never a need to move a hijack lower, if it is already on a leaf call, and that could be useful property to consider.