ARM64&Unix: New unwinder

dotnet / corert

This repo contains CoreRT, an experimental .NET Core runtime optimized for AOT (ahead of time compilation) scenarios, with the accompanying compiler toolchain.

http://dot.net

MIT License


ARM64&Unix: New unwinder #8345

Closed RalfKornmannEnvision closed 3 years ago

RalfKornmannEnvision commented 4 years ago

Implements a custom unwinder for ARM64 that does not need the generic CFI-based libunwind functions. This way we only need to loop once over all used registers instead of going through the complicated CFI process. This will improve performance for everything that needs a stack walk.

Additional unwind data is generated and stored as part of the LSDA:

- 4 bytes for a relative offset
- 3 bytes for the CFA
- 3 bytes for each register that is part of the prolog save
- 1 byte end marker
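As a rough illustration of how a record with this layout could be consumed in a single pass, here is a minimal sketch. It is not the code from this PR: the meaning of the bytes inside each 3-byte entry (one register byte followed by a 16-bit little-endian offset), the `0xFF` end marker, the register numbering, and the names `Registers`, `ReadU16` and `UnwindOneFrame` are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical record layout, for illustration only:
//   4 bytes          relative offset (skipped here)
//   3 bytes          CFA rule:   [base register, 16-bit LE offset added to it]
//   3 bytes per reg  saved reg:  [register number, 16-bit LE offset below the CFA]
//   1 byte           end marker, assumed to be 0xFF
constexpr uint8_t kEndMarker = 0xFF;
constexpr int kRegCount = 32;   // assumed numbering: x0..x30, sp in slot 31
constexpr int kSpIndex = 31;

struct Registers { uint64_t r[kRegCount]; };

static uint16_t ReadU16(const uint8_t* p)
{
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

// Restores the caller's registers by walking the record once.
// Returns the number of bytes consumed, including the end marker.
size_t UnwindOneFrame(const uint8_t* data, Registers& regs)
{
    assert(data != nullptr);
    const uint8_t* p = data + 4;                    // skip the relative offset

    assert(p[0] < kRegCount);
    uint64_t cfa = regs.r[p[0]] + ReadU16(p + 1);   // CFA = base register + offset
    p += 3;

    while (*p != kEndMarker)
    {
        assert(p[0] < kRegCount);
        uint64_t slot = cfa - ReadU16(p + 1);       // address of the saved value
        std::memcpy(&regs.r[p[0]],
                    reinterpret_cast<const void*>(static_cast<uintptr_t>(slot)),
                    sizeof(uint64_t));
        p += 3;
    }

    regs.r[kSpIndex] = cfa;                         // caller's SP is the CFA
    return static_cast<size_t>(p - data) + 1;
}
```

A frame that saves only x29 and x30 would need 4 + 3 + 2 * 3 + 1 = 14 bytes in this layout, and the loop touches each saved register exactly once without interpreting any CFI instructions.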

We still keep the CFI data for the debugger. Maybe they could be removed for a build without debug information.

A possible future optimization would be to introduce a compact format for the information used by the most common cases. This could be stored in the relative offset. But compared to the replacement of libunwind the wins would be rather small.

The same can be done for any other supported architecture, but each one will require a custom unwind function.

jkotas commented 4 years ago

Could you please share performance numbers:

How much bigger (%) is the native AOT code footprint with this change?
What is the median GC pause time you see before and after this change?

RalfKornmannEnvision commented 4 years ago

Currently it grows by ~2.5% if I keep the CFI unwind data and ~1% if they are removed (only the debugger should need them). Seems I need to improve the encoding some more. I also need to find a good way to measure the performance difference.

RalfKornmannEnvision commented 4 years ago

Did a quick check. It should be possible to get the unwind info smaller than the CFI data currently used.

RalfKornmannEnvision commented 4 years ago

Added a compact encoding. As it can cover most of the stack frames RyuJIT generates, the average size of the unwind data per function/funclet (for my test program) is now 2.11 bytes. For the same program the DWARF-based unwinder needs 8.46 bytes per function/funclet. If we remove the DWARF data in a release build, the executable will get smaller than before. For my test program the difference is 108 KB. For larger programs with more functions the savings should be larger.

RalfKornmannEnvision commented 4 years ago

Performance for a GC.Collect(0) improves from 0.9 ms to 0.51 ms.

jkotas commented 4 years ago

(I have not forgotten about this PR. I want to take a detailed look at it, but I have not found the time yet.)

RalfKornmannEnvision commented 4 years ago

Take your time.

In the meantime I noticed that a big chunk of the improvement on Android comes not from the unwinding itself but from reusing the LSDA that was already looked up. On the Switch, however, I see a similar improvement from the new unwind code alone, as getting the LSDA is much faster there.

I have a second change ready for another PR that improves the search for the LSDA independently of the CPU architecture. But I don't know if Android is just a very bad case or if this happens on every *NIX. If I combine this PR and the other change, times go down to 0.11 ms per GC.Collect(0).
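To illustrate the general idea of reusing an LSDA that has already been found, here is a minimal sketch of a one-entry lookup cache. It is not the change referred to above: `MethodRange`, `LsdaCache` and the `slowLookup` callback are hypothetical names that stand in for whatever the runtime's expensive search actually is, and thread safety is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical description of one method: code range plus its LSDA pointer.
struct MethodRange
{
    uintptr_t start;        // first address of the method's code
    uintptr_t end;          // one past the last address
    const uint8_t* lsda;    // unwind/exception data for the method
};

// Remembers the most recent successful lookup so that repeated queries for
// the same method (for example a method lookup followed by the unwind step)
// do not repeat the expensive search.
class LsdaCache
{
public:
    using SlowLookup = std::function<bool(uintptr_t, MethodRange&)>;

    explicit LsdaCache(SlowLookup slowLookup) : slow_(std::move(slowLookup))
    {
        assert(slow_ != nullptr);
    }

    // Returns the LSDA for the method containing pc, or nullptr if unknown.
    const uint8_t* Find(uintptr_t pc)
    {
        if (valid_ && pc >= last_.start && pc < last_.end)
            return last_.lsda;              // hit: no search needed

        MethodRange found{};
        if (!slow_(pc, found))
            return nullptr;                 // miss: keep the previous entry

        last_ = found;
        valid_ = true;
        return last_.lsda;
    }

private:
    SlowLookup slow_;
    MethodRange last_{};
    bool valid_ = false;
};
```

With a cache like this, two consecutive queries for addresses inside the same method trigger only one call to the slow search; how much that helps depends on how expensive the search is on a given platform.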