Closed RalfKornmannEnvision closed 3 years ago
Could you please share performance numbers: how much larger (in %) is the native AOT code footprint with this change?
Currently it grows by ~2.5% if I keep the CFI unwind data and ~1% if it is removed (only the debugger should need it). It seems I need to improve the encoding some more. I also need to find a good way to measure the performance difference.
Did a quick check. It should be possible to get the unwind info smaller than the CFI data currently used.
Added a compact encoding. As it can cover most of the stack frames RyuJIT generates, the average size of the unwind data per function/funclet (for my test program) is now 2.11 bytes. For the same program the DWARF-based unwinder needs 8.46 bytes per function/funclet. If we remove the DWARF data in a release build, the executable will get smaller than before: by 108 KB for my test program. For larger programs with more functions the savings should be larger.
Performance of a GC.Collect(0) improves from .9 ms to 0.51 ms.
(I have not forgotten about this PR. I want to take a detailed look at it that I have not found time for yet.)
Take your time.
In the meantime I noticed that a big chunk of the improvement on Android comes not from the unwinding itself but from reusing the LSDA that was already looked up. On the Switch, however, I see a similar improvement from the new unwind code alone, as getting the LSDA is much faster there.
I have a second change ready for another PR that improves the search for the LSDA independently of the CPU architecture. But I don't know if Android is just a very bad case or if this happens on every *NIX. If I combine this PR and the other change, times go down to 0.11 ms per GC.Collect(0).
Implements a custom unwinder for ARM64 that does not need the generic CFI-based libunwind functions. This way we only need to loop once over all used registers instead of running the complicated CFI process. This will improve performance for everything that needs a stack walk.
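To illustrate the "loop once over all used registers" idea, here is a minimal sketch and not the code from this PR. The `SavedRegister`, `RegisterContext`, `StackView`, and `UnwindOneFrame` names are hypothetical, and the stack is modeled as a vector of 64-bit words so the example can run anywhere. It assumes each saved register lives at a fixed offset below the CFA, and that the caller's PC is the restored LR (x30), as is usual on ARM64.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoded entry: register `reg` was saved at address (CFA - offsetBelowCfa).
struct SavedRegister {
    unsigned reg;
    std::uint64_t offsetBelowCfa;
};

// Simplified ARM64 register context: x0..x30, sp, pc.
struct RegisterContext {
    std::uint64_t x[31] = {};
    std::uint64_t sp = 0;
    std::uint64_t pc = 0;
};

// Simulated stack memory: words[i] holds the 64-bit value at address base + 8 * i.
struct StackView {
    std::uint64_t base;
    std::vector<std::uint64_t> words;

    std::uint64_t Load(std::uint64_t address) const {
        return words[static_cast<std::size_t>((address - base) / 8)];
    }
};

// Unwind one frame with a single pass over the saved registers.
void UnwindOneFrame(RegisterContext& ctx, std::uint64_t cfa,
                    const std::vector<SavedRegister>& saved,
                    const StackView& stack) {
    for (const SavedRegister& entry : saved) {
        ctx.x[entry.reg] = stack.Load(cfa - entry.offsetBelowCfa);
    }
    ctx.sp = cfa;        // the CFA is the caller's SP at the call site
    ctx.pc = ctx.x[30];  // the restored LR is the return address
}
```

The point of the sketch is only that no CFI program has to be interpreted: once the CFA and the list of saved registers are known, restoring the caller's state is a single linear pass.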
Additional unwind data is generated and stored as part of the LSDA:
- 4 bytes for a relative offset
- 3 bytes for the CFA
- 3 bytes for each register that is part of the prolog save
- 1 byte for the end marker

We still keep the CFI data for the debugger. Maybe it could be removed for a build without debug information.
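The byte counts above give the record size directly; the sketch below computes it and walks the register entries. Only the sizes come from the description. The contents of each 3-byte register entry (one byte for the register number, then a 16-bit little-endian offset) and the `0xFF` end marker value are assumptions made for illustration, and all names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sizes taken from the PR description.
constexpr std::size_t kOffsetBytes = 4;     // relative offset
constexpr std::size_t kCfaBytes = 3;        // CFA description
constexpr std::size_t kRegEntryBytes = 3;   // one entry per saved register
constexpr std::size_t kEndMarkerBytes = 1;  // end marker

// Total size of one unwind record for a prolog that saves `savedRegs` registers.
constexpr std::size_t UnwindRecordSize(std::size_t savedRegs) {
    return kOffsetBytes + kCfaBytes + savedRegs * kRegEntryBytes + kEndMarkerBytes;
}

// ASSUMPTION: the end marker value is 0xFF (not stated in the PR).
constexpr std::uint8_t kEndMarker = 0xFF;

// ASSUMPTION: an entry is {register number, 16-bit little-endian offset}.
struct SavedReg {
    std::uint8_t reg;
    std::uint16_t offset;
};

// Read register entries until the end marker is reached.
// `p` must point at the first register entry, i.e. past the offset and CFA fields.
std::vector<SavedReg> ReadSavedRegs(const std::uint8_t* p) {
    std::vector<SavedReg> out;
    while (*p != kEndMarker) {
        const auto offset = static_cast<std::uint16_t>(p[1] | (p[2] << 8));
        out.push_back({p[0], offset});
        p += kRegEntryBytes;
    }
    return out;
}
```

With these sizes, a frame that saves only fp and lr needs 4 + 3 + 2 * 3 + 1 = 14 bytes in this uncompressed form.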
A possible future optimization would be to introduce a compact format for the information used by the most common cases. This could be stored in the relative offset. But compared to the replacement of libunwind the wins would be rather small.
The same can be done for any other supported architecture, but each one will require a custom unwind function.