[NativeAOT] Evaluate use/benefits of compact unwinding on osx-x64 and osx-arm64

filipnavara commented 2 years ago

Apple platforms use compact unwinding information to efficiently encode how to unwind the stack. The compact unwinding information is smaller than the DWARF CFI information that NativeAOT currently uses on macOS and Linux. However, it does not encode enough information to do asynchronous unwinding in the prologs and epilogs of functions. The benefit of using the compact unwinding codes would be smaller resulting binaries.

Upon investigation I found that ILCompiler already emits the DWARF CFI only for prologs and not for epilogs. UnixNativeCodeManager handles the epilogs by doing code inspection. A similar approach can be employed to unwind the prologs. As an experiment I took an osx-x64 object file produced by the NativeAOT compilation process and, for every function, compared the results of a trivial x64 prolog code walk with the offsets in the actual DWARF CFI code. In the vast majority of cases the prolog only uses two different instructions (push REG and sub RSP, <value>) before establishing the RBP frame, which can already be processed with the compact unwinding information. Only one method uses a more complex pattern to allocate a frame that's larger than the page size and where stack probing is needed. It would be simple to recognize that pattern too.

To be able to combine custom prolog unwinding with compact unwinding for the method body, we would need to know the size of the prolog. Unfortunately that information is currently not stored anywhere. The GcInfo structure can optionally store it in some cases, but for the majority of uses it's not present at the moment. We would likely need to store it as an extra byte in the LSDA structure.

It's not obvious whether using the compact unwinding would be a clear win. It adds code complexity that is specific to a single platform. I don't have any numbers at the moment to show how much space could be saved by the compact encoding in comparison to the current DWARF CFI encoding.

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

filipnavara commented 2 years ago

For reference, here's what the custom prolog unwinding code would look like in C++.

First, we would need to detect that we are in prolog:

#if defined(TARGET_AMD64) && defined(TARGET_OSX)
    // Compact unwinding on macOS cannot properly handle unwinding the function prolog
    // so we have to handle it explicitly
    if ((PTR_UInt8)pRegisterSet->IP < (PTR_UInt8)pNativeMethodInfo->pMethodStartAddress + decoder.GetPrologSize())
    {
        return UnwindProlog(pMethodInfo, pRegisterSet, ppvRetAddrLocation);
    }
#endif

...and we would need a couple of definitions/macros (shared with the existing UnixNativeCodeManager::TrailingEpilogueInstructionsCount code):

#ifdef TARGET_AMD64

#define SIZE64_PREFIX 0x48
#define ADD_IMM8_OP 0x83
#define ADD_IMM32_OP 0x81
#define JMP_IMM8_OP 0xeb
#define JMP_IMM32_OP 0xe9
#define JMP_IND_OP 0xff
#define LEA_OP 0x8d
#define REPNE_PREFIX 0xf2
#define REP_PREFIX 0xf3
#define POP_OP 0x58
#define PUSH_OP 0x50
#define RET_OP 0xc3
#define RET_OP_2 0xc2
#define INT3_OP 0xcc

#define IS_REX_PREFIX(x) (((x) & 0xf0) == 0x40)

#endif

...and finally the unwinding method:

bool UnixNativeCodeManager::UnwindProlog(MethodInfo *    pMethodInfo,
                                         REGDISPLAY *    pRegisterSet,
                                         PTR_PTR_VOID *  ppvRetAddrLocation)
{
#if defined(TARGET_AMD64)
    UnixNativeMethodInfo* pNativeMethodInfo = (UnixNativeMethodInfo*)pMethodInfo;
    uint8_t* pNextByte = (uint8_t*)pNativeMethodInfo->pMethodStartAddress;
    uint32_t stackOffset = 0;

    while (pNextByte < (uint8_t*)pRegisterSet->IP)
    {
        if ((pNextByte[0] & 0xf8) == PUSH_OP)
        {
            stackOffset += 8;
            pNextByte += 1;
        }
        else if (IS_REX_PREFIX(pNextByte[0]) && ((pNextByte[1] & 0xf8) == PUSH_OP))
        {
            stackOffset += 8;
            pNextByte += 2;
        }
        else if ((pNextByte[0] & 0xf8) == SIZE64_PREFIX &&
                 pNextByte[1] == ADD_IMM8_OP &&
                 pNextByte[2] == 0xec)
        {
            // sub rsp, imm8
            stackOffset += pNextByte[3];
            pNextByte += 4;
        }
        else if ((pNextByte[0] & 0xf8) == SIZE64_PREFIX &&
                 pNextByte[1] == ADD_IMM32_OP &&
                 pNextByte[2] == 0xec)
        {
            // sub rsp, imm32
            stackOffset +=
                (uint32_t)pNextByte[3] |
                ((uint32_t)pNextByte[4] << 8) |
                ((uint32_t)pNextByte[5] << 16) |
                ((uint32_t)pNextByte[6] << 24);
            pNextByte += 7;
        }
        else
        {
            // Bail out for anything that we cannot handle. This could be a breakpoint
            // (int 3) inserted by a debugger, or some more complicated prolog pattern
            // like the stack probing:
            //
            //     lea r11, [rsp-XXX]
            //     call __chkstk
            //     mov rsp, r11
            //
            // Additionally, these sequences may establish the prolog frame but we don't
            // need to handle them since they are always the last instruction of the
            // prolog and thus regular unwinding should work:
            //
            //     lea rbp, [rsp+IMM8]
            //     lea rbp, [rsp+IMM32]
            return false;
        }
    }

    *ppvRetAddrLocation = (PTR_PTR_VOID)(pRegisterSet->GetSP() + stackOffset);
    return true;
#else
    PORTABILITY_ASSERT("UnwindProlog");
    return false;
#endif
}
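
To make the walk easier to try outside the runtime, here is a minimal standalone sketch of the same idea operating on a raw byte buffer. `WalkProlog` and `kSampleProlog` are hypothetical names introduced only for this illustration; the sketch adds bounds checks and matches the exact REX.W prefix (0x48) for `sub rsp`, which is stricter than the masked comparison in the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: walks the prolog bytes in [code, code + ipOffset) and
// reports how many bytes of stack they have pushed or allocated so far.
// Returns false as soon as it meets an instruction it does not recognize.
bool WalkProlog(const uint8_t* code, size_t codeSize, size_t ipOffset, uint32_t* stackOffset)
{
    assert(code != nullptr && stackOffset != nullptr && ipOffset <= codeSize);

    uint32_t offset = 0;
    size_t i = 0;
    while (i < ipOffset)
    {
        const size_t remaining = codeSize - i;
        if ((code[i] & 0xf8) == 0x50)
        {
            // push r64 (rax..rdi)
            offset += 8;
            i += 1;
        }
        else if (remaining >= 2 && (code[i] & 0xf0) == 0x40 && (code[i + 1] & 0xf8) == 0x50)
        {
            // REX prefix + push r64 (r8..r15)
            offset += 8;
            i += 2;
        }
        else if (remaining >= 4 && code[i] == 0x48 && code[i + 1] == 0x83 && code[i + 2] == 0xec)
        {
            // sub rsp, imm8
            offset += code[i + 3];
            i += 4;
        }
        else if (remaining >= 7 && code[i] == 0x48 && code[i + 1] == 0x81 && code[i + 2] == 0xec)
        {
            // sub rsp, imm32 (little-endian immediate)
            offset += (uint32_t)code[i + 3] | ((uint32_t)code[i + 4] << 8) |
                      ((uint32_t)code[i + 5] << 16) | ((uint32_t)code[i + 6] << 24);
            i += 7;
        }
        else
        {
            return false; // unknown instruction, let the caller fall back
        }
    }

    *stackOffset = offset;
    return true;
}

// Sample prolog: push rbp; push r15; push rbx; sub rsp, 0x28; lea rbp, [rsp+0x30]
const uint8_t kSampleProlog[] = {
    0x55,                         // push rbp
    0x41, 0x57,                   // push r15
    0x53,                         // push rbx
    0x48, 0x83, 0xec, 0x28,       // sub rsp, 0x28
    0x48, 0x8d, 0x6c, 0x24, 0x30  // lea rbp, [rsp+0x30]
};
```

With the instruction pointer at byte offset 8 (just before the `lea`), the walk accumulates 8 + 8 + 8 + 0x28 bytes, so the return address sits 64 bytes above the current SP. Walking past the `lea` makes the helper bail out, which matches the bail-out path in the snippet above.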

filipnavara commented 2 months ago

I looked into prototyping this on ARM64 Apple platforms: https://github.com/filipnavara/runtime/pull/new/arm64-compact-unwind

The branch is on top of the frameless prototype from issue https://github.com/dotnet/runtime/issues/35274#issuecomment-2317616731; only the last commit contains the JIT and ObjWriter changes relevant to this PR. While the two changes are somewhat orthogonal, I also implemented a more generic algorithm for computing the compact unwinding code, and the frameless methods provided additional test cases.

The rough overview of the changes:

Generate frame type 4/5 with FP/LR saved on the top. The support for this already exists in the JIT, so we just need to call SetSaveFpLrWithAllCalleeSavedRegisters in the appropriate place and in the right conditions.
Try to store the callee-saved registers in pairs even if the other register from the pair is unused. While it may sound wasteful, this doesn't really increase code size, and likely doesn't impact speed on most processors either. In the rare case where the spill counts of both floating-point and integer registers are odd, it may consume 16 more bytes of stack.
Save the X19-X28, D8-D15 callee-saved registers on the stack in the opposite order.
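
As a rough illustration of why pair-wise saving matters: the ARM64 compact unwind encoding for the frame-based mode has one bit per callee-saved register pair, so a pair is either saved as a whole or not at all. The sketch below uses the bit values from Apple's `<mach-o/compact_unwind_encoding.h>`; the `SavedRegs` struct and `EncodeArm64Frame` function are hypothetical names for this example and are not taken from the prototype. Falling back to the DWARF mode is simplified here (a real encoding would also carry the section offset).

```cpp
#include <cassert>
#include <cstdint>

// Values mirror UNWIND_ARM64_MODE_FRAME / UNWIND_ARM64_MODE_DWARF and the
// UNWIND_ARM64_FRAME_*_PAIR bits from <mach-o/compact_unwind_encoding.h>.
constexpr uint32_t kArm64ModeDwarf = 0x03000000;
constexpr uint32_t kArm64ModeFrame = 0x04000000;
constexpr uint32_t kFrameX19X20Pair = 0x00000001; // X21/X22 = 0x2, ... X27/X28 = 0x10
constexpr uint32_t kFrameD8D9Pair   = 0x00000100; // D10/D11 = 0x200, ... D14/D15 = 0x800

// Hypothetical description of what a prolog spills.
struct SavedRegs
{
    uint32_t intMask; // bit n set => X(19+n) is used, n in 0..9
    uint32_t fpMask;  // bit n set => D(8+n) is used, n in 0..7
};

// Sketch: compute the compact unwinding code for a method with a standard
// FP/LR frame. If either register of a pair is used, the whole pair is marked
// as saved, which is why the prolog has to spill both registers.
uint32_t EncodeArm64Frame(SavedRegs regs, bool usesStandardFrame)
{
    assert(regs.intMask < (1u << 10) && regs.fpMask < (1u << 8));

    if (!usesStandardFrame)
        return kArm64ModeDwarf; // simplified fallback, assumption of this sketch

    uint32_t encoding = kArm64ModeFrame;
    for (int pair = 0; pair < 5; pair++)
    {
        if (regs.intMask & (3u << (2 * pair)))
            encoding |= kFrameX19X20Pair << pair;
    }
    for (int pair = 0; pair < 4; pair++)
    {
        if (regs.fpMask & (3u << (2 * pair)))
            encoding |= kFrameD8D9Pair << pair;
    }
    return encoding;
}
```

For example, a method that uses X19, X21, X22 and D8 (`intMask = 0xD`, `fpMask = 0x1`) would be described as saving the X19/X20, X21/X22 and D8/D9 pairs, giving the encoding `0x04000103`.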

Challenges:

For large frames, this layout may lead to a code size increase. In some cases it may be quite significant. If we could identify these cases early enough (in relation to when SetSaveFpLrWithAllCalleeSavedRegisters needs to be called), then we could eliminate this overhead by falling back to the current frame layouts. Presumably this won't affect too many methods, but I have observed it disproportionately affecting code generated from XAML, which uses a ton of local variables.
Many funclet prologs save the FP/LR pair on the stack but don't use/set the frame pointer. This is not expressible with the compact unwinding codes.

So, how well does it perform?

To give you an idea of how big the difference is, I recompiled an empty .NET MAUI app with the above changes. The baseline was compiled with frameless methods, which saves around 27Kb of code size compared to main .NET 10 as of this writing. With the change in https://github.com/dotnet/runtime/commit/3f208ce5de3197519f01fdc922b7e9d1b6738acc the size of the code section increased by 183,696 bytes (+3.4%), and the size of the unwinding information decreased by 884,932 bytes (-90%).

The code size increase was extremely disproportionate. For example, method _maui_empty_app___XamlGeneratedCode_____Type055F947991421E4D__InitializeComponent increased by 53.5Kb. It has an extremely large frame, and the assigned variable addresses ended up with pretty much the worst-case offsets, which are not representable as an immediate offset in the ARM64 instructions. I believe this can be mitigated to some extent, but the conservative solution is to predict large frames (which end up being represented as "frame type 5") and fall back to the current frame layout. I'm not quite sure how to nicely implement this in the JIT, but some quick and dirty hacks showed that it can reduce the code size increase significantly.
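
To illustrate what "not representable" means here, the sketch below checks whether an 8-byte load or store at a given base-relative offset fits into a single A64 instruction. The ranges are those of the architecture's `LDR`/`STR` (unsigned scaled 12-bit immediate) and `LDUR`/`STUR` (signed unscaled 9-bit immediate) forms; the function name is hypothetical, and treating this as the main source of the regression is an assumption of the example rather than something measured in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: can an 8-byte access at [base, #offset] be encoded in
// one A64 load/store instruction? Assumes 8-byte aligned stack slots.
bool FitsInSingleLoadStore64(int64_t offset)
{
    assert(offset % 8 == 0);

    if (offset >= 0)
    {
        // LDR/STR (unsigned offset): 12-bit immediate scaled by 8 => 0..32760
        return offset <= 32760;
    }

    // LDUR/STUR (unscaled): 9-bit signed immediate => down to -256
    return offset >= -256;
}
```

Under these rules a local at a positive offset can be reached directly across almost 32 KB, while a local at a negative offset is only reachable directly within 256 bytes of the base register. A layout that places most locals at large negative offsets from the base register therefore needs extra address computation per access.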

filipnavara commented 2 months ago

One more thought - the code size increase could likely be mitigated by enabling support for double-aligned frames, i.e. accessing locals through SP if possible. That would be a larger change though.

~~UPD: Maybe it would not be such a huge change after all, seems like it could be done in lvaFixVirtualFrameOffsets.~~

MichalStrehovsky commented 2 months ago

Thanks for looking into this!

Cc @VSadov since he knows about unwinding

filipnavara commented 2 months ago

I have a slightly more refined version of the prototype: https://github.com/filipnavara/runtime/tree/arm64-compact-unwind-1.

I managed to mitigate most of the code size increase (aside from the +4 bytes for prologs with temporaries/locals and other code size changes related to alignment). It turns out ARM32 already has the optimization for turning FP-based offsets into SP-based offsets in lvaFrameAddress, so it's possible to mimic the logic there without going through all the trouble of enabling full double-aligned frame support.

The conservative condition is enabling the Apple-style prologs for all methods with isFramePointerRequired() == false. It essentially excludes all methods with exception handling or localloc.

With these tweaks the stats for dotnet new maui app are as follows:

26171 methods get represented by compact unwinding
5406 methods get represented by DWARF unwinding (incl. funclets that are counted as separate methods, about 10% of this number)
742Kb (77%) of the DWARF unwinding section gets eliminated.
48Kb code size increase (becomes about half with disabled loop alignment).
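
The figures above can be combined into a quick back-of-the-envelope summary. The sketch only reuses the numbers quoted in the list; the constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Numbers quoted in the list above.
constexpr uint64_t kCompactMethods = 26171;
constexpr uint64_t kDwarfMethods = 5406;
constexpr int kDwarfSavedKb = 742;
constexpr int kCodeGrowthKb = 48;

// Share of `part` in `total`, in tenths of a percent (integer, rounded down).
uint64_t PerMille(uint64_t part, uint64_t total)
{
    assert(total > 0);
    return part * 1000 / total;
}

// Net size change: unwinding data removed minus the code size increase.
int NetSavingKb()
{
    return kDwarfSavedKb - kCodeGrowthKb;
}
```

Out of 31,577 methods in total, roughly 82.8% end up with compact unwinding, and the net saving is about 694Kb once the code size increase is subtracted.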

We could further tweak the heuristic to opt smaller methods with exception handling into the Apple prologs. This can likely save another 10% of the size of the DWARF unwinding data, but it's a more nuanced heuristic to get right.

filipnavara commented 1 month ago

Latest branch: https://github.com/dotnet/runtime/compare/main...filipnavara:arm64-compact-unwind-3?expand=1

It passed the CI.

dotnet / runtime

[NativeAOT] Evaluate use/benefits of compact unwinding on osx-x64 and osx-arm64 #76371