Open filipnavara opened 2 years ago
For reference, here's what the custom prolog unwinding code would look like in C.
First, we would need to detect that we are in the prolog:
#if defined(TARGET_AMD64) && defined(TARGET_OSX)
    // Compact unwinding on macOS cannot properly handle unwinding the function prolog
    // so we have to handle it explicitly
    if ((PTR_UInt8)pRegisterSet->IP < (PTR_UInt8)pNativeMethodInfo->pMethodStartAddress + decoder.GetPrologSize())
    {
        return UnwindProlog(pMethodInfo, pRegisterSet, ppvRetAddrLocation);
    }
#endif
...and we would need a couple of definitions/macros (shared with the existing UnixNativeCodeManager::TrailingEpilogueInstructionsCount code):
#ifdef TARGET_AMD64
#define SIZE64_PREFIX 0x48
#define ADD_IMM8_OP 0x83
#define ADD_IMM32_OP 0x81
#define JMP_IMM8_OP 0xeb
#define JMP_IMM32_OP 0xe9
#define JMP_IND_OP 0xff
#define LEA_OP 0x8d
#define REPNE_PREFIX 0xf2
#define REP_PREFIX 0xf3
#define POP_OP 0x58
#define PUSH_OP 0x50
#define RET_OP 0xc3
#define RET_OP_2 0xc2
#define INT3_OP 0xcc
#define IS_REX_PREFIX(x) (((x) & 0xf0) == 0x40)
#endif
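To illustrate how the masks in these definitions are meant to be used, here is a small stand-alone sketch that recognizes `push r64` and recovers the register number. The helper name `DecodePushRegister` is hypothetical and not part of the runtime; only the opcode values come from the macros above.

```cpp
#include <cstdint>

// Same values as PUSH_OP and IS_REX_PREFIX above, repeated so the sketch
// compiles on its own.
constexpr uint8_t kPushOp = 0x50;

constexpr bool IsRexPrefix(uint8_t b) { return (b & 0xf0) == 0x40; }

// Hypothetical helper: returns the register number (0-15) pushed by the
// instruction at `code`, or -1 if the bytes are not a `push r64`.
int DecodePushRegister(const uint8_t* code)
{
    // 0x50..0x57: push rax..rdi, register number in the low three bits.
    if ((code[0] & 0xf8) == kPushOp)
        return code[0] & 0x07;

    // REX prefix followed by 0x50..0x57: REX.B (bit 0) selects r8..r15.
    if (IsRexPrefix(code[0]) && (code[1] & 0xf8) == kPushOp)
        return (code[1] & 0x07) | ((code[0] & 0x01) << 3);

    return -1;
}
```

For example, the single byte `0x55` decodes as register 5 (`rbp`), and `0x41 0x57` decodes as register 15 (`r15`).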
...and finally the unwinding method:
bool UnixNativeCodeManager::UnwindProlog(MethodInfo *    pMethodInfo,
                                         REGDISPLAY *    pRegisterSet,
                                         PTR_PTR_VOID *  ppvRetAddrLocation)
{
#if defined(TARGET_AMD64)
    UnixNativeMethodInfo* pNativeMethodInfo = (UnixNativeMethodInfo*)pMethodInfo;
    uint8_t* pNextByte = (uint8_t*)pNativeMethodInfo->pMethodStartAddress;
    uint32_t stackOffset = 0;

    while (pNextByte < (uint8_t*)pRegisterSet->IP)
    {
        if ((pNextByte[0] & 0xf8) == PUSH_OP)
        {
            // push reg (rax..rdi)
            stackOffset += 8;
            pNextByte += 1;
        }
        else if (IS_REX_PREFIX(pNextByte[0]) && ((pNextByte[1] & 0xf8) == PUSH_OP))
        {
            // push reg (r8..r15)
            stackOffset += 8;
            pNextByte += 2;
        }
        else if ((pNextByte[0] & 0xf8) == SIZE64_PREFIX &&
                 pNextByte[1] == ADD_IMM8_OP &&
                 pNextByte[2] == 0xec)
        {
            // sub rsp, imm8
            stackOffset += pNextByte[3];
            pNextByte += 4;
        }
        else if ((pNextByte[0] & 0xf8) == SIZE64_PREFIX &&
                 pNextByte[1] == ADD_IMM32_OP &&
                 pNextByte[2] == 0xec)
        {
            // sub rsp, imm32
            stackOffset +=
                (uint32_t)pNextByte[3] |
                ((uint32_t)pNextByte[4] << 8) |
                ((uint32_t)pNextByte[5] << 16) |
                ((uint32_t)pNextByte[6] << 24);
            pNextByte += 7;
        }
        else
        {
            // Bail out for anything that we cannot handle. This could be a breakpoint
            // (int 3) inserted by a debugger, or some more complicated prolog pattern
            // like the stack probing:
            //
            //   lea r11, [rsp-XXX]
            //   call __chkstk
            //   mov rsp, r11
            //
            // Additionally, these sequences may establish the prolog frame but we don't
            // need to handle them since they are always the last instruction of the
            // prolog and thus regular unwinding should work:
            //
            //   lea rbp, [rsp+IMM8]
            //   lea rbp, [rsp+IMM32]
            return false;
        }
    }

    *ppvRetAddrLocation = (PTR_PTR_VOID)(pRegisterSet->GetSP() + stackOffset);
    return true;
#else
    PORTABILITY_ASSERT("UnwindProlog");
#endif
}
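To make the behaviour of the loop easier to check by hand, here is a hedged stand-alone variant that works on a plain byte buffer instead of runtime types. `ComputePrologStackOffset` and its parameters are hypothetical names introduced for illustration; the decoding rules are copied from the method above, and the sketch assumes that `ipOffset` falls on an instruction boundary.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-alone version of the prolog walk. `code` points at the
// method start, `ipOffset` is the offset of the current instruction pointer
// from the method start. On success, *stackOffset receives the number of
// bytes between SP and the return address slot.
bool ComputePrologStackOffset(const uint8_t* code, size_t ipOffset, uint32_t* stackOffset)
{
    uint32_t offset = 0;
    size_t i = 0;
    while (i < ipOffset)
    {
        if ((code[i] & 0xf8) == 0x50)                                  // push rax..rdi
        {
            offset += 8;
            i += 1;
        }
        else if ((code[i] & 0xf0) == 0x40 && (code[i + 1] & 0xf8) == 0x50) // push r8..r15
        {
            offset += 8;
            i += 2;
        }
        else if ((code[i] & 0xf8) == 0x48 && code[i + 1] == 0x83 && code[i + 2] == 0xec)
        {
            offset += code[i + 3];                                     // sub rsp, imm8
            i += 4;
        }
        else if ((code[i] & 0xf8) == 0x48 && code[i + 1] == 0x81 && code[i + 2] == 0xec)
        {
            offset += (uint32_t)code[i + 3] | ((uint32_t)code[i + 4] << 8) |
                      ((uint32_t)code[i + 5] << 16) | ((uint32_t)code[i + 6] << 24);
            i += 7;                                                    // sub rsp, imm32
        }
        else
        {
            return false;                                              // unknown pattern
        }
    }
    *stackOffset = offset;
    return true;
}

// Example prolog:
//   55              push rbp
//   41 57           push r15
//   53              push rbx
//   48 83 ec 28     sub  rsp, 0x28
//   48 8d 6c 24 40  lea  rbp, [rsp+0x40]
const uint8_t kSampleProlog[] = {
    0x55, 0x41, 0x57, 0x53, 0x48, 0x83, 0xec, 0x28, 0x48, 0x8d, 0x6c, 0x24, 0x40
};
```

With the instruction pointer at offset 4 (just before the `sub`), three pushes have executed and the return address is 24 bytes above SP; at offset 8 (just before the `lea`) it is 64 bytes above SP. An instruction pointer past the `lea` makes the walk bail out, which is the case where regular unwinding is expected to take over.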
I looked into prototyping this on ARM64 Apple platforms: https://github.com/filipnavara/runtime/pull/new/arm64-compact-unwind
The branch is on top of the frameless prototype from issue https://github.com/dotnet/runtime/issues/35274#issuecomment-2317616731; only the last commit contains the JIT and ObjWriter changes relevant to this PR. While the two changes are somewhat orthogonal, I also implemented a more generic algorithm for computing the compact unwinding code, and the frameless methods provided additional test cases.
The rough overview of the changes:
SetSaveFpLrWithAllCalleeSavedRegisters in the appropriate place and in the right conditions.
Challenges:
SetSaveFpLrWithAllCalleeSavedRegisters needs to be called) then we can eliminate this overhead by falling back to the current frame layouts. Presumably this won't affect too many methods, but I have observed it to disproportionately affect code generated from XAML, which uses a ton of local variables.
So, how well does it perform?
To give you an idea of how big the difference is, I recompiled an empty .NET MAUI app with the above changes. The baseline was compiled with frameless methods, which saves around 27 KB of code size compared to main (.NET 10 as of this writing). With the change in https://github.com/dotnet/runtime/commit/3f208ce5de3197519f01fdc922b7e9d1b6738acc the size of the code section increased by 183,696 bytes (+3.4%) and the size of the unwinding information decreased by 884,932 bytes (-90%).
The code size increase was extremely disproportionate. For example, the method _maui_empty_app___XamlGeneratedCode_____Type055F947991421E4D__InitializeComponent increased by 53.5 KB. It has an extremely large frame, and the assigned variable addresses ended up with pretty much the worst-case offsets, which are not representable as an immediate offset in the ARM64 instructions. I believe this can be mitigated to some extent, but the conservative solution is to predict large frames (which end up being represented as "frame type 5") and fall back to the current frame layout. I'm not quite sure how to implement this nicely in the JIT, but some quick and dirty hacks showed that it can reduce the code size increase significantly.
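To make the "not representable with immediate offset" point concrete, here is a rough sketch of the two immediate forms available to a 64-bit ARM64 load/store: a scaled unsigned 12-bit offset (LDR/STR) and an unscaled signed 9-bit offset (LDUR/STUR). The function names are hypothetical, and the assumption that locals sit at negative offsets below FP in the Apple-style frame is mine, not something measured from the prototype.

```cpp
#include <cstdint>

// LDR/STR Xt, [Xn, #imm]: unsigned 12-bit immediate scaled by 8,
// i.e. offsets 0..32760 that are multiples of 8.
constexpr bool FitsScaledUnsigned(int64_t offset)
{
    return offset >= 0 && offset <= 4095 * 8 && offset % 8 == 0;
}

// LDUR/STUR Xt, [Xn, #imm]: signed 9-bit unscaled immediate, -256..255.
constexpr bool FitsUnscaledSigned(int64_t offset)
{
    return offset >= -256 && offset <= 255;
}

// True if an 8-byte local at `offset` from the base register can be
// accessed with a single load/store instruction.
constexpr bool FitsSingleLoadStore(int64_t offset)
{
    return FitsScaledUnsigned(offset) || FitsUnscaledSigned(offset);
}
```

Under that assumption, a local at `[fp, #-4096]` needs extra instructions to materialize the address, while the same slot addressed as `[sp, #4096]` fits in one instruction. Only the first 256 bytes below FP are cheap to reach, which would explain why a method with a very large frame grows so much.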
One more thought: the code size increase could likely be mitigated by enabling support for double-aligned frames, i.e. accessing locals through SP where possible. That would be a larger change though.
UPD: Maybe it would not be such a huge change after all; it seems like it could be done in lvaFixVirtualFrameOffsets.
Thanks for looking into this!
Cc @VSadov since he knows about unwinding
I have a slightly more refined version of the prototype: https://github.com/filipnavara/runtime/tree/arm64-compact-unwind-1.
I managed to mitigate most of the code size increase (aside from the +4 bytes for prologs with temporaries/locals and other code size changes related to alignment). It turns out ARM32 already has an optimization for turning FP-based offsets into SP-based offsets in lvaFrameAddress, so it's possible to mimic the logic there without going through all the trouble of enabling full double-aligned frame support.
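The idea of turning an FP-based offset into an SP-based one can be sketched as follows. This is not the JIT's lvaFrameAddress logic, only an illustration of the arithmetic; `FrameAccess`, `IsEncodableOffset` and `ChooseBaseRegister` are hypothetical names, and the sketch assumes 8-byte accesses and an SP that does not move in the method body.

```cpp
#include <cstdint>

// Hypothetical description of how a local is addressed; not a JIT type.
struct FrameAccess
{
    bool    spBased;  // true: [sp, #offset], false: [fp, #offset]
    int64_t offset;
};

// Assumption: 8-byte ARM64 loads/stores, which accept either a scaled
// unsigned offset (0..32760, multiple of 8) or an unscaled signed offset
// (-256..255).
bool IsEncodableOffset(int64_t offset)
{
    bool scaled   = offset >= 0 && offset <= 32760 && (offset % 8) == 0;
    bool unscaled = offset >= -256 && offset <= 255;
    return scaled || unscaled;
}

// fpOffset:    offset of the local relative to FP.
// fpToSpDelta: FP - SP in the method body; only constant when SP does not
//              change after the prolog (for example, no localloc).
FrameAccess ChooseBaseRegister(int64_t fpOffset, int64_t fpToSpDelta)
{
    // The same stack slot expressed relative to SP.
    int64_t spOffset = fpOffset + fpToSpDelta;

    // Prefer FP unless only the SP-relative form fits in one instruction.
    if (!IsEncodableOffset(fpOffset) && IsEncodableOffset(spOffset))
        return { true, spOffset };
    return { false, fpOffset };
}
```

For a frame where FP is 8192 bytes above SP, a local at `[fp, #-4096]` becomes `[sp, #4096]`, while a local at `[fp, #-16]` stays FP-based.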
The conservative condition is enabling the Apple-style prologs for all methods with isFramePointerRequired() == false. It essentially excludes all methods with exception handling or localloc.
With these tweaks the stats for a dotnet new maui app are as follows:
We could further tweak the heuristic to opt smaller methods with exception handling into the Apple prologs. This can likely save another 10% of the size of the DWARF unwinding data, but it's a more nuanced heuristic to get right.
Latest branch: https://github.com/dotnet/runtime/compare/main...filipnavara:arm64-compact-unwind-3?expand=1
It passed the CI.
Apple platforms use compact unwinding information to efficiently encode how to do stack unwinding. Compared to the DWARF CFI information currently used by NativeAOT on macOS and Linux, the compact unwinding information is smaller, but it does not encode enough information to do asynchronous unwinding in the prolog/epilog of functions. The benefit of using the compact unwinding codes would be smaller resulting binaries.
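As a rough illustration of why the compact form is small: each function is described by a single 32-bit value instead of a sequence of CFI opcodes. The sketch below builds two ARM64 encodings. The constant values are reproduced from my recollection of Apple's `compact_unwind_encoding.h` and should be checked against the header; the helper functions are hypothetical.

```cpp
#include <cstdint>

// Assumed to match the values in Apple's compact_unwind_encoding.h.
constexpr uint32_t UNWIND_ARM64_MODE_FRAMELESS            = 0x02000000;
constexpr uint32_t UNWIND_ARM64_MODE_FRAME                = 0x04000000;
constexpr uint32_t UNWIND_ARM64_FRAME_X19_X20_PAIR        = 0x00000001;
constexpr uint32_t UNWIND_ARM64_FRAME_X21_X22_PAIR        = 0x00000002;
constexpr uint32_t UNWIND_ARM64_FRAMELESS_STACK_SIZE_MASK = 0x00FFF000;

// Hypothetical helper: frame-based function (fp/lr saved), with a bit mask
// of the callee-saved register pairs stored below the frame record.
constexpr uint32_t EncodeFrame(uint32_t savedPairs)
{
    return UNWIND_ARM64_MODE_FRAME | savedPairs;
}

// Hypothetical helper: frameless function; the stack size is stored in
// units of 16 bytes in bits 12..23.
constexpr uint32_t EncodeFrameless(uint32_t stackSizeInBytes)
{
    return UNWIND_ARM64_MODE_FRAMELESS |
           (((stackSizeInBytes / 16) << 12) & UNWIND_ARM64_FRAMELESS_STACK_SIZE_MASK);
}
```

A function that saves fp/lr plus x19-x22 would be described by `EncodeFrame(UNWIND_ARM64_FRAME_X19_X20_PAIR | UNWIND_ARM64_FRAME_X21_X22_PAIR)`, and a frameless leaf with a 32-byte stack by `EncodeFrameless(32)`. The encoding describes only the fully established frame, which is why the prolog and epilog need separate handling.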
Upon investigation I found that ILCompiler already emits the DWARF CFI only for prologs and not for epilogs. UnixNativeCodeManager handles the epilogs by doing code inspection. A similar approach can be employed to unwind the prologs. As an experiment I took an osx-x64 object file produced by the NativeAOT compilation process and for every function I compared the results of a trivial x64 prolog code walk with the offsets in the actual DWARF CFI code. In the vast majority of cases the prolog only uses two different instructions (push REG and sub RSP, <value>) before establishing the RBP frame, which can already be processed with the compact unwinding information. Only one method uses a more complex pattern to allocate a frame that is larger than the page size and where stack probing is needed. It would be simple to recognize that pattern too.
To be able to use the combination of custom prolog unwinding and compact unwinding for the method body we would need to know the size of the prolog. Unfortunately that information is currently not stored anywhere. The GcInfo structure can optionally store it in some cases, but for the majority of uses it's not present at the moment. We would likely need to store it as an extra byte in the LSDA structure.
It's not obvious whether using the compact unwinding would be a clear win. It adds code complexity that is specific to a single platform. I don't have any numbers at the moment to show how much space could be saved by the compact encoding in comparison to the current DWARF CFI encoding.