Open filipnavara opened 1 year ago
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas See info in area-owners.md if you want to be subscribed.
Author: | filipnavara |
---|---|
Assignees: | - |
Labels: | `os-mac-os-x`, `area-NativeAOT-coreclr` |
Milestone: | - |
I guess we have three options how to deal with it:
Another option is to compensate for this issue in the libunwind implementation. We can loop over all candidates that match the hint.
> Another option is to compensate for this issue in the libunwind implementation. We can loop over all candidates that match the hint.

I don't think that's possible. The offset points into the middle of a stream, so it's essentially decoding garbage. Sometimes the garbage can make sense, sometimes not, but it's not easy to tell whether it's a false hit.
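For reference, enumerating the candidates is the easy part. The following is a minimal sketch, assuming the hint is simply the low 24 bits of the real `__eh_frame` offset; `candidate_offsets` is a hypothetical helper, not a libunwind function. It does nothing about the hard part raised above, which is telling a real record start from garbage that happens to decode.

```python
# Hypothetical helper for illustration only; not part of libunwind.
HINT_BITS = 24
HINT_SPAN = 1 << HINT_BITS  # 0x1000000 bytes, i.e. 16 MiB


def candidate_offsets(truncated_hint, eh_frame_size):
    """Return every section offset whose low 24 bits equal the hint."""
    return list(range(truncated_hint, eh_frame_size, HINT_SPAN))


# Hint from the OP stack trace and the __eh_frame size quoted in this thread.
candidates = candidate_offsets(0xCFD80, 0x1BC0E28)
print([hex(c) for c in candidates])  # ['0xcfd80', '0x10cfd80']
```

With the sizes from this thread there are only two candidates per hint, but each one still has to be validated somehow before it can be trusted.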
Can we build our own hint table from the broken hint table? Something like:
If the dwarf stream is more than 16MB:
This would fix our unwinder, but it would not fix other unwinders. For example, I would expect C++ EH to be still broken.
We can probably build our own hint table from scratch during compilation. It needs to cover only the "managed code" section and hence doesn't need much linker input, as long as the DWARF section is preserved in one piece (I think it is).
Reconstructing the hints at runtime from the linker output may be possible but at that point you get a penalty similar to not using the hints at all and just creating the cache by sequentially reading the DWARF section. That's incredibly slow even on small executables though.
> This would fix our unwinder, but it would not fix other unwinders. For example, I would expect C++ EH to be still broken.
That's a fair point. I didn't consider other unwinders. If we want other unwinders to work then we basically have to either 1) generate compact unwinding codes where possible (needs codegen changes), 2) implement "compression" for DWARF by identifying common prolog sequences and sharing their code (possible in the DWARF format but difficult to implement), or 3) fix the linker output to not use the hints when they overflow (the slow lookup is too slow for NativeAOT purposes though). I didn't check what the new Apple linker (ld-prime) produces in this case.
> Reconstructing the hints at runtime from the linker output may be possible but at that point you get a penalty similar to not using the hints
The hint table is not big and the reconstructed hint table can be cached. I think the penalty would be fairly small.
> The hint table is not big and the reconstructed hint table can be cached.
In the executable from OP the size of `__unwind_info` (compact unwinding table) is 0x89beb8 bytes. The size of `__eh_frame` is 0x1bc0e28 bytes. So, 9 MB for the hint table may not seem like much but it's definitely going to be noticeable. If it was done lazily then you risk running it during thread hijacking on GC suspend. That would almost certainly take long enough to cause live locks when the threads get hammered with the "suspend all thread hijack" logic.
On second thought, I don't think it's even reliably possible to reconstruct the DWARF offsets solely from `__unwind_info`. It's sorted by function address, and hence it's not guaranteed that the DWARF offsets are in order. I suspect that in this case they would be, but it feels fragile.
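To illustrate why the ordering matters, here is a sketch of one way such a reconstruction could work. This is an assumption about the approach, not an algorithm given in this thread: walk the entries in table order and add 16 MiB every time the truncated hint goes backwards. `reconstruct_offsets` is a hypothetical name.

```python
HINT_SPAN = 1 << 24  # a 24-bit hint wraps every 16 MiB
HINT_MASK = HINT_SPAN - 1


def reconstruct_offsets(truncated_hints):
    """Undo 24-bit truncation, ASSUMING the true offsets never decrease
    and consecutive entries are less than 16 MiB apart."""
    full, base, prev = [], 0, None
    for hint in truncated_hints:
        if prev is not None and hint < prev:
            base += HINT_SPAN  # treat a backwards step as a wrap-around
        full.append(base + hint)
        prev = hint
    return full


# Offsets that are in order are recovered correctly.
in_order = [0x000100, 0xFFFF00, 0x1000200, 0x10CFD80]
recovered = reconstruct_offsets([o & HINT_MASK for o in in_order])
assert recovered == in_order

# A single out-of-order entry is silently reconstructed to the wrong place.
out_of_order = [0x000100, 0x1000200, 0xFFFF00]
broken = reconstruct_offsets([o & HINT_MASK for o in out_of_order])
assert broken != out_of_order
```

The second case is the fragility in question: nothing in the truncated data signals that the result is wrong.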
There’s potentially an easy win in terms of the ARM64 DWARF size with folding the extremely common sequences into a DWARF CIE and referencing that. That’s a variation of the “DWARF compression” strategy mentioned earlier, just restricted to specific known sequences.
For example, the prolog for a frame with no callee-saved registers (except LR and FP) is this:
DW_CFA_advance_loc: 4
DW_CFA_def_cfa_offset: +16
DW_CFA_offset: W29 -16
DW_CFA_offset: W30 -8
DW_CFA_advance_loc: 4
DW_CFA_def_cfa_register: W29
DW_CFA_nop:
DW_CFA_nop:
DW_CFA_nop:
DW_CFA_nop:
DW_CFA_nop:
The code looks like this:
stp x29, x30, [sp, #-0x10]!
mov x29, sp
…
ldp x29, x30, [sp], #0x10
ret
This repeats 20000+ times in the OP executable. Besides being foldable in the DWARF codes it’s also likely expressible as compact unwind code with no codegen changes. We would still need to implement special prolog treatment for asynchronous unwinding with the compact unwind codes though, so the DWARF way could be easier (and benefit other platforms too).
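To make the repeated sequence concrete, here is a sketch that encodes the CFA instructions listed above into their byte form. The opcode values come from the DWARF call frame instruction encoding. The code alignment factor of 1 and data alignment factor of -8 are assumptions, since the CIE parameters of the actual NativeAOT output are not shown in this thread, and the helper names are made up for the example.

```python
def uleb128(value):
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


# ASSUMED CIE parameters; the real values are not given in the thread.
CODE_ALIGN = 1
DATA_ALIGN = -8


def advance_loc(delta):
    # DW_CFA_advance_loc: 0x40 | factored delta (must fit in 6 bits)
    return bytes([0x40 | (delta // CODE_ALIGN)])


def def_cfa_offset(offset):
    # DW_CFA_def_cfa_offset: opcode 0x0e, ULEB128 offset
    return b"\x0e" + uleb128(offset)


def cfa_offset(reg, offset):
    # DW_CFA_offset: 0x80 | register, ULEB128 factored offset
    return bytes([0x80 | reg]) + uleb128(offset // DATA_ALIGN)


def def_cfa_register(reg):
    # DW_CFA_def_cfa_register: opcode 0x0d, ULEB128 register
    return b"\x0d" + uleb128(reg)


empty_frame = b"".join([
    advance_loc(4),
    def_cfa_offset(16),
    cfa_offset(29, -16),  # FP (x29)
    cfa_offset(30, -8),   # LR (x30)
    advance_loc(4),
    def_cfa_register(29),
])
print(empty_frame.hex(" "), len(empty_frame))
```

Under these assumptions the six instructions take 10 bytes before the `DW_CFA_nop` padding, and every function with this prolog carries an identical copy.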
I tried to replace the empty frame DWARF sequence with compact unwinding and it saves 32% of the DWARF section size for this particular executable. Similar savings are present for an empty iOS app created from the template (`dotnet new ios`). It's not enough to push the DWARF size below the problematic size but it's significant enough that it may be an option worth exploring.
On macOS the unwind information is stored as both the compact unwinding encoding and the DWARF EH encoding. The compact unwinding serves as a lookup table to the DWARF section (if the whole unwinding cannot be expressed using compact code, which NativeAOT doesn't currently produce). The "hint offset" into the DWARF table is 24-bit on both ARM64 and x64. Turns out, if the offset doesn't fit into 24 bits, it gets silently truncated and results in incorrect pointers into the DWARF section. This in turn results in unwinding not working properly and the app freezing due to a live lock between a stuck `FindMethodInfo` and GC suspensions.

Example stack trace:
The `fdeSectionOffsetHint=851328` is `0xCFD80`. The DWARF dump is a bit too big to upload, but 0xCFD80 points into the middle of a record. There is, however, a start of a record at 0x10CFD80, and it matches the PC 0x10338DE20 from the stack trace:
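As a quick arithmetic check of the numbers above, the following sketch packs and unpacks a DWARF-mode compact unwind entry. The constant names and values are as I recall them from Apple's `compact_unwind_encoding.h`; treat them as an assumption. The simple masking in `encode_dwarf_entry` is a guess at the effective behavior, not the linker's actual code.

```python
# Constants as recalled from compact_unwind_encoding.h (assumption).
UNWIND_ARM64_MODE_DWARF = 0x03000000
UNWIND_ARM64_DWARF_SECTION_OFFSET = 0x00FFFFFF  # 24-bit hint field


def encode_dwarf_entry(fde_offset):
    """Pack an entry without an overflow check (assumed behavior):
    any offset bits above bit 23 are silently dropped."""
    return UNWIND_ARM64_MODE_DWARF | (fde_offset & UNWIND_ARM64_DWARF_SECTION_OFFSET)


def decode_hint(encoding):
    """Extract the FDE section offset hint from an encoded entry."""
    return encoding & UNWIND_ARM64_DWARF_SECTION_OFFSET


real_offset = 0x10CFD80  # start of the record that matches the PC
hint = decode_hint(encode_dwarf_entry(real_offset))
print(hint, hex(hint))  # 851328 0xcfd80
print(hex(real_offset - hint))  # 0x1000000, exactly 1 << 24
```

The difference between the real record start and the hint is exactly 2^24, which is consistent with the top bits being dropped.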