Open Trivias opened 5 years ago
@cferris1000
I haven't benchmarked it, but it makes sense that non-arm32 unwinding would scale poorly with the number of libraries, relative to arm32. I think the unwinder needs to find the unwind info for a PC (PT_GNU_EH_FRAME or PT_ARM_EXIDX segment?), and it needs to query the dynamic linker for that info.
On arm32, there is a special-purpose function (__gnu_Unwind_Find_exidx
in libc that delegates to dl_unwind_find_exidx
in libdl). This function takes a PC value and returns the location of the matching PT_ARM_EXIDX segment. The linker still just does a linear scan of all libraries, but it's in one function, find_containing_library
:
for (soinfo* si = solist_get_head(); si != nullptr; si = si->next) {
if (address < si->base || address - si->base >= si->size) {
continue;
}
... snipped -- verify that the address matches a PT_LOAD segment ...
}
Once the DSO is found, the location of the EXIDX is returned from the soinfo struct:
*pcount = si->ARM_exidx_count;
return reinterpret_cast<_Unwind_Ptr>(si->ARM_exidx);
There is no similar function for arm64, so I think the unwinder (in libgcc) instead uses dl_iterate_phdr
. dl_iterate_phdr
reports each library using dl_phdr_info
, which doesn't report a start/end address. For every loaded library, until we find the matching library:
libgcc uses an 8-element LRU cache for this PC->eh_frame lookup. (It moves matches to the front of a linked list.) However, the cache is disabled on Android because it relies on extra glibc fields in the dl_phdr_info
struct to invalidate the cache after a library is loaded or unloaded. If exceptions are routinely thrown through more than 8 libraries, I would guess the cache is effectively disabled.
Possible fixes (for new versions of Android):
dl_phdr_info
fields, unsigned long long dlpi_adds
and unsigned long long dlpi_subs
. The fields are documented as incrementing when a library is loaded or unloaded. The fields should enable the libgcc caching.dl_unwind_find_exidx
but for archs other than arm32.I wondered if the unwinder could use dladdr instead, but dladdr scans dynsym for a matching symbol, so it's also slow.
Internal issues for the new Android unwinder -- http://b/65682711, http://b/66918884.
If exceptions are routinely thrown through more than 8 libraries, I would guess the cache is effectively disabled.
I suspect this overstates it -- a call stack might go through more than 8 libraries, but if there are multiple call stack entries in a single DSO, I'd guess the cache is still useful.
I uploaded a WIP patch adding the extra dl_iterate_phdr
fields, https://android-review.googlesource.com/c/platform/bionic/+/1104597. It improves the arm64 run-time on the extest from ~7000-7400ms to ~940ms.
Am I right in assuming that this patch fixes the issue, but is applied to Android platform itself and not to the NDK, so this behavior will only be fixed in Android 11?
While this itself is great, I think that there is a way to somewhat alleviate this issue for the users today by applying a patch to gcc unwinder.
I tried to rebuild armv8 version of libc++_shared.so with llvm/libunwind (which is used in armeabi-v7a version by default) and discovered that it is about 5 times faster in the exception-heavy places than libgcc unwinder version. Further debugging showed that llvm/libunwind dl_iterate_phdr callback contains an if statement, that rejects most of the loaded objects in the beginning of this callback here.
I do not see such behavior in libgcc _Unwind_IteratePhdrCallback function. Do you think it's possible to port such optimization to libgcc in NDK and if it is where should I submit a patch?
Am I right in assuming that this patch fixes the issue, but is applied to Android platform itself and not to the NDK, so this behavior will only be fixed in Android 11?
Yeah, the patch I submitted would only help with Android 11, so it'd be nice to do something for existing OSs.
I do not see such behavior in libgcc _Unwind_IteratePhdrCallback function. Do you think it's possible to port such optimization to libgcc in NDK and if it is where should I submit a patch?
It looks possible. I think I can upload a patch for it.
I uploaded a libgcc CL to https://android-review.googlesource.com/c/toolchain/gcc/+/1130238.
This CL didn't seem to help by 5x, and it made the EH performance much more variable. e.g. Instead of 6300-6700ms, I see results from 2500-5200ms. I suspect the variability comes from the varying addresses of loaded libraries. The optimization might fire on all the loaded libraries, or none of them.
I haven't tried using llvm/libunwind with arm64.
I tried LLVM's unwinder, using the version in the NDK that's already used for arm32. On extest, I saw runtimes varying between 3700-6800ms. It doesn't have a cache at all, so it doesn't benefit from the new adds/subs fields in new versions of Android. I didn't see a 5x improvement. The upstream version doesn't have a cache either.
I wonder if I can add a cache where, on a hit, we scan dl_iterate_phdr very quickly looking for the hit.
I wonder if I can add a cache where, on a hit, we scan dl_iterate_phdr very quickly looking for the hit.
It seems to work. I see reliable extest run-times of ~1500ms on Q and about ~900ms on master. https://android-review.googlesource.com/c/toolchain/gcc/+/1133815/1
It seems to work. I see reliable extest run-times of ~1500ms on Q and about ~900ms on master. https://android-review.googlesource.com/c/toolchain/gcc/+/1133815/1
We tried to build our application with libc++_shared.so patched with these changes and also got significant performance improvement in exception-heavy places. But I see that code review is more than 9 months old and still open. Is there a chance that these changes will be included in the official NDK build in the future?
The first release that it could be shipped in (r23) is also the release where we plan to stop using that unwinder, so getting that change submitted won't buy us anything. I'll triage this to r23 though, so if we end up not finishing that migration we can revisit it.
Spoke with @rprichard to clarify and the LLVM unwinder will need a similar improvement. LLVM's unwinder should be quite fast on the newer Android releases that include the libdl improvements mentioned above, but it'll need the same caching improvements to improve performance on older devices.
Description
Exceptions on arm64-v8a are much slower than on armeabi-v7a. Profiling shows that most of the work is done in dl_iterate_phdr callback. Severity of this effect is proportional to the number of objects (and, possibly, their size) loaded by dynamic linker. Attached archive contains a very synthetic example that creates a hundred libraries and measures time of exception handling in an executable that depends on all of them. I received following results when running this example on the phone:
Please set NDK and HOST_TAG environment variables before running the test.
Exceptions in a native code run from an actual Java application for Android is about ten times slower in arm64 version of the library than in armv7.
extest.tar.gz
Environment Details