android / ndk

The Android Native Development Kit
1.99k stars 257 forks source link

[BUG] Exceptions in arm64-v8a are slower than in armeabi-v7a #1062

Open Trivias opened 5 years ago

Trivias commented 5 years ago

Description

Exceptions on arm64-v8a are much slower than on armeabi-v7a. Profiling shows that most of the work is done in dl_iterate_phdr callback. Severity of this effect is proportional to the number of objects (and, possibly, their size) loaded by dynamic linker. Attached archive contains a very synthetic example that creates a hundred libraries and measures time of exception handling in an executable that depends on all of them. I received following results when running this example on the phone:

sizeof(void*) = 4
Caught 10000 exceptions in 879 ms
sizeof(void*) = 8
Caught 10000 exceptions in 7201 ms

Please set NDK and HOST_TAG environment variables before running the test.

Exceptions in a native code run from an actual Java application for Android is about ten times slower in arm64 version of the library than in armv7.

extest.tar.gz

Environment Details

rprichard commented 5 years ago

@cferris1000

I haven't benchmarked it, but it makes sense that non-arm32 unwinding would scale poorly with the number of libraries, relative to arm32. I think the unwinder needs to find the unwind info for a PC (PT_GNU_EH_FRAME or PT_ARM_EXIDX segment?), and it needs to query the dynamic linker for that info.

libgcc uses an 8-element LRU cache for this PC->eh_frame lookup. (It moves matches to the front of a linked list.) However, the cache is disabled on Android because it relies on extra glibc fields in the dl_phdr_info struct to invalidate the cache after a library is loaded or unloaded. If exceptions are routinely thrown through more than 8 libraries, I would guess the cache is effectively disabled.

Possible fixes (for new versions of Android):

I wondered if the unwinder could use dladdr instead, but dladdr scans dynsym for a matching symbol, so it's also slow.

rprichard commented 5 years ago

Internal issues for the new Android unwinder -- http://b/65682711, http://b/66918884.

rprichard commented 5 years ago

If exceptions are routinely thrown through more than 8 libraries, I would guess the cache is effectively disabled.

I suspect this overstates it -- a call stack might go through more than 8 libraries, but if there are multiple call stack entries in a single DSO, I'd guess the cache is still useful.

I uploaded a WIP patch adding the extra dl_iterate_phdr fields, https://android-review.googlesource.com/c/platform/bionic/+/1104597. It improves the arm64 run-time on the extest from ~7000-7400ms to ~940ms.

Trivias commented 5 years ago

Am I right in assuming that this patch fixes the issue, but is applied to Android platform itself and not to the NDK, so this behavior will only be fixed in Android 11?

While this itself is great, I think that there is a way to somewhat alleviate this issue for the users today by applying a patch to gcc unwinder.

I tried to rebuild armv8 version of libc++_shared.so with llvm/libunwind (which is used in armeabi-v7a version by default) and discovered that it is about 5 times faster in the exception-heavy places than libgcc unwinder version. Further debugging showed that llvm/libunwind dl_iterate_phdr callback contains an if statement, that rejects most of the loaded objects in the beginning of this callback here.

I do not see such behavior in libgcc _Unwind_IteratePhdrCallback function. Do you think it's possible to port such optimization to libgcc in NDK and if it is where should I submit a patch?

rprichard commented 5 years ago

Am I right in assuming that this patch fixes the issue, but is applied to Android platform itself and not to the NDK, so this behavior will only be fixed in Android 11?

Yeah, the patch I submitted would only help with Android 11, so it'd be nice to do something for existing OSs.

I do not see such behavior in libgcc _Unwind_IteratePhdrCallback function. Do you think it's possible to port such optimization to libgcc in NDK and if it is where should I submit a patch?

It looks possible. I think I can upload a patch for it.

rprichard commented 5 years ago

I uploaded a libgcc CL to https://android-review.googlesource.com/c/toolchain/gcc/+/1130238.

This CL didn't seem to help by 5x, and it made the EH performance much more variable. e.g. Instead of 6300-6700ms, I see results from 2500-5200ms. I suspect the variability comes from the varying addresses of loaded libraries. The optimization might fire on all the loaded libraries, or none of them.

I haven't tried using llvm/libunwind with arm64.

rprichard commented 5 years ago

I tried LLVM's unwinder, using the version in the NDK that's already used for arm32. On extest, I saw runtimes varying between 3700-6800ms. It doesn't have a cache at all, so it doesn't benefit from the new adds/subs fields in new versions of Android. I didn't see a 5x improvement. The upstream version doesn't have a cache either.

I wonder if I can add a cache where, on a hit, we scan dl_iterate_phdr very quickly looking for the hit.

rprichard commented 5 years ago

I wonder if I can add a cache where, on a hit, we scan dl_iterate_phdr very quickly looking for the hit.

It seems to work. I see reliable extest run-times of ~1500ms on Q and about ~900ms on master. https://android-review.googlesource.com/c/toolchain/gcc/+/1133815/1

Trivias commented 4 years ago

It seems to work. I see reliable extest run-times of ~1500ms on Q and about ~900ms on master. https://android-review.googlesource.com/c/toolchain/gcc/+/1133815/1

We tried to build our application with libc++_shared.so patched with these changes and also got significant performance improvement in exception-heavy places. But I see that code review is more than 9 months old and still open. Is there a chance that these changes will be included in the official NDK build in the future?

DanAlbert commented 4 years ago

The first release that it could be shipped in (r23) is also the release where we plan to stop using that unwinder, so getting that change submitted won't buy us anything. I'll triage this to r23 though, so if we end up not finishing that migration we can revisit it.

DanAlbert commented 4 years ago

Spoke with @rprichard to clarify and the LLVM unwinder will need a similar improvement. LLVM's unwinder should be quite fast on the newer Android releases that include the libdl improvements mentioned above, but it'll need the same caching improvements to improve performance on older devices.