android / ndk

The Android Native Development Kit
1.99k stars 257 forks source link

debug rcdailey's zlib unwind issues #457

Closed DanAlbert closed 7 years ago

DanAlbert commented 7 years ago

@rcdailey

Forking from https://github.com/android-ndk/ndk/issues/230#issuecomment-315478458

Not entirely clear what's going on just yet. It certainly looks like the unwinders are crossing the streams, but that shouldn't be happening with anything built with a modern NDK.

Could you check and make sure that all the _Unwind_* symbols in your libraries are hidden?

DanAlbert commented 7 years ago

No worries, wasn't taken as such :) I'm well aware that ours isn't ideal right now. I just wanted to point out that although ours should be cleaned up, ours seems to work, whereas the clean implementation does not. If you want your build to work again ASAP, ours might be the better choice.

Brad has constantly told me that the design intent for toolchain files in CMake has been to be very simple things that do not do any system introspection.

Yeah, we've had this conversation with him too. He's got us pointed in the right direction, it's just going to take some time to get it done.

Essentially what we'll have when this is done is the built-in CMake pieces will code to hook into CMake modules that are shipped in the NDK. That keeps CMake out of the business of tracking a dozen variations for each NDK version and avoids the issue of CMake not working for NDK rN+1 until the next version of CMake is available.

It looks like we don't have a bug for this filed right now. I'll track down the WIP commits and file one. I'll get you, Brad, and the Studio engineer that was working on this CC'd.

rcdailey commented 7 years ago

That's great news Dan. I'm glad that it's being worked on behind the scenes. Happy to help where I can.

For now I'll try to adopt the built in toolchain file. This seems to be the most scalable solution right now since as I upgrade the NDK I won't have to constantly mess with my own android-specific configuration. If nothing else I want to do it to see if it fixes the problems I'm seeing.

I'll let you know what happens.

rcdailey commented 7 years ago

Using your toolchain file, this is what I get:

$ "E:\android\android-ndk-r15b\toolchains\arm-linux-androideabi-4.9\prebuilt\windows-x86_64\bin\arm-linux-androideabi-readelf.exe" -sW bin\libzApp.so | grep _Unwind
 48999: 00000000     0 FUNC    GLOBAL DEFAULT  UND __gnu_Unwind_Find_exidx
156348: 010c4521    92 FUNC    LOCAL  DEFAULT   11 _ZN12_GLOBAL__N_1L14unwindOneFrameEjP21_Unwind_Control_BlockP15_Unwind_Context
156358: 010c463d   296 FUNC    LOCAL  DEFAULT   11 _ZL13unwind_phase2P13unw_context_tP21_Unwind_Control_Blockb
156666: 010c3f4d   700 FUNC    LOCAL  HIDDEN    11 _Unwind_VRS_Interpret
156667: 010c43b1   360 FUNC    LOCAL  HIDDEN    11 _Unwind_VRS_Pop
156669: 010c4209   208 FUNC    LOCAL  HIDDEN    11 _Unwind_VRS_Get
156670: 010c42d9   216 FUNC    LOCAL  HIDDEN    11 _Unwind_VRS_Set
156672: 010c458d   176 FUNC    LOCAL  HIDDEN    11 _Unwind_RaiseException
156673: 010c4765     2 FUNC    LOCAL  HIDDEN    11 _Unwind_Complete
156674: 010c4769   164 FUNC    LOCAL  HIDDEN    11 _Unwind_Resume
156675: 010c480d    64 FUNC    LOCAL  HIDDEN    11 _Unwind_GetLanguageSpecificData
156676: 010c484d    64 FUNC    LOCAL  HIDDEN    11 _Unwind_GetRegionStart
156677: 010c488d    12 FUNC    LOCAL  HIDDEN    11 _Unwind_DeleteException
205802: 00000000     0 FUNC    GLOBAL DEFAULT  UND __gnu_Unwind_Find_exidx

Looking better (?) already... still need to run it on device.

DanAlbert commented 7 years ago

Yep! That's the expected output. The __gnu_Unwind_Find_exidx one is actually provided by libc, so that one should be undefined, unlike the others.

rcdailey commented 7 years ago

Looks like the bundled toolchain file solves all my problems. Sorry I've been hard headed about this; I know you told me it would a while back. Somehow we need to communicate in the CMake documentation that the built-in support isn't ready to use yet. It has numerous problems the minute you decide to use LLVM STL.

Now my only gripe is that backtraces in tombstones end where abort() is invoked by standard assert()... I was really hoping switching away from GNU STL would fix this, but doesn't seem so. Anyway that's unrelated to this issue and I don't want to start down a tangent...

Thanks for everything guys!

enh commented 7 years ago

(if you have problems unwinding on a modern release -- there's not much we can do about JellyBean -- please file a bug so our unwinding expert can take a look.)

rcdailey commented 7 years ago

I have one last question. Regarding the unwinding problems, I'm still seeing issues with short backtraces in tombstones. This makes it very difficult to diagnose segfaults. This makes it hard to tell if this is a problem with the NDK, my code, the way I'm building librarires, or an issue with Android OS itself. Is there a way I can get in touch with some android OS developers (this "unwinding expert" maybe) to ask further about this issue? @enh mentioned filing a bug, but where would I do this? Also I'm not 100% sure this is a bug yet at the OS level. I just need a good next step to take. This is not something I can figure out by myself, that's for sure. Thanks for everything so far everyone.

enh commented 7 years ago

if you can create a reproduceable test case of a bad unwind (on a recent release, since there's nothing we can do about the distant past), @cferris1000 would be interested to see it...

rcdailey commented 7 years ago

Well in my case, I'm stuck on certain platforms because we manufacture and maintain our own ARM devices in-house. Those platforms being API 15, 17, and 22. If we could make custom changes to AOSP to fix the problem, since we maintain our own fork of Android itself, that's a feasible option if we can't get any fixes from Google upstream for older OS releases. If this is truly a problem with an older Android OS release, would it be feasible to get his help so we can fix it ourselves? I'm also not sure how to provide a reproducible test case, I'd have to discuss with him to know what I can do to help. Most of what I'm using is proprietary. Not sure the best way to communicate with him other than this github issue...

rcdailey commented 7 years ago

I found an internal issue from last year when I worked with our resident OS team to diagnose why back traces were not unwinding into our library code. At the time we were using GCC + GNU STL on NDK r10. Some quotes:

debugging backtraces generation I got the line that stops unwinding calls with comment:

// The first word is a place-relative pointer to a generic personality
// routine function. We don't support invoking such functions, so stop here.

A long time when I googled "personality routines", I came up with: https://issuetracker.google.com/issues/36982950 (not sure if it's relevant; this is outside of my domain of expertise)

Another member of the OS team stated:

There are a few mentions of Gabi++ not handling exceptions properly. Is this the C++ library you are using in your native code? It looks like the options for better handling would be a different C++ library port, or modifying the one you are using. http://mobilepearls.com/labs/native-android-api/ndk/docs/CPLUSPLUS-SUPPORT.html

Still a lot of uncertainty, but that's about all we were able to figure out. Maybe @cferris1000 has some insight we can use to make some changes on our side.

enh commented 7 years ago

you could rip out the unwinder and replace it with a newer one, but that would probably mean upgrading many other parts of the tree too! 15 and 17 are old enough that you're using libcorkscrew and don't have a decent STL. 22 will at least be libc++ and libunwind, but a pretty old libunwind. you might be able to backport the current version, but you should probably try running your crashing code on a new release first to see whether it's actually worthwhile. (we're actually moving on from libunwind to our own unwinder at this point, but that won't be the default until P.)

rcdailey commented 7 years ago

So if 22 is libc++, what is 15 and 17? Are you talking about libc++_shared.so in the NDK? I do link against that and ship it with my APK, even for the older devices.

enh commented 7 years ago

i meant the platform STL. 15 and 17 were still stlport.

rcdailey commented 7 years ago

Well I can still pick gnustl or c++stl with platform set to android-15. You're saying that the underlying implementation is borrowed from stlport? Meaning that libc++ was not actually used (because libc was not up to par until API 21)?

DanAlbert commented 7 years ago

No. For platform code (things in an AOSP) you're not using the NDK unless the module has LOCAL_SDK_VERSION set. In that case (for releases older than M), you don't have an STL by default. Prior to L, you could opt in to stlport (/system/lib/libstlport.so). Starting with L, you could opt in to that or libc++ (/system/lib/libc++.so). Starting with M, libc++ was the default. All of these are built as part of the system image and the binaries are not portable.

For the NDK you do not (cannot and should not) use those libraries because the ABI is not stable (/system/lib/libstlport.so doesn't even exist on a modern release), so you pack the STL with your application. This is where libc++_shared.so, libstlport_shared.so, and libgnustl_shared.so come in. You can use any of these to target any release.

rcdailey commented 7 years ago

Thanks for the information Dan & enh. I'm going to leave the backtrace issues alone for now, I'm definitely getting better backtraces on API 22 devices; so it's an OS problem and out of scope for NDK at this point I think. Thank you for helping me with those questions even though it was a distraction.

Back to NDK compatibility, I'm getting exceptions from different parts of boost that I wasn't seeing when I linked against GNU STL. Because of the lack of backtrace, I'm not able to deduce exactly why these things are causing segfaults. In one case, boost::filesystem reported a "Function not implemented" exception when using operator++ with directory_iterator:

boost::filesystem::directory_iterator::operator++: Function not implemented

The above was the string in the what() of the filesystem exception. For as difficult as boost code is to read, I was only able to determine that this seems to come from low-level code in boost used to interface with platform-specific APIs and libraries for filesystem operations:

      temp_ec = dir_itr_increment(it.m_imp->handle,
#       if defined(BOOST_POSIX_API)
        it.m_imp->buffer,
#       endif
        filename, file_stat, symlink_file_stat);

      if (temp_ec)  // happens if filesystem is corrupt, such as on a damaged optical disc
      {
        path error_path(it.m_imp->dir_entry.path().parent_path());  // fix ticket #5900
        it.m_imp.reset();
        if (ec == 0)
          BOOST_FILESYSTEM_THROW(
            filesystem_error("boost::filesystem::directory_iterator::operator++",
              error_path,
              error_code(BOOST_ERRNO, system_category())));
        ec->assign(BOOST_ERRNO, system_category());
        return;
      }

Above is from filesystem/src/operations.cpp in Boost 1.64.0.

Secondly, A really common and consistent one I'm getting as well is related to boost::lexical_cast. When I use it to cast a string to bool, I get this in my tombstone:

backtrace:
    #00  pc fffffffc  <unknown>
    #01  pc 006e0965  /data/app-lib/com.ttm.zapp-2/libzApp.so (boost::exception_detail::refcount_ptr<boost::exception_detail::error_info_container>::release()+34)
    #02  pc 006e0907  /data/app-lib/com.ttm.zapp-2/libzApp.so (boost::exception_detail::refcount_ptr<boost::exception_detail::error_info_container>::~refcount_ptr()+16)

stack:
         60155968  60155a90  
         6015596c  5e9f3ac9  /data/app-lib/com.ttm.zapp-2/libzApp.so (ErrorLog::PrintLog(Severity, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, int, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)+416)
         60155970  00000063  
         60155974  580da728  
         60155978  580da728  
         6015597c  580da728  
         60155980  601559a0  
         60155984  5e9cf93f  /data/app-lib/com.ttm.zapp-2/libzApp.so (boost::exception_detail::refcount_ptr<boost::exception_detail::error_info_container>::adopt(boost::exception_detail::error_info_container*)+38)
         60155988  5cec4700  
         6015598c  580da728  
         60155990  00000000  
         60155994  580da728  
         60155998  00000000  
         6015599c  60155a00  
         601559a0  df0027ad  
         601559a4  00000000  
    #00  601559a8  601559c0  
         ........  ........
    #01  601559a8  601559c0  
         601559ac  60155a34  
         601559b0  60155a34  
         601559b4  60155a34  
         601559b8  601559d0  
         601559bc  5e9cf90b  /data/app-lib/com.ttm.zapp-2/libzApp.so (boost::exception_detail::refcount_ptr<boost::exception_detail::error_info_container>::~refcount_ptr()+20)
    #02  601559c0  0000000c  
         601559c4  60155a34  
         601559c8  60155a34  
         601559cc  60155a34  
         601559d0  601559e8  
         601559d4  5e9cf99d  /data/app-lib/com.ttm.zapp-2/libzApp.so (boost::exception::~exception()+36)
         601559d8  60155a88  
         601559dc  60155a30  
         601559e0  60155a30  
         601559e4  60155a30  
         601559e8  60155a00  
         601559ec  5e9cf3b7  /data/app-lib/com.ttm.zapp-2/libzApp.so (boost::exception_detail::error_info_injector<boost::bad_lexical_cast>::~error_info_injector()+26)
         601559f0  60155a30  
         601559f4  60155a24  
         601559f8  60155a24  
         601559fc  60155a24  

All of these issues give me the feeling that there's something that boost or STL is using that isn't supported on this older version of Android. Maybe something related to the system libs, although after doing readelf and such, as well as having -W1,--no-undefined linker flag, I'm not dealing with any linker / missing symbols issues at the build stage. I have no idea at this point.

Note that in these failure cases, my minimum platform is android-15 and I'm running on a device running Android 17. When I take that same build (using x86 architecture instead of ARM, due to the different hardware) and run it on Android OS at API 22 (minimum still set to android-15) I do not get these errors/exceptions.

Any hints that could point me in the right direction here? To stay within scope of the conversation here, I at least want to rule out any NDK or build configuration problems (CMake bugs, NDK toolchain issues, etc). If this keeps up and there is no solution (worst case: OS is too old and we can't use LLVM STL), then I'll be forced to switch back to GNU STL for the older platforms. What's odd is the OS hasn't changed when I switch to GNU STL, that's the only reason I'm not able to feel completely confident this is an OS issue.

Again thanks for all the help so far, I really appreciate the continued support. I'm lost without you guys helping me out lol.

cferris1000 commented 7 years ago

The older devices don't support dwarf unwind information on arm, but newer devices do. So if something is using just dwarf for some functions, you would see this behavior.

rcdailey commented 7 years ago

@cferris1000 is that something configurable on older devices? Or we're stuck? My hope is that even if it's not configurable, there are maybe some light code changes to android OS we can make to improve this.

cferris1000 commented 7 years ago

I don't think there is any easy way to do it, we switched to a completely different unwinder in newer versions, and isolated all of the unwind calls through a new library. You'd need to pull in external/libunwind and system/core/libbacktrace and then modify system/core/debuggerd to use the libbacktrace code. This has been done in the new OS versions, but that code is a lot different now and requires newer features to work.

Worse yet, that code might require a newer version of clang (and might not compile with gcc), so you are fighting an uphill battle.

rcdailey commented 7 years ago

I was afraid of that... would it be fair to say then, that on older platforms, native application developers are screwed? Without on-device debugging, there's no way I've found to be able to properly diagnose segfaults due to the lack of backtraces in tombstones. I wish I could magically introduce some library in my application code to do this instead of the OS, but that doesn't seem feasible.

Thanks for the information.

cferris1000 commented 7 years ago

There is a plan to create an unwinder library in the NDK, so that app developers can use it everywhere to get good backtraces. There is no current eta on it though.

Otherwise, yes, some crashes on these older systems might not give you good backtraces.

rcdailey commented 7 years ago

Per my post earlier, I'm suspecting I may have to switch back to GNU STL. Hopefully @DanAlbert can prove me wrong, but I'm getting too many strange runtime issues that I don't know how to deal with. Optimistically, if these are NDK problems I'd like to help get them resolved however I can.

BTW, thanks @cferris1000 for the information!

webmaster128 commented 7 years ago

Maybe start with the filesystem_error you are receiving from boost? What is the root cause of it?

This is a bug in Boost: https://svn.boost.org/trac10/ticket/13172 that can be fixed with https://gist.github.com/webmaster128/5912a70d100e9ef341df67b177c465d6

cc @rcdailey

rcdailey commented 7 years ago

Excellent, thank you for finding and fixing that bug. I hit a wall and assumed it was an LLVM STL issue, so I gave up and switched back to GNU STL. I will look forward to the version of boost that includes your fix!

On Mon, Aug 21, 2017, 1:40 PM Simon Warta notifications@github.com wrote:

Maybe start with the filesystem_error you are receiving from boost? What is the root cause of it?

This is a bug in Boost: https://svn.boost.org/trac10/ticket/13172 that can be fixed with https://gist.github.com/webmaster128/5912a70d100e9ef341df67b177c465d6

cc @rcdailey https://github.com/rcdailey

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/android-ndk/ndk/issues/457#issuecomment-323819967, or mute the thread https://github.com/notifications/unsubscribe-auth/ABr6dluFNJbMVoLnezOmP-YWdcS7X8Ajks5sac8EgaJpZM4OYzR5 .