getsentry / sentry-java

A Sentry SDK for Java, Android and other JVM languages.
https://docs.sentry.io/
MIT License

SIGABRT crash #3631

Closed stefanosiano closed 1 month ago

stefanosiano commented 1 month ago

Description

When I try to run our sample app on my physical device (Xiaomi), it crashes with this message:

Status.cpp:143] Failed HIDL return status not checked. Usually this happens because of a transport error (error parceling, binder driver, or from unparceling). If you see this in code calling into "Bn" classes in for a HAL server process, then it is likely that the code there is returning transport errors there (as opposed to errors defined within its protocol). Error is: Status(EX_TRANSACTION_FAILED): 'DEAD_OBJECT: '

Crashes are captured by our SDK, so we can see them here.

If I remove the NDK integration, no crashes occur. This happens only on my device, so perhaps it is specific to Xiaomi?

markushi commented 1 month ago

Android OS Version: 12

markushi commented 1 month ago

@stefanosiano maybe as a first step enable the NDK, but disable the NDK scope sync.
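
For reference, a minimal sketch of what that setup could look like with manual initialization. It assumes auto-init is disabled and the DSN is supplied via the manifest; `SampleApplication` is a hypothetical class name used only for illustration.

```java
import android.app.Application;
import io.sentry.android.core.SentryAndroid;

// Hypothetical Application class, for illustration only.
public class SampleApplication extends Application {
    @Override
    public void onCreate() {
        super.onCreate();
        SentryAndroid.init(this, options -> {
            // Keep the NDK integration enabled so native crashes are still captured.
            options.setEnableNdk(true);
            // Stop syncing the Java scope into the native layer.
            options.setEnableScopeSync(false);
        });
    }
}
```

If the crash still reproduces with this configuration, scope sync is unlikely to be involved.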

stefanosiano commented 1 month ago

@stefanosiano maybe as a first step enable the NDK, but disable the NDK scope sync.

Already did it, and the crash is still happening

supervacuus commented 1 month ago

I'm not sure what's happening here. The messages indicate that a binder transaction failed, but we are not interacting with/via binder with any system service. I could imagine that H2BGraphicBufferProducer actually aborts due to a failed assertion, and our handler picks up the SIGABRT, but maybe we shouldn't. In any case, I would have to dive in a bit more to give a sensible assessment.

markushi commented 1 month ago

@supervacuus here's the full stacktrace we see on sentry.io

OS Version: Android 12 (SKQ1.211006.001 test-keys)
Report Version: 104

Exception Type: Unknown (SIGABRT)

Application Specific Information:
Abort

Thread 0 Crashed:
0   libc.so                         0x7807709a28        abort
1   libart.so                       0x777e6f9fa8        art::Runtime::Abort
2   libbase.so                      0x7818dd8ea8        <unknown> + 515813248680
3   libbase.so                      0x780b7d7184        android::base::LogMessage::~LogMessage
4   libhidlbase.so                  0x7809894670        android::hardware::details::return_status::assertOk
5   libhidlbase.so                  0x78098946b8        android::hardware::details::return_status::~return_status
6   libgui.so                       0x780c9c77e0        android::hardware::graphics::bufferqueue::V1_0::utils::H2BGraphicBufferProducer::getFrameTimestamps
7   libgui.so                       0x780c9aabac        android::Surface::enableFrameTimestamps
8   libgui.so                       0x780c9af84c        android::Surface::perform
9   libgui.so                       0x780c9ab6e8        android::Surface::performInternal
10  libhwui.so                      0x780c119680        <unknown> + 515598554752
11  libgui.so                       0x780c9aa290        android::Surface::hook_perform
12  libhwui.so                      0x780c307e9c        <unknown> + 515600580252
13  libhwui.so                      0x780c1f8db0        <unknown> + 515599470000
14  libhwui.so                      0x780c0134bc        <unknown> + 515597481148
15  libhwui.so                      0x780c1d5800        <unknown> + 515599325184
16  libhwui.so                      0x780c1d5560        <unknown> + 515599324512
17  libutils.so                     0x781ca2d58c        android::Thread::_threadLoop
18  libutils.so                     0x781ca2cde8        <unknown> + 515876507112
19  libc.so                         0x780776eb14        <unknown> + 515521309460
20  libc.so                         0x780770b35c        <unknown> + 515520901980

What's good about this issue is that we can reproduce it locally in our debug builds. Is there anything @stefanosiano could try out?

stefanosiano commented 1 month ago

@supervacuus I tried the sample app of the sentry-native repo and it's working fine, no crashes here

supervacuus commented 1 month ago

@supervacuus I tried the sample app of the sentry-native repo and it's working fine, no crashes here

Do you mean the NDK sample inside the Native SDK repo? That should initialize the native library the same way as in the sentry-android sample. Maybe the latter uses some feature that interacts differently with the platform code on the Xiaomi device. For this, it would be important to understand at which point this abort() is provoked and whether it always happens at some point in the execution.

What's good about this issue is that we can reproduce it locally in our debug builds. Is there anything @stefanosiano could try out?

I would be very interested in seeing the tombstone for the crash. As is often the case on Android, the stack trace of the crashing thread is not very useful. In this case, it is platform code that crashes, but for us, the only interesting thing is what we are doing during or before the abort. This may be visible in one of the other thread stack traces.

supervacuus commented 1 month ago

This also made me wonder whether system tracing is somehow enabled in the sentry-android sample app you are running. Something like frame-duration tracing could lead to failed DEAD_OBJECT binder transactions in the context of buffer-queue producers in pathological cases.

supervacuus commented 1 month ago

This also made me wonder whether system tracing is somehow enabled in the sentry-android sample app you are running. Something like frame-duration tracing could lead to failed DEAD_OBJECT binder transactions in the context of buffer-queue producers in pathological cases.

Another thing I could imagine, to get to the root of the difference in behavior between the samples, is whether the sample does any high-pressure graphics-buffer work (like recurring screenshots via PixelCopy, which is synchronized) that could put load on the buffer-queue/binder pipeline.

stefanosiano commented 1 month ago

@supervacuus Thanks for the info. I made some other tests, and found that it happens only when session replay is enabled (along with tracing, profiling and everything else) - which takes screenshots every second, using bitmap I think. What I don't understand is why it happens only when sentry-native is also enabled. If I use sentry-android-core - that doesn't include sentry-native - everything works fine. Here are the tombstones with adb bugreport, if these help in any way

bugreport-lmi_eea_poco-SKQ1.211006.001-2024-08-06-10-51-50.zip

supervacuus commented 1 month ago

I made some other tests, and found that it happens only when session replay is enabled (along with tracing, profiling and everything else) - which takes screenshots every second, using bitmap I think.

That would at least mean that there is some obvious difference in bufferqueue/binder usage between the two samples. Isolating this is definitely helpful even if it might not be the culprit.

Here are the tombstones with adb bugreport, if these help in any way

bugreport-lmi_eea_poco-SKQ1.211006.001-2024-08-06-10-51-50.zip

This is super interesting: every time your app crashes in the RenderThread, another tombstone is produced immediately before it for media.codec hw/android.hardware.media.omx@1.0-service, which crashes with a null dereference in its HwBinder thread. That thread sits on the opposite side of the buffer producer and is also in the getFrameTimestamps() path. This process is the DEAD_OBJECT that the binder transaction exception refers to:

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Build fingerprint: 'google/bonito/bonito:10/QQ3A.200805.001/6578210:user/release-keys'
Revision: '0'
ABI: 'arm'
Timestamp: 2024-08-06 10:15:54.152234376+0200
Process uptime: 0s
Cmdline: media.codec hw/android.hardware.media.omx@1.0-service
pid: 32210, tid: 32339, name: HwBinder:32210_  >>> media.codec <<<
uid: 1046
signal 0 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr --------
Cause: null pointer dereference
    r0  00000048  r1  e5d52c08  r2  e5d52c18  r3  000039f3
    r4  00000000  r5  00007dd2  r6  00000000  r7  e5d52c18
    r8  e8ebe274  r9  e5d52c40  r10 e5d52cb0  r11 e5d53060
    ip  e7cf935c  sp  e5d52ba0  lr  e7ca8ca9  pc  e82e6212

backtrace:
      #00 pc 0001f212  /apex/com.android.vndk.v30/lib/libui.so (android::FenceTime::Snapshot::getFlattenedSize() const+2) (BuildId: 64c4a3578c87c779e3001f245b9d7a5d)
      #01 pc 00052ca5  /apex/com.android.vndk.v30/lib/libgui.so (android::FrameEventHistoryDelta::getFlattenedSize() const+16) (BuildId: d0ff69429501def0367515e2f266214b)
      #02 pc 00055405  /apex/com.android.vndk.v30/lib/libgui.so (android::conversion::wrapAs(android::hardware::graphics::bufferqueue::V1_0::IGraphicBufferProducer::FrameEventHistoryDelta*, std::__1::vector<std::__1::vector<native_handle*, std::__1::allocator<native_handle*> >, std::__1::allocator<std::__1::vector<native_handle*, std::__1::allocator<native_handle*> > > >*, android::FrameEventHistoryDelta const&)+28) (BuildId: d0ff69429501def0367515e2f266214b)
      #03 pc 0000ce11  /apex/com.android.vndk.v30/lib/libstagefright_bufferqueue_helper.so (android::TWGraphicBufferProducer<android::hardware::graphics::bufferqueue::V1_0::IGraphicBufferProducer, void>::getFrameTimestamps(std::__1::function<void (android::hardware::graphics::bufferqueue::V1_0::IGraphicBufferProducer::FrameEventHistoryDelta const&)>)+128) (BuildId: bcf1848f23152cf7b88c111d515b1f94)
      #04 pc 000172ad  /apex/com.android.vndk.v30/lib/android.hardware.graphics.bufferqueue@1.0.so (android::hardware::graphics::bufferqueue::V1_0::BnHwGraphicBufferProducer::_hidl_getFrameTimestamps(android::hidl::base::V1_0::BnHwBase*, android::hardware::Parcel const&, android::hardware::Parcel*, std::__1::function<void (android::hardware::Parcel&)>)+140) (BuildId: 5e2b8d3c591a311d2f0287a55ee1f2a4)
      #05 pc 00017a8f  /apex/com.android.vndk.v30/lib/android.hardware.graphics.bufferqueue@1.0.so (android::hardware::graphics::bufferqueue::V1_0::BnHwGraphicBufferProducer::onTransact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+1462) (BuildId: 5e2b8d3c591a311d2f0287a55ee1f2a4)
      #06 pc 0005d01d  /apex/com.android.vndk.v30/lib/libhidlbase.so (android::hardware::BHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+48) (BuildId: 3445ac5422fc45c8888fee3c33523157)
      #07 pc 0005faaf  /apex/com.android.vndk.v30/lib/libhidlbase.so (android::hardware::IPCThreadState::getAndExecuteCommand()+966) (BuildId: 3445ac5422fc45c8888fee3c33523157)
      #08 pc 0006088d  /apex/com.android.vndk.v30/lib/libhidlbase.so (android::hardware::IPCThreadState::joinThreadPool(bool)+56) (BuildId: 3445ac5422fc45c8888fee3c33523157)
      #09 pc 0006c5fd  /apex/com.android.vndk.v30/lib/libhidlbase.so (android::hardware::PoolThread::threadLoop()+12) (BuildId: 3445ac5422fc45c8888fee3c33523157)
      #10 pc 0000eed9  /apex/com.android.vndk.v30/lib/libutils.so (android::Thread::_threadLoop(void*)+168) (BuildId: 373fcfc8fb18977f88e89ad09552a738)
      #11 pc 0000ea15  /apex/com.android.vndk.v30/lib/libutils.so (thread_data_t::trampoline(thread_data_t const*)+256) (BuildId: 373fcfc8fb18977f88e89ad09552a738)
      #12 pc 000a8e57  /apex/com.android.runtime/lib/bionic/libc.so (__pthread_start(void*)+40) (BuildId: 14ccc210ec59d35990c4377f0f48f77e)
      #13 pc 00061dd3  /apex/com.android.runtime/lib/bionic/libc.so (__start_thread+30) (BuildId: 14ccc210ec59d35990c4377f0f48f77e)
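
To check this ordering across all tombstones in a bug report, something like the following rough sketch could help. It reads only the `Timestamp:` and `Cmdline:` header lines and assumes the timestamp format shown in the tombstone above. `TombstoneCorrelation`, its methods, and the time window are hypothetical and not part of any Sentry or Android tooling.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class TombstoneCorrelation {
    // Matches tombstone timestamps such as "2024-08-06 10:15:54.152234376+0200".
    private static final DateTimeFormatter TIMESTAMP_FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSSSSSSSSxx");
    private static final Pattern TIMESTAMP_LINE = Pattern.compile("(?m)^Timestamp: (.+)$");
    private static final Pattern CMDLINE_LINE = Pattern.compile("(?m)^Cmdline: (.+)$");

    // The two header fields needed to order crashes and tell processes apart.
    static final class Header {
        final OffsetDateTime timestamp;
        final String cmdline;

        Header(OffsetDateTime timestamp, String cmdline) {
            this.timestamp = timestamp;
            this.cmdline = cmdline;
        }
    }

    // Extracts the first Timestamp and Cmdline lines from a tombstone's text.
    static Header parseHeader(String tombstone) {
        Matcher ts = TIMESTAMP_LINE.matcher(tombstone);
        Matcher cmd = CMDLINE_LINE.matcher(tombstone);
        if (!ts.find() || !cmd.find()) {
            throw new IllegalArgumentException("missing Timestamp or Cmdline line");
        }
        return new Header(
                OffsetDateTime.parse(ts.group(1).trim(), TIMESTAMP_FORMAT),
                cmd.group(1).trim());
    }

    // True if some tombstone whose cmdline starts with cmdlinePrefix was written
    // at or before the given crash, and no more than `window` earlier.
    static boolean precededBy(
            Header crash, List<Header> others, String cmdlinePrefix, Duration window) {
        for (Header other : others) {
            if (!other.cmdline.startsWith(cmdlinePrefix)) {
                continue;
            }
            Duration gap = Duration.between(other.timestamp, crash.timestamp);
            if (!gap.isNegative() && gap.compareTo(window) <= 0) {
                return true;
            }
        }
        return false;
    }
}
```

Calling `precededBy(appCrash, allHeaders, "media.codec", Duration.ofSeconds(2))` for each app tombstone would show whether the pattern holds every time. The two-second window is an arbitrary assumption.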

What I don't understand is why it happens only when sentry-native is also enabled. If I use sentry-android-core - that doesn't include sentry-native - everything works fine.

I can't say at this point. It seems it is reporting an honest crash, and that crash is caused outside the sample process (but initiated from your RenderThread), so I am not really sure how the Native SDK could affect it.

When exactly does the crash happen? Right at the start? After some time? Do you interact with the app in some way? When you don't include the Native SDK, do you see anything in the logcat message that hints at the omx media service?

Another thing I find curious is why that service is contacted in the first place. Does the sample have any media playback or recording in use? Are the session replay frames encoded at some point using OpenMax?

stefanosiano commented 1 month ago

When exactly does the crash happen?

Between 1 and 5 seconds after the start

Do you interact with the app in some way?

Nope, I don't do anything at all

do you see anything in the logcat message that hints at the omx media service?

I don't think so, but here is the full logcat (copied without any filter) logcat.txt

does the sample have any media playback or recording in use?

The only thing that comes to my mind is session replay

Are the session replay frames encoded at some point using OpenMax?

There is no reference to OpenMax in the code. Also, the replays don't use native code at all, only system APIs

But if the crashing process is hw/android.hardware.media.omx@1.0-service, we shouldn't crash, right? Or am I missing something here?

stefanosiano commented 1 month ago

There was some confusion here on my side, sorry about it. Besides sentry-native, sentry-android also pulls in sentry-android-replay. This means that when I removed sentry-android in favor of sentry-android-core, I also removed session replay. Trying again with sentry-android-core (without native) and sentry-android-replay makes it crash again. It is probably an encoder issue on the replay side. Moving this issue away from sentry-native.