Closed stefanosiano closed 1 month ago
Android OS Version: 12
@stefanosiano maybe as a first step enable the NDK, but disable the NDK scope sync.
@stefanosiano maybe as a first step enable the NDK, but disable the NDK scope sync.
Already did it, and the crash is still happening
I'm not sure what's happening here. The messages indicate that a binder transaction failed, but we are not interacting via binder with any system service. I could imagine that H2BGraphicBufferProducer actually aborts due to a failed assertion, and our handler picks up the SIGABRT, but maybe we shouldn't. In any case, I would have to dive in a bit more to give a sensible assessment.
@supervacuus here's the full stacktrace we see on sentry.io
OS Version: Android 12 (SKQ1.211006.001 test-keys)
Report Version: 104
Exception Type: Unknown (SIGABRT)
Application Specific Information:
Abort
Thread 0 Crashed:
0 libc.so 0x7807709a28 abort
1 libart.so 0x777e6f9fa8 art::Runtime::Abort
2 libbase.so 0x7818dd8ea8 <unknown> + 515813248680
3 libbase.so 0x780b7d7184 android::base::LogMessage::~LogMessage
4 libhidlbase.so 0x7809894670 android::hardware::details::return_status::assertOk
5 libhidlbase.so 0x78098946b8 android::hardware::details::return_status::~return_status
6 libgui.so 0x780c9c77e0 android::hardware::graphics::bufferqueue::V1_0::utils::H2BGraphicBufferProducer::getFrameTimestamps
7 libgui.so 0x780c9aabac android::Surface::enableFrameTimestamps
8 libgui.so 0x780c9af84c android::Surface::perform
9 libgui.so 0x780c9ab6e8 android::Surface::performInternal
10 libhwui.so 0x780c119680 <unknown> + 515598554752
11 libgui.so 0x780c9aa290 android::Surface::hook_perform
12 libhwui.so 0x780c307e9c <unknown> + 515600580252
13 libhwui.so 0x780c1f8db0 <unknown> + 515599470000
14 libhwui.so 0x780c0134bc <unknown> + 515597481148
15 libhwui.so 0x780c1d5800 <unknown> + 515599325184
16 libhwui.so 0x780c1d5560 <unknown> + 515599324512
17 libutils.so 0x781ca2d58c android::Thread::_threadLoop
18 libutils.so 0x781ca2cde8 <unknown> + 515876507112
19 libc.so 0x780776eb14 <unknown> + 515521309460
20 libc.so 0x780770b35c <unknown> + 515520901980
What's good about this issue is that we can reproduce it locally in our debug builds. Is there anything @stefanosiano could try out?
@supervacuus I tried the sample app of the sentry-native repo and it's working fine, no crashes here
@supervacuus I tried the sample app of the sentry-native repo and it's working fine, no crashes here
Do you mean the NDK sample inside the Native SDK repo? That should initialize the native library the same way as in the sentry-android sample. Maybe the latter uses some feature that interacts differently with the platform code on the Xiaomi device. For this, it would be important to understand at which point this abort() is provoked and whether it always happens at some point in the execution.
What's good about this issue is that we can reproduce it locally in our debug builds. Is there anything @stefanosiano could try out?
I would be very interested in seeing the tombstone for the crash. As is often the case on Android, the stack trace of the crashing thread alone is rarely useful. In this case, it is platform code that crashes, but for us, the only interesting thing is what we are doing during or before the abort. This may be visible in one of the other threads' stack traces.
I was also wondering whether system tracing is somehow enabled in the sentry-android sample app you are running. Something like frame-duration tracing could lead to failed DEAD_OBJECT binder transactions in the context of buffer-queue producers in pathological cases.
Another thing that I could imagine, to get to the root of the difference in sample behavior, is whether the sample does any high-pressure graphics-buffer work (like recurring screenshots via PixelCopy, which is synchronized) that could put a load on the buffer-queue/binder pipeline.
@supervacuus Thanks for the info. I made some other tests and found that it happens only when session replay is enabled (along with tracing, profiling, and everything else), which takes screenshots every second, using Bitmap, I think. What I don't understand is why it happens only when sentry-native is also enabled. If I use sentry-android-core, which doesn't include sentry-native, everything works fine. Here are the tombstones with adb bugreport, if these help in any way
bugreport-lmi_eea_poco-SKQ1.211006.001-2024-08-06-10-51-50.zip
I made some other tests and found that it happens only when session replay is enabled (along with tracing, profiling, and everything else), which takes screenshots every second, using Bitmap, I think.
That would at least mean that there is some obvious difference in bufferqueue/binder usage between the two samples. Isolating this is definitely helpful even if it might not be the culprit.
Here are the tombstones with adb bugreport, if these help in any way
bugreport-lmi_eea_poco-SKQ1.211006.001-2024-08-06-10-51-50.zip
This is super interesting: every time your app crashes in the RenderThread, immediately before it there is another tombstone produced for the media.codec hw/android.hardware.media.omx@1.0-service, which crashes with a null dereference in its HwBinder thread. That thread is on the other end of the buffer-producer connection and also in the getFrameTimestamps() path. This process is the DEAD_OBJECT that the binder transaction exception refers to:
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Build fingerprint: 'google/bonito/bonito:10/QQ3A.200805.001/6578210:user/release-keys'
Revision: '0'
ABI: 'arm'
Timestamp: 2024-08-06 10:15:54.152234376+0200
Process uptime: 0s
Cmdline: media.codec hw/android.hardware.media.omx@1.0-service
pid: 32210, tid: 32339, name: HwBinder:32210_ >>> media.codec <<<
uid: 1046
signal 0 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr --------
Cause: null pointer dereference
r0 00000048 r1 e5d52c08 r2 e5d52c18 r3 000039f3
r4 00000000 r5 00007dd2 r6 00000000 r7 e5d52c18
r8 e8ebe274 r9 e5d52c40 r10 e5d52cb0 r11 e5d53060
ip e7cf935c sp e5d52ba0 lr e7ca8ca9 pc e82e6212
backtrace:
#00 pc 0001f212 /apex/com.android.vndk.v30/lib/libui.so (android::FenceTime::Snapshot::getFlattenedSize() const+2) (BuildId: 64c4a3578c87c779e3001f245b9d7a5d)
#01 pc 00052ca5 /apex/com.android.vndk.v30/lib/libgui.so (android::FrameEventHistoryDelta::getFlattenedSize() const+16) (BuildId: d0ff69429501def0367515e2f266214b)
#02 pc 00055405 /apex/com.android.vndk.v30/lib/libgui.so (android::conversion::wrapAs(android::hardware::graphics::bufferqueue::V1_0::IGraphicBufferProducer::FrameEventHistoryDelta*, std::__1::vector<std::__1::vector<native_handle*, std::__1::allocator<native_handle*> >, std::__1::allocator<std::__1::vector<native_handle*, std::__1::allocator<native_handle*> > > >*, android::FrameEventHistoryDelta const&)+28) (BuildId: d0ff69429501def0367515e2f266214b)
#03 pc 0000ce11 /apex/com.android.vndk.v30/lib/libstagefright_bufferqueue_helper.so (android::TWGraphicBufferProducer<android::hardware::graphics::bufferqueue::V1_0::IGraphicBufferProducer, void>::getFrameTimestamps(std::__1::function<void (android::hardware::graphics::bufferqueue::V1_0::IGraphicBufferProducer::FrameEventHistoryDelta const&)>)+128) (BuildId: bcf1848f23152cf7b88c111d515b1f94)
#04 pc 000172ad /apex/com.android.vndk.v30/lib/android.hardware.graphics.bufferqueue@1.0.so (android::hardware::graphics::bufferqueue::V1_0::BnHwGraphicBufferProducer::_hidl_getFrameTimestamps(android::hidl::base::V1_0::BnHwBase*, android::hardware::Parcel const&, android::hardware::Parcel*, std::__1::function<void (android::hardware::Parcel&)>)+140) (BuildId: 5e2b8d3c591a311d2f0287a55ee1f2a4)
#05 pc 00017a8f /apex/com.android.vndk.v30/lib/android.hardware.graphics.bufferqueue@1.0.so (android::hardware::graphics::bufferqueue::V1_0::BnHwGraphicBufferProducer::onTransact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+1462) (BuildId: 5e2b8d3c591a311d2f0287a55ee1f2a4)
#06 pc 0005d01d /apex/com.android.vndk.v30/lib/libhidlbase.so (android::hardware::BHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+48) (BuildId: 3445ac5422fc45c8888fee3c33523157)
#07 pc 0005faaf /apex/com.android.vndk.v30/lib/libhidlbase.so (android::hardware::IPCThreadState::getAndExecuteCommand()+966) (BuildId: 3445ac5422fc45c8888fee3c33523157)
#08 pc 0006088d /apex/com.android.vndk.v30/lib/libhidlbase.so (android::hardware::IPCThreadState::joinThreadPool(bool)+56) (BuildId: 3445ac5422fc45c8888fee3c33523157)
#09 pc 0006c5fd /apex/com.android.vndk.v30/lib/libhidlbase.so (android::hardware::PoolThread::threadLoop()+12) (BuildId: 3445ac5422fc45c8888fee3c33523157)
#10 pc 0000eed9 /apex/com.android.vndk.v30/lib/libutils.so (android::Thread::_threadLoop(void*)+168) (BuildId: 373fcfc8fb18977f88e89ad09552a738)
#11 pc 0000ea15 /apex/com.android.vndk.v30/lib/libutils.so (thread_data_t::trampoline(thread_data_t const*)+256) (BuildId: 373fcfc8fb18977f88e89ad09552a738)
#12 pc 000a8e57 /apex/com.android.runtime/lib/bionic/libc.so (__pthread_start(void*)+40) (BuildId: 14ccc210ec59d35990c4377f0f48f77e)
#13 pc 00061dd3 /apex/com.android.runtime/lib/bionic/libc.so (__start_thread+30) (BuildId: 14ccc210ec59d35990c4377f0f48f77e)
What I don't understand is why it happens only when sentry-native is also enabled. If I use sentry-android-core, which doesn't include sentry-native, everything works fine.
I can't say at this point. It seems it is reporting an honest crash, and that crash is caused outside the sample process (but initiated from your RenderThread), so I am not really sure how it could affect it.
When exactly does the crash happen? Right at the start? After some time? Do you interact with the app in some way? When you don't include the Native SDK, do you see anything in the logcat message that hints at the omx media service?
Another thing I find curious is why that service is contacted in the first place. Does the sample have any media playback or recording in use? Are the session replay frames encoded at some point using OpenMax?
When exactly does the crash happen?
Between 1-5 seconds after the start
Do you interact with the app in some way?
Nope, I don't do anything at all
do you see anything in the logcat message that hints at the omx media service?
I don't think so, but here is the full logcat (copied without any filter) logcat.txt
does the sample have any media playback or recording in use?
The only thing that comes to my mind is session replay
Are the session replay frames encoded at some point using OpenMax?
There is no reference to OpenMax in the code. Also, the replays don't use native code at all, only system APIs
But if the crashing process is hw/android.hardware.media.omx@1.0-service, we shouldn't crash, right? Or am I missing something here?
There was some confusion here on my side, sorry about it.
sentry-android also pulls in sentry-android-replay, in addition to sentry-native. This means that when I removed sentry-android in favor of sentry-android-core, I also removed session replay.
Trying again with sentry-android-core (without native) and sentry-android-replay makes it crash again. Probably an encoder issue on the replay side.
Moving this issue away from sentry-native
Description
When I try to run our sample app on my physical device (Xiaomi), it crashes with this message:
Status.cpp:143] Failed HIDL return status not checked. Usually this happens because of a transport error (error parceling, binder driver, or from unparceling). If you see this in code calling into "Bn" classes in for a HAL server process, then it is likely that the code there is returning transport errors there (as opposed to errors defined within its protocol). Error is: Status(EX_TRANSACTION_FAILED): 'DEAD_OBJECT: '
Crashes are captured by our SDK, so we can see them here. If I remove the NDK integration, no crashes occur. This happens only on my device, so perhaps it is specific to Xiaomi?