getsentry / sentry-java

A Sentry SDK for Java, Android and other JVM languages.
https://docs.sentry.io/
MIT License
1.16k stars 435 forks source link

`pthread_join` failed for sampling thread shutdown: Invalid argument #3561

Open kahest opened 4 months ago

kahest commented 4 months ago

Description

Potential crash when stopping Profiling, more context in internal customer case doc.

This is likely an issue inside of the Android Profiler that our SDK uses for Profiling, similar to https://github.com/getsentry/sentry-java/issues/2604.

The stack trace associated with this issue looks like this:

/apex/com.android.runtime/lib64/bionic/libc.so (abort+164)
/apex/com.android.art/lib64/libart.so (art::Runtime::Abort(char const*)+904)
/apex/com.android.art/lib64/libbase.so (android::base::SetAborter(std::__1::function<void (char const*)>&&)::$_0::__invoke(char const*)+80)
/apex/com.android.art/lib64/libbase.so (android::base::LogMessage::~LogMessage()+352)
/apex/com.android.art/lib64/libart.so (art::Trace::StopTracing(bool, bool)+1568)
...
kahest commented 3 months ago

The report we got for this is from a RN app running on Android, which might be a red herring, but we should be aware. FYI @krystofwoldrich

supervacuus commented 3 months ago

Quick update regarding the first stack trace:

What happened here is that

This differs from #2604 because pthread_join can find the thread in Bionic's global thread list. However, the join failed because the thread was already joined, probably because the second thread tried to stop tracing between guards, preventing another attempt to join and reset the guarded state at the end of the function.

What could happen concretely (just reading from the code):

Of course, this can only happen if two threads can call Trace::StopTracing() concurrently, and this isn't prevented on a higher level. But given the already present locks, I imagine this is not the case (another hint is the existence of the num_trace_stops_initiated_ counter).

There can be good reasons why the function doesn't hold the trace_lock_ over the entire scope, but I didn't dive into that further. Maybe this can be fixed by checking the_trace_->stop_tracing_ == true in the first lock scope and returning because it means a "stop" is already in progress.

Another interesting aspect is that sampling_pthread == 0U only means that pthread_join() isn't called, but the execution of the remaining function proceeds. In that case, it might be better to tie the reset of sampling_pthread to pthread_join() (which previously was the case [6]).

Please keep in mind that I am referencing solely the main branch here, and there could certainly be additional ways in which this behavior could emerge if we knew exactly which versions of bionic and the runtime were in use in the crash cases. This is not the investigation being done here 😅. I still think this makes sense because we know where and why this probably happened and the expected behavior in the latest state of the implementation.

References to follow along:

[1] https://cs.android.com/android/_/android/platform/art/+/main:runtime/trace.cc;l=914-919 [2] https://cs.android.com/android/platform/superproject/main/+/main:bionic/libc/bionic/pthread_join.cpp;l=53-55 [3] https://cs.android.com/android/platform/superproject/main/+/main:bionic/libc/bionic/pthread_create.cpp;l=476-485 [4] https://cs.android.com/android/_/android/platform/art/+/main:runtime/trace.cc;l=858-860 [5] https://cs.android.com/android/_/android/platform/art/+/main:runtime/trace.cc;l=968-969 [6] https://cs.android.com/android/_/android/platform/art/+/71cb074032f1b42349df77fd9d143ff5971b87ae:runtime/trace.cc;dlc=b0bdf29f81eda70342e898ec151d1165e5a758f0

supervacuus commented 3 months ago

I was able to reproduce the issue using a vanilla Android app template (without sentry) that pretty much validates my "concrete" race described above. I used an Android 14 image in the emulator without Google Play to root and (Frida) trace the ART method tracer, too.

Essentially, the following code is enough to trigger the mentioned race, which leads to the same stack trace and "cause" message as seen coming from the Google error reporting (using the default params from package io.sentry.android.core.AndroidProfiler):

Debug.startMethodTracingSampling(null, 3_000_000, 30_000);
Thread.sleep(1000)
val threads = List(10) {
    thread(start = false) {
        Debug.stopMethodTracing()
    }
}
threads.forEach { it.start() }
threads.forEach { it.join() }

Image

I also got another stack trace that is still a pthread_join race in Trace::StopTracing() but curiously ends up aborting from __pthread_internal_find() (like #2604 but with another cause message, can't explain why rn):

Image

While this repro code is not something we'll find in the wild, there is a practical chance that multiple threads will stop tracing concurrently via "scope-closing" abstractions (think spans/txn). The runtime shouldn't crash in these cases but rather log an error (like it does when stop is called on an inactive trace).

I already prepped a patch for the runtime, but I can't send it for review to AOSP Gerrit before my vacation (and given the potential review back-and-forth, it doesn't make sense either). Maybe I can test the patched runtime against the repro before sending it for review. In any case, will do this when I am back.

romtsn commented 3 months ago

I already prepped a patch for the runtime, but I can't send it for review to AOSP Gerrit before my vacation

That's awesome! Even if you do that though, it means profiling is practically unusable until the new OS update ships with your patch included, which is unfortunate. Could you share the patch here, so we could think of applying it via something like https://github.com/bytedance/android-inline-hook for the older OS versions?

stefanosiano commented 3 months ago

that's great! however, in our Android SDK we call the Debug.startMethodTracingSampling and Debug.stopMethodTracing() inside synchronized methods in the AndroidProfiler class, and this should prevent this case when using the Sentry SDK. Would you say it's still possible this happens when the thread.join is called in the native layer, given no other calls to Debug.stopMethodTracing is made by the user?

supervacuus commented 3 months ago

that's great! however, in our Android SDK we call the Debug.startMethodTracingSampling and Debug.stopMethodTracing() inside synchronized methods in the AndroidProfiler class, and this should prevent this case when using the Sentry SDK.

Yes, I've seen that, but since Debug.stopMethodTracing() is a static method that acts on global trace data, nothing prevents code outside the SDK from calling it concurrently to AndroidProfiler.endAndCollect(). Of course, this is something that can be documented in red letters, and Sentry can ask affected customers to look in their code or their dependencies.

Would you say it's still possible this happens when the thread.join is called in the native layer, given no other calls to Debug.stopMethodTracing is made by the user?

In theory, yes, but in contrast to #2604, I would consider it much less likely. In #2604, we don't know whether a particular thread is missing or just any arbitrary thread and whether these threads were created from native code that forgot to detach (all that is still an assumption).

Here, the cause is always the already joined sampling thread of the ART method tracer. While it is possible that some native code would join the sampling thread out of the blue (it is neither created from native client code nor typically interacting with it), I think some other Java code to call Debug.stopMethodTracing() is much more likely.

supervacuus commented 3 months ago

I already prepped a patch for the runtime, but I can't send it for review to AOSP Gerrit before my vacation

That's awesome! Even if you do that though, it means profiling is practically unusable until the new OS update ships with your patch included, which is unfortunate.

It is certainly not great, but I wouldn't go so far as to call it unusable. I think, and please correct me if that assumption is wrong, that most SDK clients will not interact with the method tracing directly (or via one of their other dependencies).

As @stefanosiano mentioned, having synchronized invocations to Debug.stopMethodTracing() entirely eliminates the problem within the SDK. I think that having this behavior documented in an alert box with all features that end up calling AndroidProfiler.endAndCollect() and asking affected users for other potential uses of the ART method tracer in their code would go a long way in preventing this issue from happening.

Could you share the patch here, so we could think of applying it via something like https://github.com/bytedance/android-inline-hook for the older OS versions?

Yes, of course, but please let me test the code before. Adding the fix via hooks won't be trivial since it changes the control flow based on the private state, and I can't think of some blunt wrapping behavior. But let's talk details when I am back.

supervacuus commented 3 months ago

report on the ART issue-tracker: https://issuetracker.google.com/issues/362293861

kahest commented 2 months ago

moving this to blocked while we await feedback from Google

kahest commented 2 months ago

Update: there is some movement on the ART issue tracker: https://issuetracker.google.com/issues/362293861#comment6 Our tests indicate that the fix mentioned in the comment fixes the reproducer we created for this issue