Crashpad Handler Unintended Behaviors after installation of Sentry

JLuse commented 1 year ago

Description

A customer has reported the following issues only after updating one of their products (the host) to use Sentry.

Issue 1: When the crashpad_handler process is terminated (e.g., using kill -TERM ), the client repetitively attempts to restart the handler, leading to consistent failures.

Issue 2: On teardown of the plugin's crashpad_handler instance, it inadvertently terminates the host process's crashpad_handler. This behavior is observed even when manually terminating the plugin's handler (which is confirmed to be a separate process with a distinct PID); soon after, the host's crashpad handler is also terminated.

Prior to this, when both the customer indicated the host and the plugin were directly using Crashpad for crash reporting, these issues were not observed.

[ ] During build
[X] During run-time
[ ] When capturing a hard crash

Steps To Reproduce & Log output

Customer has provided two repro apps as well as accompanying logs within the support ticket: https://sentry.zendesk.com/agent/tickets/105906

supervacuus commented 1 year ago

Hi @JLuse, I need help understanding the issue (without access to Zendesk for context).

Issue 1: When the crashpad_handler process is terminated (e.g., using kill -TERM ), the client repetitively attempts to restart the handler, leading to consistent failures.

Why is it necessary to manually kill the crashpad_handler? Does the customer expect things to work when the crashpad_handler gets killed from the shell? Because I am unaware of support for this.

Issue 2: On teardown of the plugin's crashpad_handler instance, it inadvertently terminates the host process's crashpad_handler. This behavior is observed even when manually terminating the plugin's handler (which is confirmed to be a separate process with a distinct PID); soon after, the host's crashpad handler is also terminated.

This one needs to be clarified, too: which plugin are you referring to? Is this happening in the context of Unity or Unreal? The words "plugin" and "host" do not refer to anything specific in sentry-native terminology, so this is either usage from the customer domain or from one of the downstream SDKs.

CC: @kahest, @bitsandfoxes

willmbaker commented 12 months ago

Hi @supervacuus, I may be able to help you with some of those specifics ...

Why is it necessary to manually kill the crashpad_handler? Does the customer expect things to work when the crashpad_handler gets killed from the shell? Because I am unaware of support for this.

We're using Crashpad in the context of an audio plugin, loaded within a DAW. The reason that we are killing the handler so that when the last plugin instance is removed from the DAW we also remove any process that the first instance may have spun up.

This one needs to be clarified, too: which plugin are you referring to? Is this happening in the context of Unity or Unreal? The words "plugin" and "host" do not refer to anything specific in sentry-native terminology, so this is either usage from the customer domain or from one of the downstream SDKs.

"Plugin" and "Host" here refer to a VST3 or AU audio plugin running within a host process, typically a DAW of some description.

The problem seems to be that while two Crashpad systems appeared to be able to run side-by-side in this fashion, after upgrading to Sentry the plugin (which continues to use vanilla Crashpad) when pulling down its own handler it somehow (and mysteriously to someone like me who doesn't know Mach-O well) it also pulls down the Sentry handler.

This then causes the further problem that once the host's Sentry crashpad_handler has been pulled down somehow with the other one, the host then gets stuck in a permanent loop attempting to execute a new handler which subsequently causes us some serious performance problems (leading to audio drop-outs and degradation of the user's audio).

Hopefully that helps? I understand that it is hardly a standard set-up, but was hoping to get some guidance on what exactly the Mach-O interaction is that causes both handlers to die.

Cheers, Will Baker.

kahest commented 12 months ago

Hi @willmbaker, thank you for the context, we'll discuss and follow up here!

supervacuus commented 12 months ago

Hi @willmbaker, thank you for the additional information. However, before we can further discuss this internally, I need to ensure that we know exactly what is happening.

We're using Crashpad in the context of an audio plugin, loaded within a DAW. The reason that we are killing the handler so that when the last plugin instance is removed from the DAW we also remove any process that the first instance may have spun up.

Okay, I understand. There can be multiple plugin instances loaded in the DAW process. Each instance relies on crashpad for error-reporting. The first instance started, spins up the crashpad_handler, and the last closed, kills it. All plugin instances share a crashpad client (i.e., a crashpad::CrashpadClient instance), correct?

The problem seems to be that while two Crashpad systems appeared to be able to run side-by-side in this fashion, after upgrading to Sentry the plugin (which continues to use vanilla Crashpad) when pulling down its own handler it somehow (and mysteriously to someone like me who doesn't know Mach-O well) it also pulls down the Sentry handler.

Here, I have trouble following. So, previously, you were running multiple (vanilla) crashpad clients (and related handlers) within the same process space? One for the host and one (or multiple) for the plugin instances? Then you switched to Sentry's Native SDK, but only for one of those clients (the one in the host)? The plugin client(s) still run(s) with vanilla crashpad?

Can you elaborate on the choice to run multiple crashpad clients? Is it because the plugins can run in other DAWs, and so are deployed by default with a separate crashpad configuration, but you also develop a DAW, and now, when you run your plugins in your own DAW, you end up in this situation? Or do you require an isolation between the reports?

This then causes the further problem that once the host's Sentry crashpad_handler has been pulled down somehow with the other one, the host then gets stuck in a permanent loop attempting to execute a new handler which subsequently causes us some serious performance problems (leading to audio drop-outs and degradation of the user's audio).

This one I can answer right away: we start the crashpad_handler with the restartable flag set to true. This means that a thread in the crashpad client observes whether the crashpad_handler was killed and attempts to restart it in this case.

This is why your host process gets stuck when killing the handler. I can imagine that two or more clients starting multiple handlers could confuse this mechanism and may explain why you see the interaction when killing one handler affects the other.

To boil this down:

should we expose the restartable crashpad_handler in our options? this seems to be the most actionable response, but it means we expose yet another backend specific interface. This is related with the recurring wish of exposing the database pruning rules for crashpad.
can we support multiple (vanilla) crashpad client instances running in tandem with the Native SDK?
Are there other options that would prevent multiple clients in the same process?

willmbaker commented 11 months ago

All plugin instances share a crashpad client (i.e., a crashpad::CrashpadClient instance), correct?

That is correct.

Then you switched to Sentry's Native SDK, but only for one of those clients (the one in the host)? The plugin client(s) still run(s) with vanilla crashpad?

Yup, that's it.

Can you elaborate on the choice to run multiple crashpad clients?

It is as you say, our plugin (which existed before the DAW) may run in the context of other DAWs, not just ours.

This means that a thread in the crashpad client observes whether the crashpad_handler was killed and attempts to restart it in this case. This is why your host process gets stuck when killing the handler.

That's interesting ... this makes sense that it would get torn down and then start again, however the behaviour we're seeing is that is it stuck in an infinite loop trying to restart, but it never does. If it correctly restarted then we could easily live with the behaviour of it being torn down when the last plugin is closed, however what it actually does is continually spins up the crashpad_handler process and will never stop.

should we expose the restartable crashpad_handler in our options?

I think that restarting when it dies is good behaviour, but it should just restart the one time.

Can we support multiple (vanilla) crashpad client instances running in tandem with the Native SDK?

This was the big question for me, as my understanding of Mach-O is not so good that I can determine through what action one process is being pulled down because another, unrelated, process is killed.

I would also like to emphasize the restarting behaviour ... I'm not sure if you have access to the minimal reproduction case that I sent in (I've attached a bare Git repository of it here sentry.zip ) ... however you can observe this simply by running any application with a crashpad_handler process and then killing that process. You can see from the output Sentry sends to stdout that it is stuck in a loop attempting to restart the process but never managing to do it. This behaviour seems to me a clear defect, certainly we could live with the handler being shut down and then restarted if it was working.

Cheers, and thanks for the help so far!

Will.

supervacuus commented 11 months ago

Thanks for the details @willmbaker! I haven't forgotten to respond, but this was a more elaborate investigation. I will write a summary of the situation and possible solutions approaches either today or tomorrow. Thanks for your patience.

willmbaker commented 11 months ago

No problem at all, thanks for your attention! :)

supervacuus commented 11 months ago

That's interesting ... this makes sense that it would get torn down and then start again, however the behaviour we're seeing is that is it stuck in an infinite loop trying to restart, but it never does. If it correctly restarted then we could easily live with the behaviour of it being torn down when the last plugin is closed, however what it actually does is continually spins up the crashpad_handler process and will never stop.

There are multiple issues at play here. Let's focus on one thing first:

I was able to reproduce the described behavior on macOS. The restart thread continuously tries to restart the crashpad_handler, but the newly started crashpad_handler fails when it tries to request a no-senders notification, then it shuts down, and ~1s later, the restarter thread tries to start it again and so on.

The logs you see come directly from Crashpad, not the Native SDK. Further, I could reproduce the issue in upstream Crashpad from the main branch (w/o any involvement of Sentry's Native SDK).

This means the problem is in the implementation of the restarter thread upstream. I have not yet dived into the "why," but something has probably changed in the behavior of mach-ports in recent macOS versions. Which macOS version are you testing this on?

should we expose the restartable crashpad_handler in our options?

I think that restarting when it dies is good behaviour, but it should just restart the one time.

Since no interface allows you to shut down the restarter thread, in this configuration, you will never be able to terminate the crashpad_handler. Whether or not the restart works on the first attempt is orthogonal to the issue. But stopping the crashpad_handler wouldn't make sense in that case.

As far as I understood, your intention isn't to kill the crashpad_handler of the Native SDK at all, so that seems to be only a side-quest anyway.

Your problem that terminating the crashpad_handler started for your plugin also terminates the handler started for the Native SDK has to do with the Mach-port handling of the core message server in the crashpad_handler and the client's connection to it.

In short, the crashpad client is stateful (referencing the exception port), and if we do not keep it around during the lifecycle of the Native SDK, killing the crashpad_handler of any following client instance will also tear down our crashpad_handler.

This is also the case when you start two handlers from one client (and, again, I was able to replicate this with vanilla crashpad). We didn't need to keep the client around before, but that could solve your problem.

Can we support multiple (vanilla) crashpad client instances running in tandem with the Native SDK?

This was the big question for me, as my understanding of Mach-O is not so good that I can determine through what action one process is being pulled down because another, unrelated, process is killed.

In general, all crash-handling/signaling mechanisms register per process (this is the same on Windows with its various handlers, Linux/Unix signal handlers, and mach exception ports on macOS). The topic is more complicated when you consider multiple threads. However, these threads still share the same singular handler implementation.

Whether installing multiple handlers "works" depends on the order of initialization, whether the mechanism allows the safe invocation of previously installed handlers, and whether the handlers cooperate.

Most of the time, one handler will overwrite the other and then be the sole crash-reporting mechanism (on any OS). Point in case (on macOS): if you start two different crashpad handlers via respective client instances in a single process, only the second one will report a crash. Further, if you then terminate the crashpad_handler of the second client, no crash will be reported. Also, starting two handlers (for instance, with different DSNs) from a single client will deliver to only one endpoint (to be clear: all of this is true for vanilla crashpad, without Sentry's Native SDK).

I understand the situation in which you are using two handlers, but be aware that this might lead to hard-to-diagnose behavior (independent of whether terminating a crashpad_handler works or not). I would recommend checking in the plugin initialization to see if you are running in your own DAW; in that case, please don't initialize Crashpad in the plugin.

Starting Crashpad in the plugin when loaded in another DAW can be even more problematic because your handler could overwrite the behavior of another crash handler installed by your host, which might report crashes that have nothing to do with your plugin. Still, the developer of the host could miss relevant crashes because you received them instead.

Summary of the issues:

two or more crashpad instances + handlers running in the same process are problematic. Even if you haven't had any issues, at least one of them will fail to report crashes on macOS.
upstream Crashpad has a problem with restarting the handler on macOS. This bug should be fixed (ideally by Google), but it is orthogonal to your problem (but I can see why it makes things worse).
you don't need the restarting feature, and we could expose turning it off or disabling it on macOS altogether, but...
the actual problem you have is that shutting down the crashpad_handler of your plugins also shuts down the crashpad_handler of the host (which starts through the Native SDK). That doesn't happen in vanilla crashpad (if you keep two separate client instances around), and if that didn't happen when using the Native SDK, 2.) and 3.) would no longer be an issue for you. We can fix this in the Native SDK, although I strongly suggest reviewing your setup's multiple client/handler architecture.

willmbaker commented 11 months ago

Great, thank you @supervacuus. So what I'm hearing is that you don't believe that it is possible to engage more than one crashpad_handler in a single process? This makes some kind of sense, I suppose, and is definitely the case on Windows though our observations on macOS (which is the only platform on which try to have our plugins register a crash handler) initially were that they did seem to function together.

That said, I'm willing to accept your explanation and particularly as you say it happens in vanilla Crashpad and may be a change in Mach-O behaviour ... we've been seeing it on our automation machines which I believe are Catalina at the moment.

However, at point 4 there, are you saying that you do think it would be possible to change the current behaviour of the plugin pulling down the host handler? At point 1, however, am I correct that you're saying that you don't think that crash reports would be sent for both handlers?

I hear what you're saying on reviewing the architecture, and will do so, as I feel like we predominately get crashes from other plugins running in the same DAW so it has been of limited utility, to a certain degree, we tend to catch the crashes in automated tests before we ship via the host's crash reporting anyway.

Again, thanks for your help and attention :)

supervacuus commented 11 months ago

Great, thank you @supervacuus. So what I'm hearing is that you don't believe that it is possible to engage more than one crashpad_handler in a single process?

It's possible to start two or more clients/handlers, but as described, they probably won't do what you'd expect.

However, at point 4 there, are you saying that you do think it would be possible to change the current behaviour of the plugin pulling down the host handler?

We can fix this and keep the client instance during the Native SDK life cycle (with it, the send right to the exception port). Incidentally, this would prevent our crashpad_handler from shutting down when you terminate the one in the plugin. But on macOS, the Native SDK still won't be able to send a crash report if the plugin's client/handler was initialized after sentry_init().

At point 1, however, am I correct that you're saying that you don't think that crash reports would be sent for both handlers?

Yes, only the last client/handler pair that was instantiated on macOS will send crash reports.

I hear what you're saying on reviewing the architecture, and will do so, as I feel like we predominately get crashes from other plugins running in the same DAW so it has been of limited utility, to a certain degree, we tend to catch the crashes in automated tests before we ship via the host's crash reporting anyway.

Yeah, I can imagine this to be the case. Ideally, you want some form of in-process isolation, but none of the current crash-handling mechanisms (on any platform) can provide that directly (they are all process-global). Isolation can happen only as a post-processing step in the backend, depending on the meta-data available (or solely on the stack trace of the crashing thread).

Our current "scope" implementation in the Native SDK doesn't support multiple execution contexts. This might not help in an architecture where multiple plugins aren't running in full (thread) isolation but, for instance, are scheduled to run from a small pool of threads. But even in a scenario where you'd have better scoping support, it would still have to be one global handler taking care of all crashes.

Again, thanks for your help and attention :)

You're very welcome.

willmbaker commented 11 months ago

Great, thanks for your help @supervacuus ... We are going to re-think our use of crash reporting in plugins after your advice. I'm happy if you close this as not a bug or similar, thank you so much for your time :)

supervacuus commented 11 months ago

@willmbaker, we added the fix via https://github.com/getsentry/sentry-native/pull/910. It will be part of the next release.

supervacuus commented 10 months ago

The fix is part of the 0.7.0 release.

getsentry / sentry-native

Crashpad Handler Unintended Behaviors after installation of Sentry #900