getsentry / sentry-native

Sentry SDK for C, C++ and native applications.
MIT License

Crash on arbitrary signal that is blocked on main thread #806

Open richardhozak opened 1 year ago

richardhozak commented 1 year ago

Description

In our application we use epoll to manage our main application loop, and we also let it handle a plethora of Linux signals instead of using signal handlers. To handle a Linux signal with epoll, you need to block the signal and pass it to signalfd, which is in turn registered with epoll so the loop can check whether a signal is waiting to be handled. The signals need to be blocked in the main thread so they do not crash our application, since they are not handled the traditional way (via signal handlers).
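For context, this is roughly what that pattern looks like (a minimal sketch of our setup, not our actual code; error handling omitted and the function name is just for illustration):

```c
#include <pthread.h>
#include <signal.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>

// Block the signals we care about in the calling (main) thread and return a
// signalfd registered with the given epoll instance. The main loop then
// consumes signals by reading struct signalfd_siginfo from that fd instead
// of installing signal handlers.
static int route_signals_to_epoll(int epfd)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGTERM);
    sigaddset(&mask, SIGINT);
    pthread_sigmask(SIG_BLOCK, &mask, NULL);

    int sfd = signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);

    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = sfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev);
    return sfd;
}
```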

The problem arises when another thread is running that does not have the signals blocked: when the Linux kernel delivers a process-directed signal that is blocked in the main thread, it sees that the main thread has the signal blocked and instead delivers it to any thread that does not block it, and if that thread does not handle it, the whole app crashes.

When using the curl transport backend provided by the sentry-native library, sentry_init creates a background worker (bgworker) thread for sending events to sentry.io with libcurl. This thread does not block any signals, so the Linux kernel may deliver a process-directed signal to it, crashing our app.

When does the problem happen

Environment

We do not use any crash handling backend (SENTRY_BACKEND=none) yet and only use the event reporting facilities of sentry-native.

Steps To Reproduce

Since providing steps to reproduce is quite involved and harder than describing the actual solution, I think it is best to demonstrate a possible change to sentry-native that would mitigate this problem.

This is a possible implementation of sentry__thread_spawn that would mitigate the issue described above:

// block all signals so threads created by sentry_init do not receive any
// signals; this way the SDK does not interfere with signal handling on the
// main thread
sigset_t oldset;
sigset_t allset;
sigfillset(&allset);
pthread_sigmask(SIG_SETMASK, &allset, &oldset);

pthread_create(...);

// restore the signal mask back to 'oldset', so we do not mess
// with the rest of the program running on the main thread
pthread_sigmask(SIG_SETMASK, &oldset, NULL);

Workaround Steps

For anyone who runs into this issue, the workaround is the same as the solution above, applied one level higher: instead of wrapping pthread_create(), you wrap sentry_init(). You block all signals, call sentry_init() so that all threads it creates inherit the blocked mask, and then restore the original mask (see the sketch below).
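In code, the workaround looks roughly like this (a sketch; error handling omitted and the DSN parameter is just illustrative):

```c
#include <pthread.h>
#include <signal.h>
#include <sentry.h>

// Block every signal before sentry_init() so the threads it spawns inherit a
// fully blocked mask, then restore the original mask on the main thread.
static void init_sentry_with_blocked_signals(const char *dsn)
{
    sigset_t allset, oldset;
    sigfillset(&allset);
    pthread_sigmask(SIG_SETMASK, &allset, &oldset);

    sentry_options_t *options = sentry_options_new();
    sentry_options_set_dsn(options, dsn);
    sentry_init(options);

    pthread_sigmask(SIG_SETMASK, &oldset, NULL);
}
```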

Log output

The log output should not be relevant, but here it is for completeness.

Output

``` [sentry] INFO using database path "/some/path/.sentry-native" [sentry] DEBUG starting transport [sentry] DEBUG starting background worker thread [sentry] DEBUG processing and pruning old runs [sentry] DEBUG background worker thread started [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG executing task on worker thread [sentry] DEBUG merging scope into event [sentry] DEBUG trying to read modules from /proc/self/maps [sentry] DEBUG read 39 modules from /proc/self/maps [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG merging scope into event [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG merging scope into event [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG merging scope into event [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG merging scope into event [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread * Trying 34.120.195.249:443... * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * ALPN, offering h2 * ALPN, offering http/1.1 * CAfile: /etc/ssl/certs/ca-certificates.crt * CApath: /etc/ssl/certs * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 * ALPN, server accepted to use h2 * Server certificate: * subject: CN=*.ingest.sentry.io * start date: Dec 18 01:07:14 2022 GMT * expire date: Mar 18 01:07:13 2023 GMT * subjectAltName: host "o504248.ingest.sentry.io" matched cert's "*.ingest.sentry.io" * issuer: C=US; O=Let's Encrypt; CN=R3 * SSL certificate verify ok. * Using HTTP2, server supports multiplexing * Connection state changed (HTTP/2 confirmed) * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0 * Using Stream ID: 1 (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:313 * We are completely uploaded and fine * old SSL session ID is stale, removing < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 2 < access-control-expose-headers: x-sentry-rate-limits, x-sentry-error, retry-after < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG executing task on worker thread * Found bundle for host o504248.ingest.sentry.io: 0x7efda8007c00 [can multiplex] * 17 bytes stray data read before trying h2 connection * Re-using existing connection! 
(#0) with host o504248.ingest.sentry.io * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * Using Stream ID: 3 (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:9816 * We are completely uploaded and fine < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 41 < access-control-expose-headers: retry-after, x-sentry-error, x-sentry-rate-limits < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {"id":"cd06982692ce48e5cc93cddb5c946a25"}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG executing task on worker thread * Found bundle for host o504248.ingest.sentry.io: 0x7efda8007c00 [can multiplex] * Re-using existing connection! (#0) with host o504248.ingest.sentry.io * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * Using Stream ID: 5 (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:9798 * We are completely uploaded and fine < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 41 < access-control-expose-headers: x-sentry-rate-limits, retry-after, x-sentry-error < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {"id":"4080458c40e442cf2fb6b6feb1ba05a2"}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG executing task on worker thread * Found bundle for host o504248.ingest.sentry.io: 0x7efda8007c00 [can multiplex] * Re-using existing connection! (#0) with host o504248.ingest.sentry.io * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * Using Stream ID: 7 (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:9811 * We are completely uploaded and fine < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 41 < access-control-expose-headers: x-sentry-rate-limits, x-sentry-error, retry-after < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {"id":"2a0d91f0097a43030d1244d7a96108f9"}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG executing task on worker thread * Found bundle for host o504248.ingest.sentry.io: 0x7efda8007c00 [can multiplex] * 17 bytes stray data read before trying h2 connection * Re-using existing connection! 
(#0) with host o504248.ingest.sentry.io * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * Using Stream ID: 9 (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:9798 * We are completely uploaded and fine < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 41 < access-control-expose-headers: x-sentry-rate-limits, retry-after, x-sentry-error < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {"id":"380d35d9849246b4e7128688e8d0b2bc"}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG executing task on worker thread * Found bundle for host o504248.ingest.sentry.io: 0x7efda8007c00 [can multiplex] * 17 bytes stray data read before trying h2 connection * Re-using existing connection! (#0) with host o504248.ingest.sentry.io * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * Using Stream ID: b (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:9811 * We are completely uploaded and fine < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 41 < access-control-expose-headers: x-sentry-error, x-sentry-rate-limits, retry-after < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {"id":"b77aa15dda92437e990f67e24aea5563"}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG merging scope into event [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG executing task on worker thread * 17 bytes stray data read before trying h2 connection # THIS is where we send signal to our linux application that is expected in mainloop, but received by sentry bgworker thread, crashing the app 15:25:56: app crashed. ```


If desired, we can provide a corresponding PR fixing this issue. Would you be open to such a contribution? Do you think this is a good solution to our problem, or could it be solved in a different way?

past-due commented 1 year ago

I wonder if it's also worth adding the following to src/transports/sentry_transport_curl.c:

curl_easy_setopt(curl, CURLOPT_NOSIGNAL, (long)1);

References:
- https://curl.se/libcurl/c/CURLOPT_NOSIGNAL.html
- https://stackoverflow.com/questions/21887264/why-libcurl-needs-curlopt-nosignal-option-and-what-are-side-effects-when-it-is

richardhozak commented 1 year ago

I think that's a good point, following the links there is also https://curl.se/libcurl/c/threadsafe.html which says:

Signals are used for timing out name resolves (during DNS lookup) - when built without using either the c-ares or threaded resolver backends. When using multiple threads you should set the CURLOPT_NOSIGNAL option to 1L for all handles. Everything will or might work fine except that timeouts are not honored during the DNS lookup - which you can work around by building libcurl with c-ares or threaded-resolver support. c-ares is a library that provides asynchronous name resolves. On some platforms, libcurl simply will not function properly multi-threaded unless the CURLOPT_NOSIGNAL option is set.

and continues:

When CURLOPT_NOSIGNAL is set to 1L, your application needs to deal with the risk of a SIGPIPE (that at least the OpenSSL backend can trigger). Note that setting CURLOPT_NOSIGNAL to 0L will not work in a threaded situation as there will be race where libcurl risks restoring the former signal handler while another thread should still ignore it.

So, following that recommendation, CURLOPT_NOSIGNAL should probably be set to 1, since anything linking sentry-native becomes multithreaded once sentry-native spawns its background thread.

The remaining problem is that the main thread could then receive SIGPIPE and, if it is not handled, the app would crash.
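For completeness, ignoring SIGPIPE process-wide would look roughly like this (a sketch, not code from sentry-native):

```c
#include <signal.h>

// Ignore SIGPIPE process-wide so a peer closing the connection does not
// terminate the program; write()/send() then fail with EPIPE instead.
static void ignore_sigpipe(void)
{
    struct sigaction sa;
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGPIPE, &sa, NULL);
}
```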

Investigating a bit further led me to the AsynchDNS feature of curl/libcurl, which can be queried with curl --version. It seems to be enabled by default and enables the threaded resolver or, if available, the c-ares resolver. This goes hand in hand with the other recommendation that multithreaded apps should use libcurl built with c-ares or threaded-resolver support, and it would resolve the remaining problem, since libcurl seems to be built with the threaded resolver by default.
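For reference, the same feature can also be checked at runtime via libcurl's curl_version_info (a minimal sketch):

```c
#include <stdio.h>
#include <curl/curl.h>

// Report whether the libcurl we link against was built with the AsynchDNS
// feature (threaded resolver or c-ares), i.e. whether name resolution avoids
// the signal-based timeout mechanism.
int main(void)
{
    curl_version_info_data *info = curl_version_info(CURLVERSION_NOW);
    int has_asynch_dns = (info->features & CURL_VERSION_ASYNCHDNS) != 0;
    printf("AsynchDNS: %s\n", has_asynch_dns ? "yes" : "no");
    return has_asynch_dns ? 0 : 1;
}
```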

supervacuus commented 1 year ago

Thanks, @zxey. It was only a matter of time before a use case with more elaborate signal-handling requirements raised this issue. You can certainly prepare a PR for this. By design, our signal handlers don't have to run in any of the background workers, but we might need to discuss/evaluate the consequences in the scenario where the handlers of a crash backend are active. As a first step, I could imagine providing a compile-time flag that is only considered when SENTRY_BACKEND=none.

But the issues mentioned in the comments do not entirely overlap and should be handled separately, or only where they interact. Specifically, we have to consider how this interacts with CURLOPT_NOSIGNAL. While I also think this is a bug in the current configuration and the option should probably be set to 1 (thanks, @past-due), if we exclude the worker threads from signal delivery (as @zxey proposed), then I don't see how the described race would even be possible.

We currently do not control the build of libcurl used by our users as a dependency, so I can only imagine providing compile-time flags (or checking for libcurl features in CMake if that is enough) for users who know what they are doing so that our current and future implementation doesn't interfere with respective resolvers.

supervacuus commented 1 year ago

Sorry, I have been sick for the last two weeks. I also had a deeper look into the topic and came to the following insights:

This leads me to conclude that it doesn't make much sense to invest time configuring CURLOPT_NOSIGNAL or handling any potentially escaping signals due to that configuration.

It also means there is no bug in the current transport implementation except for a vanishingly small portion of the libcurl install base (though I would be interested in concrete counter-examples).

This also answers why we never got any bugs reported where people see the SIGPIPE/SIGALRM signals (both have Term disposition) terminating their programs or interfering with any of the currently installed signal handlers.

The only immediate sensible action would be to require the AsynchDNS feature both at build- and runtime to be more explicit about the currently implicit assumption in our implementation. I will create a PR for this.

Regarding the initial suggestion from @rhzx86: this mostly makes sense as a build-time configuration where the default should be not to mask the signals, because masking also affects thread-directed signals (like SIGSEGV). That means any thread-directed signal with Term disposition caused by one of our threads or its dependencies would terminate the program, but the SDK would not report the crash to the backend. This is not a good default, since it would surprise most users.
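As a rough illustration only (SENTRY_BLOCK_SIGNALS_IN_THREADS is a hypothetical option name, and the signature of sentry__thread_spawn is simplified here), such a build-time switch could look like:

```c
#include <pthread.h>
#include <signal.h>

// Hypothetical opt-in build flag; not an existing sentry-native option.
int
sentry__thread_spawn(pthread_t *thread, void *(*func)(void *), void *arg)
{
#ifdef SENTRY_BLOCK_SIGNALS_IN_THREADS
    // Opt-in: the spawned thread inherits a fully blocked mask, so
    // process-directed signals are never delivered to SDK threads. A fault
    // raised inside an SDK thread (e.g. SIGSEGV) would then terminate the
    // process without being reported, which is why this cannot be the default.
    sigset_t allset, oldset;
    sigfillset(&allset);
    pthread_sigmask(SIG_SETMASK, &allset, &oldset);
    int rv = pthread_create(thread, NULL, func, arg);
    pthread_sigmask(SIG_SETMASK, &oldset, NULL);
    return rv == 0 ? 0 : 1;
#else
    return pthread_create(thread, NULL, func, arg) == 0 ? 0 : 1;
#endif
}
```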

CC: @bitsandfoxes, @markushi, @Swatinem