richardhozak opened this issue 1 year ago
I wonder if it's also worth adding the following to `src/transports/sentry_transport_curl.c`:
`curl_easy_setopt(curl, CURLOPT_NOSIGNAL, (long)1);`
References: https://curl.se/libcurl/c/CURLOPT_NOSIGNAL.html https://stackoverflow.com/questions/21887264/why-libcurl-needs-curlopt-nosignal-option-and-what-are-side-effects-when-it-is
I think that's a good point; following the links, there is also https://curl.se/libcurl/c/threadsafe.html which says:
Signals are used for timing out name resolves (during DNS lookup) - when built without using either the c-ares or threaded resolver backends. When using multiple threads you should set the CURLOPT_NOSIGNAL option to 1L for all handles. Everything will or might work fine except that timeouts are not honored during the DNS lookup - which you can work around by building libcurl with c-ares or threaded-resolver support. c-ares is a library that provides asynchronous name resolves. On some platforms, libcurl simply will not function properly multi-threaded unless the CURLOPT_NOSIGNAL option is set.
and continues:
When CURLOPT_NOSIGNAL is set to 1L, your application needs to deal with the risk of a SIGPIPE (that at least the OpenSSL backend can trigger). Note that setting CURLOPT_NOSIGNAL to 0L will not work in a threaded situation as there will be race where libcurl risks restoring the former signal handler while another thread should still ignore it.
So, by the recommendation, CURLOPT_NOSIGNAL should probably be set to 1, as anything linking sentry-native becomes multithreaded with sentry-native spawning a background thread. The remaining problem is that the main thread could then receive SIGPIPE and, if not handled, it would crash the app.
Investigating a bit further led me to the AsynchDNS feature of curl/libcurl, which can be queried with `curl --version`. It seems to be enabled by default and enables the threaded resolver or, if available, the c-ares resolver. This goes hand in hand with the other recommendation that multithreaded apps should use libcurl with c-ares or threaded-resolver support, and it would resolve the remaining problem, as libcurl seems to be built with the threaded resolver by default.
Thanks, @zxey. It was only a matter of time before a use case raised this issue with more elaborate signal-handling requirements. You can certainly prepare a PR for this. Our signal handlers don't have to run in any background workers by design, but we might need to discuss/evaluate the consequences in the scenario where the handlers of a crash backend are active. As a first step, I could imagine providing a compile-time flag that is only considered when SENTRY_BACKEND=none.
But the issues mentioned in the comments are not entirely overlapping and should be handled separately or only where they interact. Specifically, we have to consider how this interacts with CURLOPT_NOSIGNAL. While I also think that this is a bug in the current configuration and should probably be set to 1 (thanks, @past-due), if we disable the worker threads for signal-handler invocation (as @zxey proposed), then I don't see how the described race would even be possible.
We currently do not control the build of libcurl used by our users as a dependency, so I can only imagine providing compile-time flags (or checking for libcurl features in CMake if that is enough) for users who know what they are doing, so that our current and future implementation doesn't interfere with the respective resolvers.
Sorry, I have been sick for the last two weeks. I also had a deeper look into the topic and came to the following insights:

- I could not find a libcurl package that did not have the AsynchDNS feature enabled in any of the last 3 major releases of Fedora, Debian, or Ubuntu.
- The same is true for macOS (10.15.7), where the bundled curl is at version 7.64.1 (release date 2019-03-27), which also has the feature enabled. This release date also coincides with the first commits in this repository.
- On Windows, curl is bundled. This is irrelevant to the Native SDK since we do not use libcurl as a transport dependency on Windows, but it still supports my case ;-)
- AsynchDNS as a feature has been in libcurl since version 7.10.7, which dates back to August 2003.
- I doubt sentry-native was ever developed or tested against a libcurl in which AsynchDNS was not enabled.
- When AsynchDNS is enabled in libcurl, CURLOPT_NOSIGNAL has no effect because signal handling is no longer used (neither signal-handler setup nor signal raise, for both SIGPIPE and SIGALRM).

This leads me to conclude that it doesn't make much sense to invest time configuring CURLOPT_NOSIGNAL or handling any potentially escaping signals due to that configuration.
It also means there is no bug in the current transport implementation, except for a vanishingly small portion of the libcurl install base (though I would be interested in concrete counter-examples).
This also answers why we never got any bug reports where people see the SIGPIPE/SIGALRM signals (both have Term disposition) terminating their programs or interfering with any of the currently installed signal handlers.
The only immediately sensible action would be to require the AsynchDNS feature both at build- and runtime, to be more explicit about the currently implicit assumption in our implementation. I will create a PR for this.
Regarding the initial suggestion from @rhzx86: this mostly makes sense as a build-time configuration where the default should be to not mask the signals, because masking would also affect thread-directed signals (like SIGSEGV). This means that any thread-directed signal with Term disposition caused by one of our threads or its dependencies would terminate the program, but the SDK would not report the crash to the backend. That is not a good default since it would surprise most users.
CC: @bitsandfoxes, @markushi, @Swatinem
Description
In our application we use epoll to manage our main application loop, and we also let it handle a plethora of Linux signals instead of using signal handlers. To handle a Linux signal with epoll, you need to block the signal and pass it to signalfd, which is in turn passed to epoll to check whether a signal is waiting to be handled. The signals need to be blocked in the main thread so they do not crash our application, as they are not handled the traditional way (using signal handlers).

The problem arises when there is another thread running which does not have the signals blocked: when the Linux kernel delivers a blocked signal to our app, it sees that the main thread has the signal blocked and thus chooses to deliver it to any thread that does not have it blocked, and if the signal is not handled in that specific thread, the whole app crashes.
When using the curl transport backend provided by the sentry-native library, sentry_init creates a bgworker thread for sending any events to sentry.io with libcurl, and this thread does not have any signals blocked, meaning the Linux kernel may decide to deliver a signal to that thread, crashing our app.
When does the problem happen
Environment
We do not use any crash handling backend (SENTRY_BACKEND=none) yet and just use the event-reporting facilities of sentry-native.
Steps To Reproduce
As providing the steps to reproduce are quite involved and harder to provide than the actual solution to this problem I think it is best to demonstrate the possible solution that can be implemented in
sentry-native
to mitigate this problem.This is a possible implementation of
sentry__thread_spawn
that would mitigate the issue described above:Workaround Steps
Workaround Steps
For anyone who runs into this issue, the workaround is the same as the solution described above, but applied around sentry_init() instead of pthread_create(): it is basically the same solution one level higher (you block all signals, call sentry_init(), all threads created by sentry_init() inherit the mask, and then you restore the original mask).
Log output
Log output should not be relevant, but here it is for completeness.
Output
``` [sentry] INFO using database path "/some/path/.sentry-native" [sentry] DEBUG starting transport [sentry] DEBUG starting background worker thread [sentry] DEBUG processing and pruning old runs [sentry] DEBUG background worker thread started [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG executing task on worker thread [sentry] DEBUG merging scope into event [sentry] DEBUG trying to read modules from /proc/self/maps [sentry] DEBUG read 39 modules from /proc/self/maps [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG merging scope into event [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG merging scope into event [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG merging scope into event [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG merging scope into event [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread * Trying 34.120.195.249:443... * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * ALPN, offering h2 * ALPN, offering http/1.1 * CAfile: /etc/ssl/certs/ca-certificates.crt * CApath: /etc/ssl/certs * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 * ALPN, server accepted to use h2 * Server certificate: * subject: CN=*.ingest.sentry.io * start date: Dec 18 01:07:14 2022 GMT * expire date: Mar 18 01:07:13 2023 GMT * subjectAltName: host "o504248.ingest.sentry.io" matched cert's "*.ingest.sentry.io" * issuer: C=US; O=Let's Encrypt; CN=R3 * SSL certificate verify ok. 
* Using HTTP2, server supports multiplexing * Connection state changed (HTTP/2 confirmed) * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0 * Using Stream ID: 1 (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:313 * We are completely uploaded and fine * old SSL session ID is stale, removing < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 2 < access-control-expose-headers: x-sentry-rate-limits, x-sentry-error, retry-after < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG executing task on worker thread * Found bundle for host o504248.ingest.sentry.io: 0x7efda8007c00 [can multiplex] * 17 bytes stray data read before trying h2 connection * Re-using existing connection! 
(#0) with host o504248.ingest.sentry.io * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * Using Stream ID: 3 (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:9816 * We are completely uploaded and fine < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 41 < access-control-expose-headers: retry-after, x-sentry-error, x-sentry-rate-limits < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {"id":"cd06982692ce48e5cc93cddb5c946a25"}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG executing task on worker thread * Found bundle for host o504248.ingest.sentry.io: 0x7efda8007c00 [can multiplex] * Re-using existing connection! 
(#0) with host o504248.ingest.sentry.io * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * Using Stream ID: 5 (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:9798 * We are completely uploaded and fine < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 41 < access-control-expose-headers: x-sentry-rate-limits, retry-after, x-sentry-error < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {"id":"4080458c40e442cf2fb6b6feb1ba05a2"}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG executing task on worker thread * Found bundle for host o504248.ingest.sentry.io: 0x7efda8007c00 [can multiplex] * Re-using existing connection! 
(#0) with host o504248.ingest.sentry.io * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * Using Stream ID: 7 (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:9811 * We are completely uploaded and fine < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 41 < access-control-expose-headers: x-sentry-rate-limits, x-sentry-error, retry-after < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {"id":"2a0d91f0097a43030d1244d7a96108f9"}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG executing task on worker thread * Found bundle for host o504248.ingest.sentry.io: 0x7efda8007c00 [can multiplex] * 17 bytes stray data read before trying h2 connection * Re-using existing connection! 
(#0) with host o504248.ingest.sentry.io * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * Using Stream ID: 9 (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:9798 * We are completely uploaded and fine < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 41 < access-control-expose-headers: x-sentry-rate-limits, retry-after, x-sentry-error < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {"id":"380d35d9849246b4e7128688e8d0b2bc"}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG executing task on worker thread * Found bundle for host o504248.ingest.sentry.io: 0x7efda8007c00 [can multiplex] * 17 bytes stray data read before trying h2 connection * Re-using existing connection! 
(#0) with host o504248.ingest.sentry.io * Connected to o504248.ingest.sentry.io (34.120.195.249) port 443 (#0) * Using Stream ID: b (easy handle 0x56293389aab0) > POST /api/5591753/envelope/ HTTP/2 Host: o504248.ingest.sentry.io user-agent: sentry.native/0.5.4 accept: */* x-sentry-auth:Sentry sentry_key=6ae9408c92aa4fde82862d32ac9deba5, sentry_version=7, sentry_client=sentry.native/0.5.4 content-type:application/x-sentry-envelope content-length:9811 * We are completely uploaded and fine < HTTP/2 200 < server: nginx < date: Thu, 09 Feb 2023 14:25:53 GMT < content-type: application/json < content-length: 41 < access-control-expose-headers: x-sentry-error, x-sentry-rate-limits, retry-after < vary: Origin < x-envoy-upstream-service-time: 0 < strict-transport-security: max-age=31536000; includeSubDomains; preload < via: 1.1 google < alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000 < {"id":"b77aa15dda92437e990f67e24aea5563"}* Connection #0 to host o504248.ingest.sentry.io left intact [sentry] DEBUG merging scope into event [sentry] DEBUG adding attachments to envelope [sentry] DEBUG sending envelope [sentry] DEBUG submitting task to background worker thread [sentry] DEBUG executing task on worker thread * 17 bytes stray data read before trying h2 connection # THIS is where we send signal to our linux application that is expected in mainloop, but received by sentry bgworker thread, crashing the app 15:25:56: app crashed. ```
If desired, we can provide a corresponding PR fixing this issue. Would you be open to such contributions? Do you think this is a good solution to our problem, and/or could it be solved in a different way?