Open meengu opened 6 months ago
I've attached gdb
to a few worker processes and did:
p (unsigned long long)ERR_get_state()
And it returns the same error struct pointer value for each process. This time I tried it with OpenSSL 3.3.0 9 Apr 2024 (Library: OpenSSL 3.3.0 9 Apr 2024)
Also related https://github.com/OpenSIPS/opensips/issues/3284
Which leads me to an idea that this is a race condition.
My observations tend to agree with this. In my testing, I've found that increasing logging to 4 or 5 makes it less likely to crash than if logging is set to -1 (my default).
My related issue: https://github.com/OpenSIPS/opensips/issues/3280
I solved it with the following patch for now: https://megous.com/dl/tmp/0001-Fix-openssl-TLS-data-corruption-in-shared-memory-by-.patch
It's a bit less hacky and conceptually simpler than the solution in kamailio [1] because it uses explicit re-initialization of the shared data based on pointer values. And I did not put much thought into openssl version check, because we use just two openssl versions, but so far no issues.
[1] https://github.com/kamailio/kamailio/commit/77de86909ac8c156335e1e789bb3067932f9eff5
Thanks for the patch and the link; I wonder how this would affect my issue since I'm using OpenSSL 3.0.2 and the patch is written for lower versions.
Unfortunately we're not currently compiling from source but use the OpenSIPs apt, it'll take several days to retool for this and test the patch but given the lack of response on this issue (I raised my ticket in January) we may be forced to go that route imminently.
Thanks for the patch and the link; I wonder how this would affect my issue since I'm using OpenSSL 3.0.2 and the patch is written for lower versions.
What makes you think so? I tested it with 3.3.0 and 1.1.1w. I have no idea about 3.0.2.
My mistake. I glanced at the patch and saw the version number check, but didn't scroll to see the else statement.
As for the reason I'm using OpenSSL 3.0.x, that's just the current version in our base OS which is Ubuntu 22.04.
Have you tested the 3.5.x beta at all? I wonder if any progress on this issue has been made there?
I didn't notice anything that would help this in 3.5
That's a shame. I'm going to start tooling up to build from source and take a look at applying your patch. Thanks for the info!
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Yes, the analysis and a proposed patch is available above. This issue is not stale.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Yes, the analysis and a proposed patch is available above. This issue is not stale.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Yes, the analysis and a proposed patch is available above. This issue is not stale.
I will not be keeping the stale bot at bay, so that it doesn't kill the github issue until someone applies the proposed patch. I'm unsubscribing from this issue.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.
OpenSIPS version you are running
Crash Core Dump
What is interesting is that the crash is in
err_clear()
:But the code execution should not get there because printing the content of
*es
returns thattop==bottom
:Which leads me to an idea that this is a race condition.
Looks like that opensips first initializes modules (including tls_openssl/OpenSSL in general) and then forks TCP worker processes. Initialization of modules will make openssl create ERR_STATE struct in shared memory, and this will get then shared by all workers, sometimes leading to race conditions due to absent locking when accessing the now shared struct. I don't think OPENSSL expects this data to be shared among threads/processes. Manpage of ERR_pop_to_mark indicates that it's normally a per-thread data structure.
I found a similar issue here: https://github.com/kamailio/kamailio/issues/3695
Due to this being a race condition it only happens in specific usage scenarios (either high load, or in my case due to issuing several HTTP requests via rest_client in parallel from multiple workers - push notification being sent to multiple mobile devices at once)