OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org

[CRASH] OpenSips crash on November 23, 2020 #2321

Closed adamoverbeeke closed 2 years ago

adamoverbeeke commented 3 years ago

OpenSIPS version you are running

version: opensips 2.4.8 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: a88a57479
main.c compiled on 21:17:43 Nov 17 2020 with gcc 7

Crash Core Dump https://pastebin.com/bHeUMqDN

Describe the traffic that generated the bug

Normal SIP traffic. A mix of protocols including UDP/TCP and TLS

To Reproduce

So far one crash. 

Relevant System Logs

Log output starting about 1 minute prior to the crash:

"1606131901000","11/23/2020 06:45:01.000 -0500","Nov 23 11:45:01 [7076] ERROR:mi_datagram:mi_datagram_server: command not available
"1606131901000","11/23/2020 06:45:01.000 -0500","Nov 23 11:45:01 pcv-sipproxy-forward-10-72-4-46 rotateFlatstore: Received error: 500
"1606131965000","11/23/2020 06:46:05.000 -0500","Nov 23 11:46:05 [7072] INFO:core:handle_sigs: child process 7098 exited by a signal 11
"1606131965000","11/23/2020 06:46:05.000 -0500","Nov 23 11:46:05 [7072] ERROR:proto_tls:tls_conn_shutdown: something wrong in SSL: 1, 9, Bad file descriptor
"1606131965000","11/23/2020 06:46:05.000 -0500","Nov 23 11:46:05 [7072] ERROR:proto_tls:tls_print_errstack: TLS errstack: error:140E0197:lib(20):func(224):reason(407)
"1606132149000","11/23/2020 06:49:09.000 -0500","Nov 23 11:49:09 pcv-sipproxy-forward-10-72-4-46 systemd: opensips.service: main process exited, code=exited, status=1/FAILURE
"1606132180000","11/23/2020 06:49:40.000 -0500","Nov 23 11:49:40 [21337] ERROR:core:udp_init_listener: bind(4f, 0x7f88dae6f344, 16) on 10.72.11.204: Cannot assign requested address
"1606132180000","11/23/2020 06:49:40.000 -0500","Nov 23 11:49:40 [21337] ERROR:core:trans_init_all_listeners: failed to init listener [10.72.11.204], proto udp
"1606132180000","11/23/2020 06:49:40.000 -0500","Nov 23 11:49:40 [21337] ERROR:core:main: failed to init all SIP listeners, aborting
"1606132180000","11/23/2020 06:49:40.000 -0500","Nov 23 11:49:40 pcv-sipproxy-forward-10-72-4-46 systemd: opensips.service: main process exited, code=exited, status=255

OS/environment information amazon_linux 2

Additional context

rvlad-patrascu commented 3 years ago

Hi @adamoverbeeke ,

Has the crash happened anymore since the first occurrence? What openssl version do you use?

Also, please open the core file with gdb and do:

f 4
p c->extra_data

bcnewlin commented 3 years ago

@rvlad-patrascu This crash did just occur again this morning.

# opensips -V
version: opensips 2.4.8 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: a88a57479
main.c compiled on 01:39:42 Dec  5 2020 with gcc 7

Here is the new backtrace with the added info you requested: https://pastebin.com/L0S8bjSf

rvlad-patrascu commented 3 years ago

Hi @bcnewlin ,

Unfortunately, I still cannot identify at this point the source of the crash from the info in the backtrace. Can you perhaps try to compile with QM_MALLOC and DBG_MALLOC ?

bcnewlin commented 3 years ago

@rvlad-patrascu This is not the first time we've had a crash that could not be solved without compiling with the debug flags; however, we are very hesitant to do so. These crashes are only occurring in our production environments; we cannot reproduce them. So this means we must run with debug enabled in all of our production. This is a decent performance hit, not to mention any other unintended consequences of the extra debug processing. And since it is a compile flag we cannot just disable it on-demand if we find any issues.

This seems like a pretty big hole in OpenSIPS' troubleshooting capabilities to always need these compile flags to find crash causes. Has there ever been any discussion or effort to change that?

Having said all that, we will investigate whether we can accept the risk of enabling the debug flags in production.

rvlad-patrascu commented 3 years ago

@bcnewlin Starting with OpenSIPS 3.0 you can in fact build with support for all the allocators and their debugging flavor and then select one of them via command line options at startup (-a or -k and -s).
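
A minimal sketch of what that startup selection could look like. The `-k` and `-s` options are the ones named above. The allocator names (`F_MALLOC`, `Q_MALLOC_DBG`), the mapping of `-k` to private memory and `-s` to shared memory, and the config path are assumptions here, so check them against `opensips -h` on your build before relying on them.

```shell
# Assumed allocator names; confirm the accepted values with `opensips -h`.
PKG_ALLOC=F_MALLOC            # private (per-process) memory: regular allocator
SHM_ALLOC=Q_MALLOC_DBG        # shared memory: QM allocator, debugging flavor
CFG=/etc/opensips/opensips.cfg  # hypothetical config path

# Assumed mapping: -k = private-memory allocator, -s = shared-memory allocator
# (-a would select a single allocator for both).
start_cmd="opensips -f $CFG -k $PKG_ALLOC -s $SHM_ALLOC"

# Print the command for review instead of launching the proxy from this sketch.
echo "$start_cmd"
```

The point of the design is that the debugging allocator becomes a restart-time choice rather than a rebuild, so it can be switched off again without a new binary.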

bcnewlin commented 3 years ago

@rvlad-patrascu That is pretty awesome and yet another reason we need to get on with upgrading! :)

adamoverbeeke commented 3 years ago

https://pastebin.com/eEp49Cgi

-- This version has removed https client calls via a http --> https proxy.
-- This version has tls enabled

$ opensips -V
version: opensips 2.4.9 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 3bed2f312
main.c compiled on 19:38:07 Jan 26 2021 with gcc 7

adamoverbeeke commented 3 years ago

We are going to try compiling with the flags mentioned above. QM_MALLOC and DBG_MALLOC.

bogdan-iancu commented 3 years ago

Same here, first do the upgrade to latest 2.4 from GIT and re-test !

bcnewlin commented 3 years ago

@bogdan-iancu Unfortunately we have already updated and the crashes are still occurring, though I am not sure whether they are these crashes or the ones in #2216 . We are still investigating.

bogdan-iancu commented 3 years ago

OK, what revision are you testing with ? (opensips -V)

bcnewlin commented 3 years ago

We've just upgraded to the latest commit. I am monitoring for a crash recurrence.

$ opensips -V
version: opensips 2.4.9 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 078887fef
main.c compiled on 15:07:57 Apr 22 2021 with gcc 7

bcnewlin commented 3 years ago

We've had multiple recurrences of this crash on the latest OpenSIPS version.

Backtrace: https://pastebin.com/xKcvjpxT

# opensips -V
version: opensips 2.4.9 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 078887fef
main.c compiled on 15:03:48 Apr 22 2021 with gcc 7

adamoverbeeke commented 3 years ago

We had a few more crashes. I've enabled QM_MALLOC and DBG_MALLOC and performed

(gdb) f 4
(gdb) p c->extra_data

Backtrace: https://pastebin.com/4wWNxczD
Backtrace: https://pastebin.com/VXPSxk6y
Backtrace: https://pastebin.com/Yq79LNaq

#opensips -V
version: opensips 2.4.9 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, SHM_MMAP, PKG_MALLOC, QM_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 078887fef
main.c compiled on 14:20:17 Apr 30 2021 with gcc 7

adamoverbeeke commented 3 years ago

Another one today. I enabled QM_MALLOC, DBG_MALLOC and increased the log level to 4.

Backtrace: https://pastebin.com/K5EUb4J4
Backtrace: https://pastebin.com/q1YwJJ6U

Logs from before the core was generated: https://pastebin.com/ZHZp71af

version: opensips 2.4.9 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, SHM_MMAP, PKG_MALLOC, QM_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 078887fef
main.c compiled on 21:12:54 Apr 30 2021 with gcc 7

adamoverbeeke commented 3 years ago

Any update on this?

bcnewlin commented 3 years ago

Anyone have any time to investigate this? We're seeing the crashes quite frequently. We can get more information if needed.

rvlad-patrascu commented 3 years ago

Hi @bcnewlin ,

It is still difficult to draw any conclusions from the backtraces, I'm trying to come up with a strategy for digging up more useful information. But there is a hint about some potentially problematic code in openssl 1.0.2. Can you confirm that this is the version you are using, and if so, try to upgrade to at least 1.1.0 ?

bcnewlin commented 3 years ago

@rvlad-patrascu I can test this, but it will not be a workable production solution as we require FIPS support which is only provided by openssl 1.0.2.

rvlad-patrascu commented 3 years ago

We do intend to come up with the proper fix for 1.0.2 but nevertheless it will help narrow down the issue if you manage to test with openssl >= 1.1.0.

bcnewlin commented 3 years ago

@rvlad-patrascu Understood. I'm getting a build together now to test against the latest OpenSIPS 2.4 and with openSSL 1.1.1. The crash has been very frequent in our testing environments so I hope to have some results next week.

bcnewlin commented 3 years ago

@rvlad-patrascu We have not had any success getting a build testing a newer version of openSSL. We use Amazon Linux 2 which has their own distros and replacing the built-in openSSL version it uses is proving difficult.

Have you thought of any other information that would assist? Have we already provided everything useful from the DBG_MALLOC reproductions?

rvlad-patrascu commented 3 years ago

Hi @bcnewlin ,

I don't think we can extract any other useful info provided by the DBG_MALLOC flag. At this point, the best lead we have is related to the openssl library.

Can you perhaps try to install openssl from sources? If you manage to compile the library with the -DOPENSSL_NO_BUF_FREELISTS option (for the ./config script), it would eliminate the need for a newer version.
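
A rough sketch of such a source build, assuming an unpacked OpenSSL 1.0.2 tree and a hypothetical install prefix chosen so the system library is left untouched. The `--prefix` and `shared` options and the `openssl version -f` check are additions for illustration and were not part of the suggestion above.

```shell
# Hypothetical install prefix, kept separate from the distribution's OpenSSL.
PREFIX=/opt/openssl-1.0.2-nofreelist

build_openssl() {
    # Run from the top of an unpacked OpenSSL 1.0.2 source tree.
    # The -D define is passed through ./config into the compiler flags.
    ./config --prefix="$PREFIX" shared -DOPENSSL_NO_BUF_FREELISTS &&
    make depend &&
    make &&
    make install
}

verify_openssl() {
    # `openssl version -f` prints the compiler flags used for the build;
    # succeed only if the define shows up there.
    "$PREFIX/bin/openssl" version -f | grep -q OPENSSL_NO_BUF_FREELISTS
}
```

Call `build_openssl` inside the source directory, then `verify_openssl` to confirm the define made it into the build. OpenSIPS would then have to be compiled and linked against the library under `$PREFIX` for the test to be meaningful.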

bcnewlin commented 3 years ago

@rvlad-patrascu We can compile from the openSSL sources, however Amazon Linux 2 uses its own version of openSSL which is pre-compiled and sources are not available as far as I know. Is this suggestion intended for troubleshooting/debugging or is it being suggested as a final fix?

rvlad-patrascu commented 3 years ago

@bcnewlin so you are saying that you cannot use your own openssl build on Amazon Linux 2?

Initially it seemed that we could do the fix in opensips, so the openssl upgrade/custom build suggestion was intended only for confirming the suspicions regarding the cause of the crash. But after digging more into this we realized that it's not really possible so we probably need to go via the openssl route.

bcnewlin commented 3 years ago

@rvlad-patrascu We are trying to get it to work now, but we cannot just replace the AWS version because there are other dependencies on it. Additionally, we have a need for FIPS support and using our own version will likely not achieve that certification.

If you have identified a bug in openSSL we may be able to engage AWS to fix it, as they are still supporting openSSL 1.0.2.

rvlad-patrascu commented 3 years ago

As with the issues we've had before with openssl, this is probably not a bug in the library per se, but rather an incompatibility with the opensips multi-process and SSL context sharing model. And openSSL made it clear that it is not their intention to support such usage.

Nevertheless, we are not 100% sure yet, so let's see what the results are when testing with the build option suggested above. I also forgot to mention that you should add the -d option to include debugging symbols for openssl, in case we still get crashes.

bcnewlin commented 3 years ago

That flag is also added to the config step? I'm having some trouble verifying that it is being passed through. Also, should we continue to run with the OpenSIPS debug flags set in case of crashes?

rvlad-patrascu commented 3 years ago

Yes, -d is an option for openssl's config script. And opensips should be compiled with debug symbols too. The idea is to get backtraces with complete file/func/line info, including the openssl code, in case of crashes.
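
One way to capture such a backtrace non-interactively is sketched below. The binary and core paths are hypothetical, and `frame 4` with `c->extra_data` are simply the commands requested earlier in this thread; the right frame number depends on the particular backtrace.

```shell
# Hypothetical paths; point these at your own binary and core file.
OPENSIPS_BIN=/usr/sbin/opensips
CORE_FILE=/var/crash/core.opensips

dump_backtrace() {
    # $1 = opensips binary, $2 = core file
    # -batch makes gdb exit after running the -ex commands in order.
    gdb -batch \
        -ex "bt full" \
        -ex "frame 4" \
        -ex "print c->extra_data" \
        "$1" "$2"
}

# Only run when gdb and a readable core file are present.
if command -v gdb >/dev/null 2>&1 && [ -r "$CORE_FILE" ]; then
    dump_backtrace "$OPENSIPS_BIN" "$CORE_FILE" > backtrace.txt 2>&1
else
    echo "gdb or $CORE_FILE not available; nothing to do"
fi
```

With debug symbols in both opensips and openssl, the resulting `backtrace.txt` should carry file, function, and line information for the frames inside the library as well.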

bcnewlin commented 2 years ago

@rvlad-patrascu Apologies for the lack of updates on this. I was able to get OpenSIPS working with a compiled openssl version including the flags you requested, however the crashes are no longer occurring on any version. I am guessing that some change has slightly modified timing and alleviated the crash. We are continuing to monitor for recurrences, but at this time are planning to expedite our move to 3.2 and switch to wolfSSL.

rvlad-patrascu commented 2 years ago

Hi @bcnewlin

Indeed, these kinds of crashes can be highly dependent on memory allocation patterns, races when acquiring locks, etc., so you might avoid it for some time while the root cause is still there. But let's see how it goes with that openssl version with the custom flags.

And since you are planning to upgrade to 3.2, I think we can close this ticket for now, and you can reopen it if the crashes happen again.