OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org
Other
1.23k stars 571 forks source link

[CRASH] free_contacts and _unref_dlg segfault #2184

Closed Fernandojdk closed 3 years ago

Fernandojdk commented 4 years ago

OpenSIPS version you are running

version: opensips 3.1.0 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
main.c compiled on 16:09:51 Jul 22 2020 with gcc 4.8.5

Installed via CentOS rpms - Latest version

Crash Core Dump https://pastebin.com/YtDVEmUE (core for error in free_contacts) https://pastebin.com/g1f8zAs2 (core generated together with error on _unref_dlg)

Describe the traffic that generated the bug Its a production server. The crash happens randomly several times a day. It only happens when it has high traffic and opensips load-all reaches >= 90%. More specifically, the crash occurs when reaches 200 CPS (calls per second) or more Whenever the crash happens it is in the same function: free_contacts Every time the crash occurs, two coredump files of same size are generated. One core dump have segfault on tm free_contacts function and the second core dump have segfault on dialogs _unref_dlg function. I posted a link of bt full for each core dump.

Opensips have a lot of free SHM and a lot of free PKG Mem.

To Reproduce

It is not possible to reproduce because it only happens with high traffic.

Relevant System Logs

Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:handle_sigs: child process 26137 exited by a signal 11
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:handle_sigs: core was generated
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:handle_sigs: terminating due to SIGCHLD
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26134]: INFO:core:sig_usr: signal 15 received
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26132]: INFO:core:sig_usr: signal 15 received
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26131]: INFO:core:sig_usr: signal 15 received
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26133]: INFO:core:sig_usr: signal 15 received
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26159]: WARNING:core:handle_timer_job: utimer job <tm-utimer> has a 280000 us delay in execution
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 4(26134) [timer] terminated, still waiting for 41 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26144]: WARNING:core:handle_timer_job: utimer job <tm-utimer> has a 280000 us delay in execution
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 2(26132) [MI FIFO] terminated, still waiting for 40 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 1(26131) [HTTPD 127.0.0.1:8080] terminated, still waiting for 39 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 3(26133) [time_keeper] terminated, still waiting for 38 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 12(26142) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 37 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 26(26156) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 36 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 16(26146) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 35 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 13(26143) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 34 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 36(26166) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 33 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 6(26136) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 32 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 27(26157) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 31 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 28(26158) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 30 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 10(26140) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 29 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 17(26147) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 28 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 9(26139) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 27 more
Jul 24 15:33:54 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 32(26162) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 26 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 34(26164) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 25 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 18(26148) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 24 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 40(26170) [SIP receiver udp:177.21.244.136:5080] terminated, still waiting for 23 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26149]: ERROR:dialog:fix_route_dialog: Script error - validate function before having a dialog
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26149]: ERROR:topology_hiding:topo_dlg_onroute: Failed to fix the SIP request according to dialog info
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 30(26160) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 22 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 39(26169) [SIP receiver udp:177.21.244.136:5080] terminated, still waiting for 21 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 25(26155) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 20 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26149]: ERROR:dialog:fix_route_dialog: Script error - validate function before having a dialog
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26149]: ERROR:topology_hiding:topo_dlg_onroute: Failed to fix the SIP request according to dialog info
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 35(26165) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 19 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 38(26168) [SIP receiver udp:177.21.244.136:5080] terminated, still waiting for 18 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 41(26171) [SIP receiver udp:177.21.244.136:5080] terminated, still waiting for 17 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 31(26161) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 16 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 37(26167) [SIP receiver udp:177.21.244.136:5080] terminated, still waiting for 15 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 20(26150) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 14 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 15(26145) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 13 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 11(26141) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 12 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 43(26173) [Timer handler] terminated, still waiting for 11 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 23(26153) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 10 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 19(26149) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 9 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 8(26138) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 8 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 42(26172) [SIP receiver udp:177.21.244.136:5080] terminated, still waiting for 7 more
Jul 24 15:33:55 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 29(26159) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 6 more
Jul 24 15:33:56 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 14(26144) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 5 more
Jul 24 15:33:56 sipproxy /usr/sbin/opensips[26129]: INFO:core:shutdown_opensips: process 5(26135) [SIP receiver udp:177.21.244.136:5060] terminated, still waiting for 4 more
Jul 24 15:33:59 sipproxy /usr/sbin/opensips[26129]: INFO:core:cleanup: cleanup
Jul 24 15:33:59 sipproxy /usr/sbin/opensips[26129]: CRITICAL:core:sig_usr: segfault in attendant (starter) process!

OS/environment information

Additional context

I use dialog + topology_hiding (force_dialog = 1) In branch_route I use uac_replace_from and uac_replace_to I use nathelper I'm using rtpengine for some calls. In branch_route i do nat_uac_test(1) and fix_nated_contact() before calling rtpngine_offer For some calls, i'm using SST Just proto_udp is loaded Opensips udp_workers is 32

Fernandojdk commented 4 years ago

If helps, this is output of dmesg:

[101724.308214] traps: opensips[8325] general protection ip:7f2b11444079 sp:7ffc9205d150 error:0 in libc-2.17.so[7f2b113f7000+1c3000]
[101735.792254] opensips[8294]: segfault at 8 ip 00007f2abb8d4802 sp 00007ffc9205e7e0 error 4 in dialog.so[7f2abb8a5000+7d000]
[108871.033118] opensips[25832]: segfault at 6a ip 00000000005362b8 sp 00007ffd6653b2d0 error 4 in opensips[400000+1fc000]
[109361.785576] opensips[12659]: segfault at 60 ip 00000000005362b8 sp 00007ffce7a77910 error 4 in opensips[400000+1fc000]
[109376.395498] opensips[12632]: segfault at 8 ip 00007f3337834802 sp 00007ffce7a77c60 error 4 in dialog.so[7f3337805000+7d000]
[113014.864189] opensips[16441]: segfault at 8 ip 00007f0af8f94802 sp 00007ffda2f6fbd0 error 4 in dialog.so[7f0af8f65000+7d000]
[113168.666018] opensips[7798]: segfault at 2d ip 00007f3f2d6c3079 sp 00007ffe07d171c0 error 4 in libc-2.17.so[7f3f2d676000+1c3000]
[113181.951784] opensips[7776]: segfault at 8 ip 00007f3ed7b53802 sp 00007ffe07d17d30 error 4 in dialog.so[7f3ed7b24000+7d000]
[114289.520314] traps: opensips[9156] general protection ip:5362b8 sp:7ffeadf63630 error:0 in opensips[400000+1fc000]
[114304.493978] opensips[9114]: segfault at 8 ip 00007f3654bff802 sp 00007ffeadf63980 error 4 in dialog.so[7f3654bd0000+7d000]
[114562.636829] traps: opensips[16966] general protection ip:5362b8 sp:7ffe2a4e62f0 error:0 in opensips[400000+1fc000]
[114577.568572] opensips[16955]: segfault at 8 ip 00007f214d6a0802 sp 00007ffe2a4e6640 error 4 in dialog.so[7f214d671000+7d000]
[115892.560088] opensips[18215]: segfault at 8 ip 00007f08e5a8d802 sp 00007ffee035f7d0 error 4 in dialog.so[7f08e5a5e000+7d000]
[116393.926393] traps: opensips[26137] general protection ip:5362b8 sp:7ffebafd10f0 error:0 in opensips[400000+1fc000]
[116412.988123] opensips[26129]: segfault at 8 ip 00007f773acde802 sp 00007ffebafd1440 error 4 in dialog.so[7f773acaf000+7d000]
[117852.971750] traps: opensips[28657] general protection ip:5362b8 sp:7fff6e4df780 error:0 in opensips[400000+1fc000]
[117869.847650] opensips[28627]: segfault at 8 ip 00007f696198e802 sp 00007fff6e4dfad0 error 4 in dialog.so[7f696195f000+7d000]
[118512.482171] traps: opensips[5905] general protection ip:4e1e78 sp:7ffef66711e0 error:0 in opensips[400000+1fc000]
[118532.311949] opensips[5884]: segfault at 8 ip 00007fa2cf982802 sp 00007ffef66724e0 error 4 in dialog.so[7fa2cf953000+7d000]
Fernandojdk commented 3 years ago

@bogdan-iancu Any instruction to deal with this issue? This is happening a lot at day, on different production environment and different hardware. I tried to look at source code to see who is trying to free a contact_t on the same time but with no success. If you give some direction what to look, i'll try to fix this and make a PR.

Thanks in advance.

bogdan-iancu commented 3 years ago

This is similar to #2095 and I'm chasing a possible race between a 200 OK and Cancel, race that may lead to a corruption of the SIP request cloned into shm.

I will close this one as duplicate and continue on 2095 as older - please monitor that one