OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org
Other
1.28k stars 581 forks source link

[BUG] race condition between CANCEL / 180 Ringing #1759

Closed tommybrecher closed 5 years ago

tommybrecher commented 5 years ago

OpenSIPS version you are running

version: opensips 2.3.3 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
git revision: 4a53081
main.c compiled on 19:34:03 Jul  2 2019 with gcc 4.9.2

Describe the bug

- During a call setup, a callee sends a 180 ringing simultaneously to the caller sending a CANCEL. - There's a race condition of multiple processes attempting to write the cancel buffer. `t->uac[branch].local_cancel.buffer` - While the process handling the callee's 180 proceeds to cancel_branch() to generate a local cancel, the process handling the callers CANCEL is at should_cancel_branch(). - In some cases `t->uac[b].local_cancel.buffer.s==NULL @t_cancel.h:63` evaluates True but it's not when hitting: `t->uac[b].local_cancel.buffer.s=BUSY_BUFFER` - In most cases this would only cause a memory leak, under some circumstances, when the timing is right if the process handling the calleer's side haven't written into the buffer yet after setting it to BUSY_BUFFER and the process handling the callees side's buffer was overwritten a SEGFAULT occurs. Please see gdb output below while I was reproducing this: ``` sensitive information replaced from the logs breakpoint: break t_cancel.h:67 if t->uac[b].local_cancel.buffer.s != NULL +--t_cancel.h-------------------------------------------------------------------------- |63 if ( t->uac[b].local_cancel.buffer.s==NULL ) { |64 if ( last_received>=100 && last_received<200 ) { |65 /* we'll cancel -- label it so that no one else |66 (e.g. another 200 branch) will try to do the same */ B+>|67 t->uac[b].local_cancel.buffer.s=BUSY_BUFFER; |68 return 1; |69 } else if (last_received==0) { |70 /* set flag to catch the delaied replies */ |71 t->uac[b].flags |= T_UAC_TO_CANCEL_FLAG; |72 } +-------------------------------------------------------------------------------------- (gdb) p t->uac[b].local_cancel.buffer.s $2693 = 0x7fff6a0099c0 "CANCEL sip:402@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.com:5060;transport=UDP SIP/2.0\r\nVia: SIP/2.0/UDP 34.224.106.98:5060;branch=z9hG4bK41e1.cc7c9522.0\r\nFrom:;tag=qWiV`ImJn1-X*\r\nCall-ID: Bxh*vKs?-0*~Xn)sN@x]VBlI?jB3hjm5BgjjZJzY\r\nTo: \r\nCSeq: 690457098 CANCEL\r\nMax-Forwards: 70\r\nReason: SIP;cause=480;text=\"NO_ANSWER\"\r (gdb) next (gdb) p t->uac[b].local_cancel.buffer.s $2699 = 0xffffffffffffffff (gdb) p t->uac[b].local_cancel.buffer.len // Length is still good $2700 = 508 (gdb) continue Continuing. Program received signal SIGSEGV, Segmentation fault. // Segmentation fault strnlen () at ../sysdeps/x86_64/strlen.S:137 ``` Multiple production crashes shows the same in all core-dumps (always callee side) ``` (gdb) bt #0 0x00007f3afa742040 in __strcasecmp_l_sse2 () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f39ea24a42c in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0 #2 0x00007f39ea24a5d4 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0 #3 0x00007f39ea49685e in tls_write (fd=, poll_events=0x7fffe952ad14, len=382, buf=0xffffffffffffffff, c=0x7f39f87d8e58) at ../tls_mgm/tls_conn_server.h:504 #4 tls_blocking_write (c=0x7f39f87d8e58, fd=, buf=0xffffffffffffffff , len=382, api=, api=) at ../tls_mgm/tls_conn_server.h:578 #5 0x00007f39ea4993a3 in proto_tls_send (send_sock=0x400, buf=0xffffffffffffffff , len=382, to=0x7f39f87d8e64, id=1) at proto_tls.c:493 #6 0x00007f39e9167c90 in msg_send (msg=, len=, buf=, id=, to=, proto=, send_sock=0x7f3aea6b8be0) at ../../forward.h:123 #7 send_pr_buffer (rb=0x7f3a00739308, buf=0xffffffffffffffff, len=382, ctx=0x2fc, ctx@entry=0x0) at t_funcs.c:66 #8 0x00007f39e91824bf in cancel_branch (t=t@entry=0x7f3a00739068, branch=0) at t_cancel.c:113 #9 0x00007f39e915c5f2 in reply_received (p_msg=0x7f3aeab7c270) at t_reply.c:1542 #10 0x0000000000461a4c in forward_reply (msg=0x7f3aeab7c270) at forward.c:495 #11 0x0000000000446580 in receive_msg (buf=0x7f39ea69e4c0 "SIP/2.0 180 Ringing\r\nTo: ;tag=bbca5ea0d19f77aai0\r\nFrom: \"Customer Name Redacted\" ;tag=y2eyBcyFeN1tp\r\nCall-ID: c92225b1-0a1e-1238-1b99"..., len=4294967295, len@entry=419, rcv_info=0x3, existing_context=0x0) at receive.c:257 #12 0x00007f39ea498545 in tcp_handle_req (_max_msg_chunks=, con=0x7f39f87d8e58, req=0x7f39ea69e4c0 ) at ../../net/proto_tcp/tcp_common.h:412 #13 tls_read_req (con=0x7f39f87d8e58, bytes_read=0x7fffe952b240) at proto_tls.c:616 #14 0x0000000000534297 in handle_io (event_type=, idx=, fm=) at net/net_tcp_proc.c:231 #15 io_wait_loop_epoll (repeat=81930048, t=, h=) at net/../io_wait_loop.h:280 #16 0x0000000000535adf in tcp_worker_proc_loop () at net/net_tcp_proc.c:359 #17 0x000000000053eb9e in tcp_start_processes (chd_rank=chd_rank@entry=0x7dfc00 , startup_done=startup_done@entry=0x7f39ecd0ac88) at net/net_tcp.c:1841 #18 0x000000000041daa6 in main_loop () at main.c:683 #19 main (argc=, argv=) at main.c:1283 ``` **Additional Information:** ``` (gdb) frame 9 #9 0x00007f39e915c5f2 in reply_received (p_msg=0x7f3aeab7c270) at t_reply.c:1542 (gdb) p t->reply_mutex $4 = 1 (gdb) p t->nr_of_outgoings $5 = 1 (gdb) p t->ref_count $6 = 2 (gdb) p t->next_cell->ref_count $11 = 1 (gdb) p t->next_cell->reply_mutex $12 = 0 ``` ``` (gdb) frame 8 #8 0x00007f39e91824bf in cancel_branch (t=t@entry=0x7f3a00739068, branch=0) at t_cancel.c:113 # Locally generated CANCEL (gdb) p crb->buffer $25 = { s = 0x7f39fb528030 "CANCEL sip:username@5.5.5.5:5076;transport=tls SIP/2.0\r\nVia: SIP/2.0/TLS 3.3.3.3:10002;branch=z9hG4bKcbee.91b2e7f2.0\r\nFrom: \"Customer Name Redacted\" ;tag=y2eyBcyFeN1tp\r\nCall-ID: c92225b1-0a1e-1238-1b99-0ecbd32fbb84\r\nTo: \r\nCSeq: 5751113 CANCEL\r\nMax-Forwards: 70\r\nReason: SIP;cause=487;text=\"ORIGINATOR_CANCEL\"\r\nUser-Agent: Vonage\r\nContent-Length: 0\r\n\r\n", len = 430 } # CANCEL (caller side) (gdb) p t->next_cell->uas->request->first_line->u->request->method $11 = { s = 0x7f3a2403e6c8 "CANCEL sip:ABCD9469@172.31.10.10:1337;transport=tls SIP/2.0\r\nVia: SIP/2.0/UDP 11.222.222.222:5060;branch=z9hG4bKcbee.ed0162b.0\r\nFrom: \"NOCUSTOMER NAME\" ;tag=y2eyBcyFeN1tp\r\nCall-ID: c92225b1-0a1e-1238-1b99-0ecbd32fbb84\r\nTo: \r\nCSeq: 5751113 CANCEL\r\nMax-Forwards: 69\r\nReason: SIP;cause=487;text=\"ORIGINATOR_CANCEL\"\r\nUser-Agent: Vonage\r\nContent-Length: 0\r\n\r\n", len = 6 } # 200/canceling reply from OpenSIPs (gdb) p t->next_cell->uas->response->buffer $12 = { s = 0x7f39f7f63a60 "SIP/2.0 200 canceling\r\nVia: SIP/2.0/UDP 11.222.222.222:5060;received=11.222.222.222;rport=5060;branch=z9hG4bKcbee.ed0162b.0\r\nFrom: \"NOCUSTOMER NAME\" ;tag=y2eyBcyFeN1tp\r\nCall-ID: c92225b1-0a1e-1238-1b99-0ecbd32fbb84\r\nTo: ;tag=cde9c87b10b264dff35f60133576ec4a-c4dc\r\nCSeq: 5751113 CANCEL\r\nServer: Vonage\r\nContent-Length: 0\r\n\r\n", len = 397 } ``` **To Reproduce** a SEGFAULT is harder to reproduce, it's easier to get to the condition when passing the if == NULL condition in should_cancel_branch() and having the buffer overwritten This requires either a SIP generator such as SIPp and a callee side (can be a softphone) 1. Insert the breakpoint `break t_cancel.h:67 if t->uac[b].local_cancel.buffer.s != NULL` 2. Have a SIP generator (caller side) send INVITEs to callee and wait for 100 trying, 3. Get an estimate of how long it takes 180 and CANCEL to reach to opensips, modify your test script accordingly. 3. Together with the 180 from the callee side entering the OpenSIPs server, have the SIP generator send a CANCEL. 4. Loop this for a few minutes **Expected behavior** Expected behavior would be no overwriting of the buffer, maintaining a locked state while another process hasn't completed sending the buffer. **OS/environment information** - Operating System: Debian (Jessie) - OpenSIPS installation: compiled version of an opensips fork with some patches back ported packages packaged in .deb files and installed as binary. **Additional context**
razvancrainea commented 5 years ago

Thank you very much for the detailed troubleshoot! Thanks to the information provided, I managed to easily replicate this and fix it. Note that I've pushed fixes from master down to version 2.4, since that's the last supported version. However, if you are using 2.3, you should be easily cherry-pick the commit in your source tree.