OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org
Other
1.27k stars 578 forks source link

[BUG] Segfault in free_contacts() under load in 2.4.7 #2095

Closed attermann closed 4 years ago

attermann commented 4 years ago

OpenSIPS version you are running

version: opensips 2.4.7 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 495b980
main.c compiled on 03:18:29 Apr 10 2020 with gcc 5.4.0

Describe the bug

Opensips crashes randomly under load. Sometime multiple times a day, and other times a day or more between crashes.

Some backtraces show slight variations in the call stack but always eventually through run_trans_callbacks_locked() and segfaults in free_contacts ().

To Reproduce

Unable to reproduce. Happens randomly in production under load.

Expected behavior

No segfault!

Relevant System Logs

Apr 28 20:37
core.opensips.sig11.18058

(gdb) bt full
#0  free_contacts (_c=0x7f5ece72f610) at parser/contact/contact.c:296
        ptr = 0x6e6f3d32723b
#1  0x00000000005aa5fa in free_contact (_c=_c@entry=0x7f5eb32530e0)
    at parser/contact/parse_contact.c:107
No locals.
#2  0x000000000053e75c in clean_hdr_field (hf=hf@entry=0x7f5eb32530b0)
    at parser/hf.c:87
No locals.
#3  0x00007f5eadfea2c7 in clean_msg_clone (msg=0x7f5eb32521e0, 
    max=0x7f5eb3253a78, min=0x7f5eb32521e0) at sip_msg.h:116
        hdr = 0x7f5eb32530b0
#4  run_trans_callbacks_locked (type=type@entry=128, 
    trans=trans@entry=0x7f5eb2ecdf08, req=<optimized out>, 
    rpl=rpl@entry=0x7f5ece745cc8, code=code@entry=200) at t_hooks.c:265
No locals.
#5  0x00007f5eadf976ae in relay_reply (t=<optimized out>, 
    p_msg=<optimized out>, branch=<optimized out>, msg_status=<optimized out>, 
    cancel_bitmap=<optimized out>) at t_reply.c:1353
        relay = <optimized out>
        save_clone = <optimized out>
        buf = 0x7f5ece731738 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP <redacted source>:5060;branch=z9hG4bKt9ULXZ7fmfq6e5fK61D6FA\r\nRecord-Route: <sip:<redacted interface>:7080;transport=tcp;r2=on;lr;did=1f3.6b5ad8f6>\r\nRecord-Route: <sip:<redacted interface>"...
        res_len = 1015
        relayed_code = 200
        relayed_msg = <optimized out>
        bm = {to_tag_val = {s = 0x7f5eb32521e0 "\212B\016", len = 4455636}}
        totag_retr = <optimized out>
        uas_rb = 0x7f5eb2ece008
        cb_s = {
          s = 0x7f5ece731738 "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP <redacted source>:5060;branch=z9hG4bKt9ULXZ7fmfq6e5fK61D6FA\r\nRecord-Route: <sip<redacted interface>:7080;transport=tcp;r2=on;lr;did=1f3.6b5ad8f6>\r\nRecord-Route: <sip:<redacted interface>"..., len = 1015}
        text = {s = 0x7f5ece745cc8 "_b\002", len = -1515613952}
        __FUNCTION__ = "relay_reply"
#6  0x00007f5eadf9e8a7 in reply_received (p_msg=0x7f5ece745cc8)
    at t_reply.c:1603
        msg_status = 200
        last_uac_status = 100
        branch = 0
        reply_status = <optimized out>
        timer = 2
        cancel_bitmap = 0
        uac = <optimized out>
        t = 0x7f5eb2ecdf08
        backup_list = <optimized out>
        has_reply_route = <optimized out>
        __FUNCTION__ = "reply_received"
#7  0x0000000000470a71 in forward_reply (msg=msg@entry=0x7f5ece745cc8)
    at forward.c:495
        new_buf = 0x0
        to = 0x0
        new_len = 1
        mod = 0x7f5ece6425e8
        proto = <optimized out>
        id = 0
        send_sock = <optimized out>
        s = <optimized out>
        len = <optimized out>
        __FUNCTION__ = "forward_reply"
#8  0x0000000000446d62 in receive_msg (
    buf=buf@entry=0x9bbec0 <tcp_current_req> "SIP/2.0 200 OK\r\nVia: SIP/2.0/TCP <redacted interface>:7080;branch=z9hG4bK9529.26b38601.0\r\nVia: SIP/2.0/UDP <redacted source>:5060;branch=z9hG4bKt9ULXZ7fmfq6e5fK61D6FA\r\nRecord-Route: <sip:<redacted interface>:7080;transpo"..., len=<optimized out>, rcv_info=rcv_info@entry=0x7ffd44971ba0, 
    existing_context=existing_context@entry=0x0, flags=flags@entry=0)
    at receive.c:258
        ctx = 0x7f5ece72de40
        msg = 0x7f5ece745cc8
        start = {tv_sec = 1588106249, tv_usec = 998674}
        rc = 3
        in_buff = {
          s = 0x9bbec0 <tcp_current_req> "SIP/2.0 200 OK\r\nVia: SIP/2.0/TCP <<redacted interface>:7080;branch=z9hG4bK9529.26b38601.0\r\nVia: SIP/2.0/UDP <redacted ource>:5060;branch=z9hG4bKt9ULXZ7fmfq6e5fK61D6FA\r\nRecord-Route: <sip:<redacted interface>:7080;transpo"..., len = 1075}
        __FUNCTION__ = "receive_msg"
#9  0x000000000060b443 in tcp_handle_req (_max_msg_chunks=<optimized out>, 
    con=<optimized out>, req=0x9bbec0 <tcp_current_req>)
    at net/proto_tcp/tcp_common.h:411
        local_rcv = {src_ip = {af = 2, len = 4, u = {addrl = {3526109766, 0}, 
              addr32 = {3526109766, 0, 0, 0}, addr16 = {10822, 53804, 0, 0, 0, 
                0, 0, 0}, addr = "F*,\322", '\000' <repeats 11 times>}}, 
          dst_ip = {af = 2, len = 4, u = {addrl = {3408669254, 0}, addr32 = {
                3408669254, 0, 0, 0}, addr16 = {10822, 52012, 0, 0, 0, 0, 0, 
                0}, addr = "F*,\313", '\000' <repeats 11 times>}}, 
          src_port = 33762, dst_port = 7080, proto = 2, 
          proto_reserved1 = 647886724, proto_reserved2 = 0, src_su = {s = {
              sa_family = 2, 
              sa_data = "\203\342F*,\322\000\000\000\000\000\000\000"}, sin = {
              sin_family = 2, sin_port = 57987, sin_addr = {
                s_addr = 3526109766}, 
              sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {
              sin6_family = 2, sin6_port = 57987, sin6_flowinfo = 3526109766, 
              sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, 
                  __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 
                    0, 0}}}, sin6_scope_id = 1023}}, 
          bind_address = 0x7f5ece63ef10}
        msg_buf = 0x9bbec0 <tcp_current_req> "SIP/2.0 200 OK\r\nVia: SIP/2.0/TCP <redacted interface>:7080;branch=z9hG4bK9529.26b38601.0\r\nVia: SIP/2.0/UDP <redacted_source>:5060;branch=z9hG4bKt9ULXZ7fmfq6e5fK61D6FA\r\nRecord-Route: <sip:<redacted interface>:7080;transpo"...
        msg_len = <optimized out>
        c = 56 '8'
        size = 0
#10 tcp_read_req (con=0x7f5eaed175f8, bytes_read=0x7ffd44971c80)
    at net/proto_tcp/proto_tcp.c:1187
        bytes = <optimized out>
        total_bytes = 1075
        req = 0x9bbec0 <tcp_current_req>
        src_su = {s = {sa_family = 0, sa_data = '\000' <repeats 13 times>}, 
          sin = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, 
            sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {
            sin6_family = 0, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {
              __in6_u = {__u6_addr8 = '\000' <repeats 15 times>, 
                __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 
                  0, 0}}}, sin6_scope_id = 6273364}}
        dst_su = {s = {sa_family = 0, 
            sa_data = "\000\000\000\000\000\000\000\221\251\245-\325a#"}, 
          sin = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, 
            sin_zero = "\000\221\251\245-\325a#"}, sin6 = {sin6_family = 0, 
            sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {__in6_u = {
                __u6_addr8 = "\000\221\251\245-\325a#\000\000\000\000\000\000\000", __u6_addr16 = {37120, 42409, 54573, 9057, 0, 0, 0, 0}, __u6_addr32 = {
                  2779353344, 593614125, 0, 0}}}, sin6_scope_id = 2932582632}}
        __FUNCTION__ = "tcp_read_req"
#11 0x00000000005e88f7 in handle_io (fm=0x7f5ece67d378, idx=idx@entry=49, 
    event_type=event_type@entry=1) at net/net_tcp_proc.c:251
        ret = 0
        n = <optimized out>
        con = 0x7f5eaed175f8
        s = 309
        rw = <optimized out>
        resp = <optimized out>
        response = {140044726089208, 1}
        __FUNCTION__ = "handle_io"
#12 0x00000000005ea88b in io_wait_loop_epoll (h=<optimized out>, 
    t=<optimized out>, repeat=<optimized out>) at net/../io_wait_loop.h:280
        ep_event = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
        ret = <optimized out>
        n = 1
        r = 49
        e = <optimized out>
        fd = <optimized out>
        i = <optimized out>
#13 tcp_worker_proc_loop () at net/net_tcp_proc.c:396
        __FUNCTION__ = "tcp_worker_proc_loop"
#14 0x00000000005f87fb in tcp_start_processes (
    chd_rank=chd_rank@entry=0x998104 <chd_rank>, 
    startup_done=startup_done@entry=0x0) at net/net_tcp.c:1909
        r = 31
        reader_fd = {203, 205}
        pid = <optimized out>
        __FUNCTION__ = "tcp_start_processes"
#15 0x000000000041d343 in main_loop () at main.c:790
        startup_done = 0x0
        chd_rank = 48
        rc = <optimized out>
#16 main (argc=<optimized out>, argv=<optimized out>) at main.c:1442
        cfg_stream = <optimized out>
        c = <optimized out>
        r = <optimized out>
        tmp = 0x7ffd44973ee1 ""
        tmp_len = <optimized out>
        port = <optimized out>
        proto = <optimized out>
        protos_no = <optimized out>
        options = 0x639290 "f:cCm:M:b:l:n:N:rRvdDFEVhw:t:u:g:P:G:W:o:"
        ret = -1
        seed = 1514354603
        __FUNCTION__ = "main"

OS/environment information

Additional context

Seems to be the same issue as reported mid-March in OpenSIPS-Users "Re: segfault at the opensips 2.4.5 (Liviu Chircu) (Johan De Clercq)".

attermann commented 4 years ago

I would really appreciate a response to this bug report. If more details are required to troubleshoot this I will provide whatever I can. I am also happy to work on a fix for this myself but I'd like to hear back first to know whether perhaps this has already been fixed since there was a similar report to mine.

attermann commented 4 years ago

It's worth noting that since reverting back to 2.3 (2.3.6 to be exact) we have not encountered this segfault again.

liviuchircu commented 4 years ago

hey @attermann -- we're pretty busy these days, getting the 3.1 release ready. Expect this issue to get more attention once we hit the "feature freeze" point, within less than a couple of weeks.

Meanwhile, if you can reproduce this crash in a lab, I highly recommend you use the git bisect feature and help us locate the commit which broke it. For example, on latest 2.4 branch, you could search for the commit which broke it using this efficient technique:

git bisect start # begin hunting the bad commit
git bisect bad # mark the current, last commit as "bad"
git checkout a28b3d4ae5c9 # go to your 2.3.x "found-to-be-good" commit
git bisect good # actually mark this commit as "good"
# at this point, git will select a half-way commit.  Re-compile and re-test.  Based on your results, do either:
git bisect good / git bisect bad
# at this point, git will select a half-way commit.  Re-compile and re-test.  Based on your results, do either:
git bisect good / git bisect bad
...
# keep going until you find the exact commit which broke things.  This will take logarithmic time (say, 5-8 steps)

Cheers!

vitalikvoip commented 4 years ago

Also have this problem and can't reproduce it intentionally. Used 2.4.5 and have another issue:

May 20 12:04:18 west /usr/sbin/opensips[1745]: CRITICAL:core:io_watch_add: #012>>> [TCP_main] BUG trying to overwrite entry 738 in the hash(738, 19, 0x7fbab137ea38,1) with (738, 19, 0x7fbab1235138,1)#012#012It seems you have hit a programming bug. Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues

Updated to 2.4.7 and have these crashes. It crashes either when tm wants to clean a transaction or in dialog when it wants to get a contact from sip_msg->contact->parsed. "sip_msg->contact->parsed" looks like a valid pointer but the memory it points to looks overwritten:

#3  0x000000000050cb6d in free_contact (_c=0x7f81ba070a20) at parser/contact/parse_contact.c:107
107     parser/contact/parse_contact.c: No such file or directory.
(gdb) p _c
$1 = (contact_body_t **) 0x7f81ba070a20
(gdb) p **_c
$2 = {star = 0 '\000', contacts = 0x63203a73253a6572}
#0  0x00000000004d7fd4 in do_free_params (_p=0xa736e6f69, _shm=0) at parser/parse_param.c:453
#1  0x00000000004d8048 in free_params (_p=0xa736e6f69) at parser/parse_param.c:465
#2  0x000000000050c22f in free_contacts (_c=0x7f81c697b058) at parser/contact/contact.c:298
#3  0x000000000050cb6d in free_contact (_c=0x7f81ba070a20) at parser/contact/parse_contact.c:107
#4  0x00000000004c2f59 in clean_hdr_field (hf=0x7f81ba0709f0) at parser/hf.c:87
#5  0x00007f81b60f20b8 in clean_msg_clone (msg=0x7f81ba06e360, min=0x7f81ba06e360, max=0x7f81ba0710f0) at sip_msg.h:116
#6  0x00007f81b60f2cc0 in run_trans_callbacks_locked (type=128, trans=0x7f81ba0d8bf0, req=0x7f81ba06e360, rpl=0x7f81c697b188, code=200) at t_hooks.c:265
#7  0x00007f81b6101e0d in relay_reply (t=0x7f81ba0d8bf0, p_msg=0x7f81c697b188, branch=0, msg_status=200, cancel_bitmap=0x7ffc7517ad2c) at t_reply.c:1358
#8  0x00007f81b610307c in reply_received (p_msg=0x7f81c697b188) at t_reply.c:1608
#9  0x00000000004364a6 in forward_reply (msg=0x7f81c697b188) at forward.c:495
#10 0x0000000000474e2d in receive_msg (
    buf=0x8dd860 <tcp_current_req> "SIP/2.0 200 OK\r\nCSeq: 3 INVITE\r\nCall-ID: 4fb0a15e993d8b7a-c8a2fa2d29abf7c7\r\nFrom: \"jhon.doe\" <sip:123123123@john.com>;tag=241273b736d581e8\r\nTo: <sip:333222111@john.com>;tag=60147127_c9b66bbf"..., len=4711, 
    rcv_info=0x7ffc7517aef0, existing_context=0x0, flags=0) at receive.c:258
#11 0x000000000054c7f2 in tcp_handle_req (req=0x8dd860 <tcp_current_req>, con=0x7f81b9041f48, _max_msg_chunks=10) at net/proto_tcp/tcp_common.h:411
#12 0x0000000000551f3a in tcp_read_req (con=0x7f81b9041f48, bytes_read=0x7ffc7517b0ac) at net/proto_tcp/proto_tcp.c:1275
#13 0x000000000053fb02 in handle_io (fm=0x7f81c690c0f0, idx=0, event_type=1) at net/net_tcp_proc.c:255
#14 0x000000000053e055 in io_wait_loop_epoll (h=0x8fe4e0 <_worker_io>, t=2, repeat=0) at net/../io_wait_loop.h:280
#15 0x0000000000540c1a in tcp_worker_proc_loop () at net/net_tcp_proc.c:401
#16 0x00000000005380bc in tcp_start_processes (chd_rank=0x80071c <chd_rank>, startup_done=0x7f81b8c8db70) at net/net_tcp.c:2063
#17 0x000000000043ce44 in main_loop () at main.c:790
#18 0x000000000043fae9 in main (argc=13, argv=0x7ffc7517b3e8) at main.c:1442
vitalikvoip commented 4 years ago

@attermann do you use dialog, rtpengine, rest_client modules?

I've briefly checked what changed and seen some contact related changes in dialog module. It's just an assumption and I have no proofs. I've rebuilt opensips with q_malloc and memdbg and sometimes it notifies about overwritten memory chunks.

attermann commented 4 years ago

@vitalikvoip Yes I do use the dialog module but only on a small percentage of total calls. I do have one core that crashed in dialog though. Backtrace below for reference.

#0  0x0000000000523d1f in fm_remove_free (n=0x7f076df73418, qm=0x7f076de7a010)
    at mem/f_malloc.c:209
        pf = 0x0
        hash = 0
#1  fm_malloc (qm=0x7f076de7a010, size=0) at mem/f_malloc.c:438
        frag = 0x7f076df73418
        n = <optimized out>
        hash = 0
        __FUNCTION__ = "fm_malloc"
#2  0x00007f074bbe18bc in build_extra_hdr (cell=0x7f0752fd9890, 
    str_hdr=0x7ffce3f9e2d0, extra_hdrs=0x0) at dlg_req_within.c:337
        p = <optimized out>
#3  dlg_end_dlg (dlg=dlg@entry=0x7f0752fd9890, 
    extra_hdrs=extra_hdrs@entry=0x0, send_byes=send_byes@entry=1)
    at dlg_req_within.c:459
        str_hdr = {s = 0x0, len = 0}
        t = 0x10
        i = <optimized out>
        res = 0
        __FUNCTION__ = "dlg_end_dlg"
#4  0x00007f074bbf8d4d in dlg_ontimeout (tl=0x7f0752fd98e0)
    at dlg_handlers.c:2095
        old_ctx = <optimized out>
        new_ctx = <optimized out>
        dlg = 0x7f0752fd9890
        new_state = 0
        old_state = 757703520
        unref = 5
        do_expire_actions = 1
        __FUNCTION__ = "dlg_ontimeout"
#5  0x00007f074bbdcac9 in dlg_timer_routine (ticks=<optimized out>, 
    attr=<optimized out>) at dlg_timer.c:602
        tl = 0xffffffffffffffff
        ctl = 0x7f0752fd98e0
        __FUNCTION__ = "dlg_timer_routine"
#6  0x00000000004ba09c in handle_timer_job () at timer.c:731
        t = 0x7f074e186440
        __FUNCTION__ = "handle_timer_job"
#7  0x00000000005e874a in handle_io (fm=0x7f076debfae8, idx=idx@entry=117, 
    event_type=event_type@entry=1) at net/net_tcp_proc.c:134
        ret = 0
        n = <optimized out>
        con = <optimized out>
        s = 302
        rw = <optimized out>
        resp = <optimized out>
        response = {139669392015104, 1}
        __FUNCTION__ = "handle_io"
#8  0x00000000005ea88b in io_wait_loop_epoll (h=<optimized out>, 
    t=<optimized out>, repeat=<optimized out>) at net/../io_wait_loop.h:280
        ep_event = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
        ret = <optimized out>
        n = 1
        r = 117
        e = <optimized out>
        fd = <optimized out>
        i = <optimized out>
#9  tcp_worker_proc_loop () at net/net_tcp_proc.c:396
        __FUNCTION__ = "tcp_worker_proc_loop"
#10 0x00000000005f87fb in tcp_start_processes (
    chd_rank=chd_rank@entry=0x998104 <chd_rank>, 
    startup_done=startup_done@entry=0x0) at net/net_tcp.c:1909
        r = 23
        reader_fd = {187, 189}
        pid = <optimized out>
        __FUNCTION__ = "tcp_start_processes"
#11 0x000000000041d343 in main_loop () at main.c:790
        startup_done = 0x0
        chd_rank = 40
        rc = <optimized out>
#12 main (argc=<optimized out>, argv=<optimized out>) at main.c:1442
        cfg_stream = <optimized out>
        c = <optimized out>
        r = <optimized out>
        tmp = 0x7ffce3f9eee1 ""
        tmp_len = <optimized out>
        port = <optimized out>
        proto = <optimized out>
        protos_no = <optimized out>
        options = 0x639290 "f:cCm:M:b:l:n:N:rRvdDFEVhw:t:u:g:P:G:W:o:"
        ret = -1
        seed = 4044006346
        __FUNCTION__ = "main"
liviuchircu commented 4 years ago

@vitalikvoip and @attermann : I did some stress tests on 2.4.7 (f8e19b62a) with "dialog" and "topology_hiding", but could not find any crash. If I remember correctly, the free_contacts() crash resembles some kind of PKG memory corruption, so any module could be the culprit, really.

Bottom line: please provide either your opensips.cfg or a minimal version of it that still triggers the crash (ship to liviu@opensips.org if you cannot make it public), so I can study/compare them and get closer to a diagnosis on this memory corruption. Alternatively, if you can help me reproduce the crash on my lab setup, the bug is 80%+ solved.

Cheers!

attermann commented 4 years ago

@liviuchircu Unfortunately I haven't been able to reproduce outside of production so I'm going to have to send my whole config. I trimmed a lot of commented sections and anonymized so hopefully I didn't break it. Note that most of the "external" connections are TCP, and "internal" are UDP.

opensips.cfg:

####### Global Parameters #########

#log_level=2
log_level=3
#log_level=4
log_stderror=no
log_facility=LOG_LOCAL1

children=4

/* uncomment the following line to enable debugging */
#debug_mode=yes

/* comment the next line to enable the auto discovery of local aliases
   based on revers DNS on IPs */
auto_aliases=no

# CBA
check_via=no
dns=no
rev_dns=no
dns_try_ipv6=no
dns_retr_time=1
dns_retr_no=3

disable_dns_blacklist=yes
disable_dns_failover=no
dns_use_search_list=no

# CBA TCP related params
open_files_limit=16384
#tcp_children=16
tcp_children=32
tcp_max_connections=16384
# test params for TCP connection keep-alive (to resolve webrtc client port changing by setting connection lifetime to exceed registration interval)
#tcp_connection_lifetime=130
tcp_connection_lifetime=610
#tcp_keepalive=1
#tcp_keepcount=3
#tcp_keepidle=300
#tcp_keepinterval=300

# CBA performance checks
event_pkg_threshold = 90
event_shm_threshold = 90
#exec_dns_threshold = 60000
exec_dns_threshold = 100000 /* 100 milliseconds */
#exec_msg_threshold = 60000
exec_msg_threshold = 1000000 /* 1 second */
#tcp_threshold = 60000
tcp_threshold = 100000 /* 100 milliseconds */

# include local config
include_file "local.cfg"

# ------------------ module loading ----------------------------------
#mpath="/usr/lib/x86_64-linux-gnu/opensips/modules"
mpath="/usr/local/opensips/lib64/opensips/modules"

#### SIGNALING module
loadmodule "signaling.so"

#### StateLess module
loadmodule "sl.so"

#### Transaction Module
loadmodule "tm.so"
modparam("tm", "fr_timeout", 5)
#modparam("tm", "fr_inv_timeout", 30)
modparam("tm", "fr_inv_timeout", 120)
modparam("tm", "restart_fr_on_each_reply", 0)
modparam("tm", "onreply_avp_mode", 0)

#### Record Route Module
loadmodule "rr.so"
/* do not append from tag to the RR (no need for this script) */
modparam("rr", "append_fromtag", 0)

#### MAX ForWarD module
loadmodule "maxfwd.so"

#### SIP MSG OPerationS module
loadmodule "sipmsgops.so"

#### FIFO Management Interface
loadmodule "mi_fifo.so"
modparam("mi_fifo", "fifo_name", "/tmp/opensips_fifo")
modparam("mi_fifo", "fifo_mode", 0666)

#### URI module
loadmodule "uri.so"
modparam("uri", "use_uri_table", 0)

#### UDP protocol
loadmodule "proto_udp.so"
loadmodule "proto_tcp.so"

#### TEXTOPS module
loadmodule "textops.so"

#### DbText module
loadmodule "db_text.so"
modparam("db_text", "db_mode", 1)

#### nat traversal module
loadmodule "nat_traversal.so"
modparam("nat_traversal", "keepalive_state_file", "/var/run/opensips/keepalive_state")

##### RTPengine module
loadmodule "rtpengine.so"
modparam("rtpengine", "extra_id_pv", "$avp(extra_id)")
# "rtpengine_sock" is only means to specify rtpengine server in 2.3 and below
modparam("rtpengine", "rtpengine_sock", "udp:1.1.1.2:2223 udp:1.1.1.3:2223")
# db_url" for specifying rtpengine servers is only supported in 2.4 and above
#modparam("rtpengine", "db_url", "text:///etc/opensips/dbtext")

##### user location module
loadmodule "usrloc.so"
modparam("usrloc", "use_domain", 1)
modparam("usrloc", "nat_bflag", "NAT")
modparam("usrloc", "db_mode", 0)
#modparam("usrloc", "db_mode", 1)
#modparam("usrloc", "db_url", "mysql://opensips:opensipsrw@localhost/opensips")

##### registrar module
loadmodule "registrar.so"
/* uncomment the next line not to allow more than 10 contacts per AOR */
modparam("registrar", "max_contacts", 10)

##### Mid-Registrar module
loadmodule "mid_registrar.so"
modparam("mid_registrar", "mode", 2) /* 0 = contact-mirroring / 1 = contact-throttling / 2 = aor-throttling */
modparam("mid_registrar", "outgoing_expires", 7200)
#modparam("mid_registrar", "insertion_mode", 0) /* 0 = contact; 1 = path */
/* uncomment the next line not to allow more than 10 contacts per AOR */
modparam("mid_registrar", "max_contacts", 10)
# CBA set TCP connections to close only after reg expiry
modparam("mid_registrar", "tcp_persistent_flag", 10)

##### dialog module (for call duration limits)
loadmodule "dialog.so"

##### dns_cache module
loadmodule "cachedb_local.so"
modparam("cachedb_local", "cachedb_url", "local://")
loadmodule "dns_cache.so"
modparam("dns_cache", "cachedb_url", "local://")
modparam("dns_cache", "blacklist_timeout", 10) # 10 seconds

####### Routing Logic ########

# main request routing logic

route {

  # perform initial message validation
  if (!mf_process_maxfwd_header("10"))
  {
    xlog("L_NOTICE", "[$ci] sanity: rejecting due to maxfwd\n");
    sl_send_reply( "483", "Too Many Hops");
    exit;
  }
  if ($ml >= 16384)
  {
    xlog("L_NOTICE", "[$ci] sanity: rejecting due to message size\n");
    sl_send_reply( "513", "Message too big");
    exit;
  };
  if ($ua =~ "friendly-scanner")
  {
    xlog("L_NOTICE", "[$ci] blacklist: rejecting due to user-agent\n");
    # do not reply for stealth
    #sl_send_reply("403", "Forbidden");
    exit;
  };

  # OPTIONS processing
  if (is_method( "OPTIONS"))
  {
    # ReachUC clients require response to OPTIONS
    # only reply on internal port(s) for stealth
    #if ($Rp == 6050 || $Rp == 8070)
      sl_send_reply( "200", "OK");
    exit;
  }

  # check for sequential request within a dialog (indicated by tag in To header)
  if (has_totag())
  {
    # sequential request within a dialog should
    # take the path determined by record-routing
    if (loose_route())
    {
      if (is_method( "INVITE|BYE|ACK|CANCEL|REFER|INFO|SUBSCRIBE|NOTIFY"))
      {
        # test for issue with natted uc client
        # fix contact header for natted clients
        #if (client_nat_test("3")) {
        if (client_nat_test("9")) {
          fix_contact();
        }
        # mark routing logic in request
        append_hf("P-hint: rr-enforced\r\n");
        route(sequential);
      }
      else
      {
        xlog("L_WARN", "[$ci] except_lr_method: $rm (from: $fu -> to: $tu)\n");
      }
    }
    # sequential without record-route should be treated as initial
    # (currently only applies to ACK to a failure response)
    else if (is_method( "ACK"))
    {
      # mark routing logic in request
      append_hf("P-hint: rr-absent\r\n");
      route(initial);
    }
    else
    {
      #xlog("L_WARN", " [$ci] except_loose: $rm (from: $fu -> to: $tu)\n");
      append_hf("P-hint: rr-absent\r\n");
      route(initial);
    }
    exit;
  }

  # record routing for dialog forming requests (in case they are routed)
  # remove any preloaded route headers
  remove_hf( "Route");

  # request is an initial request, so apply routing logic

  # check for dialog-forming requests (INVITE and SUBSCRIBE) that should get record-route
  if (is_method( "INVITE|SUBSCRIBE"))
  {
    # test for issue with natted uc client
    # fix contact header for natted clients
    #if (client_nat_test("3")) {
    if (client_nat_test("9")) {
      fix_contact();
    }
    record_route();
    route(initial);
  }
  # allow initial ACK and CANCEL requests as well (outside of dialog)
  else if (is_method( "ACK|CANCEL"))
  {
    # sequential request without to-tag? (ie, ACK to a malformed 407 response without to-tag, or CANCEL request)
    route(initial);
  }
  # check for special-case for REGISTER request
  else if (is_method( "REGISTER"))
  {
    # fix contact header for natted clients
    #if (client_nat_test("3")) {
    if (client_nat_test("9")) {
      fix_contact();
    }
    # TODO determine if record-route is appropriate for REGISTER
    record_route();
    route(register);
  }
  # check for special-case for MESSAGE request
  else if (is_method( "MESSAGE"))
  {
    route(message);
  }
  # check for non-dialog-forming requests
  else if (is_method( "SUBSCRIBE|NOTIFY|INFO"))
  {
    route(initial);
  }
  else
  {
    xlog("L_WARN", "[$ci] unsupported: $rm (from: $fu -> to: $tu)\n");
  }

}

route[sdp] {
  # apply SDP manipulation
  if(has_body("application/sdp")) {
    # fix IPv6 media
    if (search_body("[Cc]=[Ii][Nn] [Ii][Pp]6")) {
      xlog("L_INFO", "[$ci] IPv6 SDP detected in initial request\n");
      #fix_nated_sdp("2", "10.11.12.13");
      #fix_nated_sdp("2");
      # replace ipv6 IP with fake private IP to invoke IP detection logic in NMS
      replace_body_all("[Cc]=[Ii][Nn] [Ii][Pp]6.*", "c=IN IP4 10.11.12.13");
    }
    # fix NS webrtc media
    #if (search_body("[Oo]=[Nn][Ee][Tt][Ss][Aa][Pp][Ii][Ee][Nn][Ss]_[Nn][Mm][Ss].*")) {
    if (false) {
      #xlog("L_INFO", "[$ci] NMS SDP detected\n");
      if (search_body("[Aa]=[Mm][Ss][Ii][Dd]-[Ss][Ee][Mm][Aa][Nn][Tt][Ii][Cc].*")) {
        xlog("L_INFO", "[$ci] removing msid-semantic from NMS SDP\n");
        replace_body_all("[Aa]=[Mm][Ss][Ii][Dd]-[Ss][Ee][Mm][Aa][Nn][Tt][Ii][Cc].*", "a=removed");
      }
      if (search_body("[Aa]=[Mm][Ii][Dd]:[Aa][Uu][Dd][Ii][Oo].*")) {
        xlog("L_INFO", "[$ci] removing mid:audio from NMS SDP\n");
        replace_body_all("[Aa]=[Mm][Ii][Dd]:[Aa][Uu][Dd][Ii][Oo].*", "a=removed");
      }
    }
  }
}

route[transcode]
{

  # check for switch to transcode call
  if (is_method("INVITE|ACK|CANCEL|BYE") && ($Rp == 7080 || $Rp == 8070)) {
    setflag(transcode);
    xlog("L_INFO", "[$ci] *** SET transcode=true\n");
  }

  if (isflagset(transcode)) {

    # create unique rtpengine id for this branch transaction
    #$avp(extra_id) = "branch." + $T_branch_idx;
    #$avp(extra_id) = $hdr(Via) + "." + $T_branch_idx;
    #$avp(extra_id) = $ci + "-" + $T_branch_idx;
    $avp(extra_id) = $ci + "-" + $ft + "-" + $T_branch_idx;
    xlog("L_INFO", "[$ci] extra_id: $avp(extra_id)\n");

    $var(rtpengine_branchflags) = "via-branch=extra";
    xlog("L_INFO", "[$ci] *** SET rtpengine_branchflags=$var(rtpengine_branchflags)\n");

    # codec-mask-CODEC: do not present CODEC to B-side, but use it on A-side
    # codec-strip-CODEC: do not use CODEC for A-side
    # transcode-CODEC: offer CODEC to B-side

    # external interface, coming from client
    if ($Rp == 5060 || $Rp == 7080) {
      # codec-mask-opus: make the A-side receive opus codec, but B-side will use something else
      # codec-strip-PCMU codec-strip-PCMA: prevent the A-side from receiving either PCMU or PCMA codec
      # transcode-PCMU transcode-PCMA: offer either PCMU or PCMA codec to B-side
      #$var(rtpengine_offerflags) = "codec-mask-iLBC codec-mask-opus codec-strip-PCMU codec-strip-PCMA transcode-PCMU transcode-PCMA";
      $var(rtpengine_offerflags) = "via-branch=extra codec-mask-iLBC codec-mask-opus codec-strip-PCMU codec-strip-PCMA transcode-PCMU transcode-PCMA";
    }
    # internal interface, coming from pbx
    else if ($Rp == 6050 || $Rp == 8070) {
      # codec-mask-PCMU codec-mask-PCMA: make the A-side receive PCMU or PCMA codec, but B-side will use something else
      # codec-strip-opus: prevent the A-side from receiving opus codec
      # transcode-opus: offer opus codec to B-side
      #$var(rtpengine_offerflags) = "codec-mask-PCMU codec-mask-PCMA codec-strip-iLBC codec-strip-opus transcode-iLBC transcode-opus";
      $var(rtpengine_offerflags) = "via-branch=extra codec-mask-PCMU codec-mask-PCMA codec-strip-iLBC codec-strip-opus transcode-iLBC transcode-opus";
    }
    xlog("L_INFO", "[$ci] *** SET rtpengine_offerflags=$var(rtpengine_offerflags)\n");
  }
}

route[register]
{

      xlog("L_NOTICE", "[$ci] register_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, contact: $ct)\n");

  if ($rd=~".*.service")
  {
    # handle service non-fqdn domains
    xlog("L_INFO", "[$ci] service_domain: $rd\n");
    # force static route to service proxy
    # service calls routed to <reseller>.hpbx.example.com
    $var(domain) = $rd;
    # route to hostname derived from domain
    $var(proxy) = "sip:" + $(var(domain){s.select,-2,.}) + ".hpbx.example.com";
  }
  else if ($rd=~".*.other.net")
  {
    # handle other domains
    xlog("L_INFO", "[$ci] other_domain: $rd\n");
    # force static route to other proxy
    # other calls routed directly to domain
    $var(proxy) = "sip:" + $rd;
  }
  else
  {
    xlog("L_NOTICE", "[$ci] register unknown_domain: $rd, REJECTING\n");
    # do not reply for stealth
    #sl_send_reply("403", "Forbidden");
    exit;
  }

  # mid-registrar
  #bm_start_timer("mid-reg-save");
  mid_registrar_save("location");
  #mid_registrar_save("location", "r");
  # note call to bm_log_time belows likely disrupts $retcode checked next
  #bm_log_timer("mid-reg-save");
  switch ($retcode) {
  case 1:
    # 1 (success) - current REGISTER request must be dispatched by the script writer over to the main registrar
    xlog("L_INFO", "[$ci] registrar: relaying registration for \"$ru\" to \"$var(proxy)\" [$si:$sp] $ru (from: $fu -> to: $tu)\n");
    $ru = $var(proxy);
    # swap internal and external interfaces
    route(swap);
    xlog("L_NOTICE", "[$ci] register_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, contact: $ct)\n");
    # relay the request
    t_on_reply("transaction");
    t_relay();
    break;
  case 2:
    # 2 (success) - current REGISTER request has been absorbed by the mid-registrar; a 200 OK reply has been sent upstream
    xlog("L_INFO", "[$ci] registrar: absorbing registration for \"$ru\" [$si:$sp] (from: $fu -> to: $tu)\n");
    break;
  default:
    # -1 (error) - generic error code; the logs should provide more help
    xlog("L_WARN", "[$ci] registrar: failed to save registration for \"$ru\"! [$si:$sp] (from: $fu -> to: $tu)\n");
    sl_send_reply("500", "Server Internal Error");
    exit;
  }

}

route[initial]
{

  # maniulate sdp
  route(sdp);

  # external interface
  if ($Rp == 5060 || $Rp == 7080)
  {

    xlog("L_NOTICE", "[$ci] initial_external_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");

    # mark routing logic in request
    append_hf("P-hint: external\r\n");

    if ($rd=~".*.service")
    {
      # handle service non-fqdn domains
      xlog("L_INFO", "[$ci] service_domain: $rd\n");
      # service calls routed to <reseller>.hpbx.example.com
      $var(domain) = $rd;
      # route to hostname derived from domain
      $var(proxy) = "sip:" + $(var(domain){s.select,-2,.}) + ".hpbx.example.com";
      # temporarily routing to alternate host
      #$var(proxy) = "sip:ns-dev01.whitelabelvoip.net";

      # honor proxy request if present
#      xlog("L_INFO", "[$ci] *** X-RUC-Proxy: $hdr(X-RUC-Proxy)\n");
#      if ($(hdr(X-RUC-Proxy){s.len}) > 0) {
#        # set proxy
#        xlog("L_INFO", "[$ci] *** Setting new proxy!\n");
#        $var(proxy) = "sip:" + $hdr(X-RUC-Proxy);
#      }
      xlog("L_INFO", "[$ci] proxy: $var(proxy)\n");
      $du = $var(proxy);
    }
    else if ($rd=~".*.other.net")
    {
      # handle direct domains
      xlog("L_INFO", "[$ci] other_domain: $rd\n");
      $du = "sip:" + $rd;
    }
    else
    {
      xlog("L_NOTICE", "[$ci] initial unknown_domain: $rd, REJECTING\n");
      # do not reply for stealth
      #sl_send_reply("403", "Forbidden");
      exit;
    }

    # limit outgoing call duration to specified limit
    xlog("L_INFO", "[$ci] *** X-RUC-Limit: $hdr(X-RUC-Limit) (len: $(hdr(X-RUC-Limit){s.len}))\n");
#    if ($hdr(X-RUC-Limit)!="0") {
#    if ($(hdr(X-RUC-Limit){s.len}) > 0 && $hdr(X-RUC-Limit)!="0") {
    if ($hdr(X-RUC-Limit) != null && $(hdr(X-RUC-Limit){s.len}) > 0 && $hdr(X-RUC-Limit) != "0") {
      # set outgoing call duration limit
      xlog("L_INFO", "[$ci] *** Setting call duration limit of $hdr(X-RUC-Limit) seconds!\n");
      create_dialog("B");
      $DLG_timeout = $hdr(X-RUC-Limit);
    }

    # NOTE: moved to branch_route[initial]
    # conditionally swap internal and external interfaces
    #route(swap);

    xlog("L_NOTICE", "[$ci] initial_external_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
  }
  # internal interface
  else if ($Rp == 6050 || $Rp == 8070)
  {

    # NOTE: ACL now handled by UFW blocking of internal ports to external IPs

    xlog("L_NOTICE", "[$ci] initial_internal_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");

    # mark routing logic in request
    append_hf("P-hint: internal\r\n");

    # mid-registrar
    xlog("L_INFO", "[$ci] registrar: looking up \"$ru\" $rm [$si:$sp] (from: $fu -> to: $tu)\n");
    if (!mid_registrar_lookup("location")) {
      xlog("L_INFO", "[$ci] registrar: could not find \"$ru\" $rm [$si:$sp] (from: $fu -> to: $tu)\n");
      if (is_method( "INVITE") && $rd=~".*.service")
      {
        # can not route .service domains without registration
        xlog("L_INFO", "[$ci] registrar: rejecting request, \"$ru\" not found in lookup $rm [$si:$sp] (from: $fu -> to: $tu)\n");
        t_reply("404", "Not Found");
        exit;
      }
      # 2.3 and below self-check
      else if (uri == myself) {
      # 2.4 and above self-check
      #else if (is_myself("$rd")) {
        # uri refers to this proxy (probably from a past registration)
        xlog("L_INFO", "[$ci] registrar: rejecting request, \"$ru\" refers to myself: $rm [$si:$sp] (from: $fu -> to: $tu)\n");
        t_reply("404", "Not Found");
        exit;
      }
      xlog("L_INFO", "[$ci] registrar: routing request as addressed to \"$ru\" $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
      # routing as addressed instead of failing, as NOTIFY is using direct uri
      #t_reply("404", "Not Found");
      #exit;
    }
    else {
      xlog("L_INFO", "[$ci] registrar: found \"$ru\", routing to contact $rm [$si:$sp] (from: $fu -> to: $tu)\n");
    }

    # NOTE: moved to branch_route[initial]
    # conditionally swap internal and external interfaces
    #route(swap);

    xlog("L_NOTICE", "[$ci] initial_internal_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
  }
  else
  {
    xlog("L_NOTICE", "[$ci] unknown interface: $rm [$si:$sp] (from: $fu -> to: $tu)\n");
    sl_send_reply( "403", "Forbidden");
    exit;
  }

/*
  # NOTE: moved rtpengine setup to branch_route[]
  # handle rtpengine setup for new call if early media
  route(transcode);
  if (isflagset(transcode)) {
    if (is_method("INVITE") && has_body("application/sdp")) {
      setflag(early);
      xlog("L_INFO", "[$ci] *** RTPEngine INVITE OFFER\n");
      rtpengine_offer("$var(rtpengine_offerflags)");
    }
  }
*/

    # relay the request
    t_on_branch("initial");
    t_on_reply("transaction");
    if (is_method("CANCEL")) {
      # preserve the received Reason header
      t_relay("8");
    }
    else {
      t_relay();
    }
}

branch_route[initial] {

  xlog("L_NOTICE", "[$ci] branch_initial_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");

  # conditionally swap internal and external interfaces
  route(swap);
  # handle rtpengine setup for new call if early media
  route(transcode);
  if (isflagset(transcode)) {
    if (is_method("INVITE") && has_body("application/sdp")) {
      setflag(early);
      xlog("L_INFO", "[$ci] *** RTPEngine INVITE OFFER\n");
      rtpengine_offer("$var(rtpengine_offerflags)");
    }
  }

  xlog("L_NOTICE", "[$ci] branch_initial_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");

}

route[message]
{

  # message received on external interface
  if ($Rp == 5060 || $Rp == 7080)
  {
    xlog("L_NOTICE", "[$ci] message_external_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu)\n");

    # mark routing logic in request
    append_hf("P-hint: msg-to-hub\r\n");

    # conditionally swap internal and external interfaces
    route(swap);
    # relay the request
    t_on_reply("transaction");
    t_relay();

    xlog("L_NOTICE", "[$ci] message_external_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu)\n");
  }
  # message received on internal interface
  else if ($Rp == 6050 || $Rp == 8070)
  {
    xlog("L_NOTICE", "[$ci] message_internal_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu)\n");

    # mark routing logic in request
    append_hf("P-hint: msg-from-hub\r\n");

    # mid-registrar
    xlog("L_INFO", "[$ci] registrar: msg looking up $ru $rm [$si:$sp] (from: $fu -> to: $tu)\n");
    if (!mid_registrar_lookup("location")) {
      xlog("L_INFO", "[$ci] registrar: msg_unregistered: $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
      if ($rd=~".*.service")
      {
        # can not route .service domains without registration
        xlog("L_INFO", "[$ci] registrar: msg_not_found: $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
        t_reply("404", "Not Found");
        exit;
      }
      # 2.3 and below self-check
      if (uri == myself) {
      # 2.4 and above self-check
      #if (is_myself("$rd")) {
        # uri refers to this proxy (probably from a past registration)
        xlog("L_INFO", "[$ci] registrar: msg_this_server: $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
        t_reply("404", "Not Found");
        exit;
      }
      xlog("L_INFO", "[$ci] registrar: msg_as_addressed $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
      # routing as addressed instead of failing, as NOTIFY is using direct uri
      #t_reply("404", "Not Found");
      #exit;
    }
    else {
      xlog("L_INFO", "[$ci] registrar: msg_lookup_succeeded $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
    }

    # conditionally swap internal and external interfaces
    route(swap);
    # relay the request
    t_on_reply("transaction");
    t_relay();

    # route as addressed
    xlog("L_NOTICE", "[$ci] message_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu)\n");
  }
}

route[sequential] {

    xlog("L_NOTICE", "[$ci] sequential_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");

  # manipulate sdp
  route(sdp);

  route(transcode);
  if (isflagset(transcode)) {
    # NOTE: ACK may not be considered part of INVITE transaction in which case early flag would be out of scope
    #if (!isflagset(early) && is_method("ACK") && has_body("application/sdp")) {
    if (is_method("ACK") && has_body("application/sdp")) {
      xlog("L_INFO", "[$ci] *** RTPEngine ACK ANSWER\n");
      rtpengine_answer("$var(rtpengine_branchflags)");
    }
    # NOTE: not tearing down rtpengine session on CANCEL because it is
    # sequential and branch is not known, and because branch will reply with
    # failure status that will be associated with branch and this will
    # tear down the rtpengine session for this branch
/*
    else if (is_method("CANCEL")) {
      # handle rtpengine teardown
      xlog("L_INFO", "[$ci] *** RTPEngine CANCEL DELETE\n");
      rtpengine_delete("$var(rtpengine_branchflags)");
    }
*/
    else if (is_method("BYE")) {
      # handle rtpengine teardown
      xlog("L_INFO", "[$ci] *** RTPEngine BYE DELETE\n");
      # NOTE: not passing branch flags to rtpengine_delete here since branch is
      # unknown to sequential request in dialog, and we assume all remaining 
      # rtpengine branches will be torn down as a result.
      rtpengine_delete();
    }
    else if (is_method("INVITE") && has_body("application/sdp")) {
      setflag(early);
      xlog("L_INFO", "[$ci] *** RTPEngine INVITE OFFER\n");
      rtpengine_offer("$var(rtpengine_offerflags)");
    }
  }

  # swap internal and external interfaces
  route(swap);

  # relay the request
  t_on_reply("transaction");
  if (is_method("CANCEL")) {
    # preserve the received Reason header
    t_relay("8");
  }
  else {
    t_relay();
  }

  xlog("L_NOTICE", "[$ci] sequential_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
}

route[reply] {

  if (is_method("INVITE") && $rs=~"(180)|(183)|(200)|(202)") {
    #xlog("L_INFO", "[$ci] cross-check: Contact $ct\n");
    #xlog("L_INFO", "[$ci] cross-check: To user \"$tU\" in \"$tu\"\n");
    #xlog("L_INFO", "[$ci] cross-check: Contact user \"$(ct.fields(uri){uri.user})\" in \"$ct.fields(uri)\"\n");
    if($(ct.fields(uri){uri.user}) != null && $tU != $(ct.fields(uri){uri.user})) {
      xlog("L_WARN", "[$ci] cross-check: to-user \"$tU\" in \"$tu\" does not match contact-user \"$(ct.fields(uri){uri.user})\" in \"$ct.fields(uri)\"\n");
    }
  }

  # swap internal and external interfaces
  route(swap);

  # test for issue with natted uc client
  # fix contact header for natted clients
  #if (client_nat_test("3")) {
  if (client_nat_test("9")) {
    fix_contact();
  }

  # maniulate sdp
  route(sdp);

}

onreply_route[transaction] {
  # NOTE: when default onreply_route is present in addition to transaction onreply routes, *both* will be called.

  xlog("L_NOTICE", "[$ci] onreply_transaction_pre: $rs to $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");

  route(transcode);
  if (isflagset(transcode)) {
    if (is_method("INVITE")) {
      #if (t_check_status("1[0-9][0-9]")) {
      if (t_check_status("(180)|(183)|(200)|(202)") && has_body("application/sdp")) {
        if (!isflagset(early)) {
          xlog("L_INFO", "[$ci] *** RTPEngine OK OFFER\n");
          rtpengine_offer("$var(rtpengine_offerflags)");
        }
        else {
          xlog("L_INFO", "[$ci] *** RTPEngine OK ANSWER\n");
          rtpengine_answer("$var(rtpengine_branchflags)");
        }
      }
      else if (t_check_status("[3-9][0-9][0-9]")) {
        # currently disabling delete on failure until support for forking is
        # configured since this delete will tear down the active branch too.
        xlog("L_INFO", "[$ci] *** RTPEngine REJECT DELETE\n");
        rtpengine_delete("$var(rtpengine_branchflags)");
      }
    }
  }

  route(reply);

  xlog("L_NOTICE", "[$ci] onreply_transaction_post: $rs to $rm <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
}

local.cfg

# listing external udp interface first so that it is used by mid-registrar
#  when relaying registrations to the main registrar
# external interfaces
listen=udp:1.1.1.1:5060
listen=tcp:1.1.1.1:5060
listen=udp:1.1.1.1:7080
listen=tcp:1.1.1.1:7080
# internal interfaces
listen=udp:1.1.1.1:6050
listen=tcp:1.1.1.1:6050
listen=udp:1.1.1.1:8070
listen=tcp:1.1.1.1:8070

route[swap] {

  # if message received on external port then switch to internal port
  # if message received on internal port then switch to external port

  # if message is destined for udp then switch to udp
  # if message is destined for tcp then switch to tcp

  if ($fs != NULL) {
    xlog( "L_INFO","[$ci] swap: send socket already set to $fs, not swapping $pr:$Ri:$Rp\n" );
    return;
  }

  # swap external and internal interfaces
  if ($Rp == 5060) {
    # if ruri is using tcp then force tcp in send socket
    if ($rP == "tcp") {
      xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for tcp:1.1.1.1:6050\n" );
      force_send_socket(tcp:1.1.1.1:6050);
    }
    else if ($rP == "udp") {
      xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for udp:1.1.1.1:6050\n" );
      force_send_socket(udp:1.1.1.1:6050);
    }
    else {
      xlog( "L_INFO","[$ci] swap: unknown r-uri proto, not swapping $pr:$Ri:$Rp\n" );
    }
  }
  else if ($Rp == 6050) {
    # if ruri is using tcp then force tcp in send socket
    if ($rP == "tcp") {
      xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for tcp:1.1.1.1:5060\n" );
      force_send_socket(tcp:1.1.1.1:5060);
    }
    else if ($rP == "udp") {
      xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for udp:1.1.1.1:5060\n" );
      force_send_socket(udp:1.1.1.1:5060);
    }
    else {
      xlog( "L_INFO","[$ci] swap: unknown r-uri proto, not swapping $pr:$Ri:$Rp\n" );
    }
  }
  else if ($Rp == 7080) {
    # if ruri is using tcp then force tcp in send socket
    if ($rP == "tcp") {
      xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for tcp:1.1.1.1:8070\n" );
      force_send_socket(tcp:1.1.1.1:8070);
    }
    else if ($rP == "udp") {
      xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for udp:1.1.1.1:8070\n" );
      force_send_socket(udp:1.1.1.1:8070);
    }
    else {
      xlog( "L_INFO","[$ci] swap: unknown r-uri proto, not swapping $pr:$Ri:$Rp\n" );
    }
  }
  else if ($Rp == 8070) {
    # if ruri is using tcp then force tcp in send socket
    if ($rP == "tcp") {
      xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for tcp:1.1.1.1:7080\n" );
      force_send_socket(tcp:1.1.1.1:7080);
    }
    else if ($rP == "udp") {
      xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for udp:1.1.1.1:7080\n" );
      force_send_socket(udp:1.1.1.1:7080);
    }
    else {
      xlog( "L_INFO","[$ci] swap: unknown r-uri proto, not swapping $pr:$Ri:$Rp\n" );
    }
  }
}
vitalikvoip commented 4 years ago

I have a feeling that I'm really close to figure it out. Have logs and coredump when contact_body_t allocated in PKG mem in one worker and then freed in another what crashed it. It has crashed because I have mem debug and q_malloc detected broken memory block. Without memory debug I suppose it may either crash or corrupt PKG memory. That pkg_malloc was called from dlg_update_contact()->parse_contact() during in-dialog CANCEL processing. Actually opensips allocated it in one worker when processed CANCEL and then tried to free when received 200OK for INVITE (the same transaction) in another worker.

To make it clear want to emphasize that in my case crash happens only when UAC sends reINVITE, received 100 Trying, and sends CANCEL immediately. While this "CANCEL" travels across the network, opensips has enough time to forward reINVITE and than it receives CANCEL from UAC and 200 OK for reINVITE from UAS a few moments later.

UAC                 proxy                               UAS
| ---reINVITE->     |                                    |
|  <-----100---     |                                    |
|                   | ---reINVITE, CSeq: 5 INVITE->      |
| ---CANCEL->       |                                    |
|                   |  <--- 200 OK, CSeq: 5 INVITE ---   |
|                 [ bang ]                               |
vitalikvoip commented 4 years ago

The patch attached fixed it for me. At least the problem I was able to reproduce with SIPP.

diff --git a/modules/dialog/dlg_handlers.c b/modules/dialog/dlg_handlers.c
index 8afca4601..4930dd54b 100644
--- a/modules/dialog/dlg_handlers.c
+++ b/modules/dialog/dlg_handlers.c
@@ -1259,6 +1259,11 @@ static void dlg_update_caller_contact(struct cell* t, int type, struct tmcb_para
                return;
        }

+       if (msg->msg_flags&FL_SHM_CLONE) {
+               LM_ERR("skip SHM allocated msg\n",);
+               return;
+       }
+
        dlg_update_contact(dlg, msg, DLG_CALLER_LEG);
 }

@@ -1275,6 +1280,11 @@ static void dlg_update_callee_contact(struct cell* t, int type, struct tmcb_para
                return;
        }

+       if (msg->msg_flags&FL_SHM_CLONE) {
+               LM_ERR("skip SHM allocated msg\n",);
+               return;
+       }
+
        dlg_update_contact(dlg, msg, callee_idx(dlg));
 }
vitalikvoip commented 4 years ago

Unfortunately, it looks like that was not the only crash related to dialog module in 2.4.7

Got a new crash:

#0  context_get_ptr (type=CONTEXT_DIALOG, ctx=0xfa, pos=0) at ../../context.h:179
#1  0x00007f476306a1be in dlg_ctx_get_ptr (dlg=0x12, pos=0) at dlg_ctx.c:65
#2  0x00007f47634eee01 in free_acc_ctx (ctx=0x7f476e3af640) at acc_logic.c:205
#3  0x00007f47634f37c8 in unref_acc_ctx (ctx=0x7f476e3af640) at acc_logic.c:1287
#4  0x00007f47691a3c54 in empty_tmcb_list (head=0x7f476d64a3d8) at t_hooks.c:53
#5  0x00007f4769188a51 in free_cell (dead_cell=0x7f476d64a368) at h_table.c:127
#6  0x00007f47691b6885 in delete_cell (p_cell=0x7f476d64a368, unlock=1) at timer.c:239
#7  0x00007f47691b750a in wait_handler (wait_tl=0x7f476d64a3e8) at timer.c:458
#8  0x00007f47691b94b8 in timer_routine (ticks=135779, set=0x1) at timer.c:1091
#9  0x00000000004a703e in handle_timer_job () at timer.c:731
#10 0x000000000054806c in handle_io (fm=0x7f47799bf0a8, idx=3, event_type=1) at net/net_udp.c:265
#11 0x0000000000546983 in io_wait_loop_epoll (h=0x9014e0 <_worker_io>, t=1, repeat=0) at net/../io_wait_loop.h:280
#12 0x000000000054896b in udp_start_processes (chd_rank=0x80371c <chd_rank>, startup_done=0x7f476bd40b70) at net/net_udp.c:389
#13 0x000000000043d0c5 in main_loop () at main.c:784
#14 0x000000000043fde7 in main (argc=13, argv=0x7ffe1e40b5f8) at main.c:1442
attermann commented 4 years ago

@vitalikvoip Curious did your patch address the crash in free_contacts() during transaction clean-up, or only the dialog-related crash?

vitalikvoip commented 4 years ago

My patch addresses crash I had in dialog module when had described call flow with CANCEL from caller and 200 OK from callee. Had another one with transaction clean up but it's still related to dialog because crashes when cleans dialog context created by acc module. I don't actually need acc at the moment, but still wanted to fix it since I've already dived deep into this part of opensips, but couldn't reproduce it manually and my production proxies stay stable since last week. Will try to analyze when it happens again.

vitalikvoip commented 4 years ago

It's been working without any issues for 5 days. @attermann if your opensips still crashes and you see in coredumps it happens around free_contacts() I think it worth trying my patch.

bogdan-iancu commented 4 years ago

I suspect that there is a TMCB_RESPONSE_OUT (type 128) TM callback (registered by other module) that is parsing the Contact hdr of the request that was cloned in SHM memory. Now, if you have high load, you may have SIP retransmissions, so you may end up with two 200 reply's handled in different procs (in the same time), both trying to "parse" + "use" + "free" the contact hdr over the same sip msg structure (the cloned, shared request that is saved into transaction). Now, the TMCB_RESPONSE_OUT callbacks are executed under lock, so maybe there is some part of the code (outside that callback) that tries to parse the Contact....and overlaps with the callback There are a couple of modules using the TMCB_RESPONSE_OUT callback and potentially parsing the Contact hdr, so let me try to narrow down the possibilities...

vitalikvoip commented 4 years ago

After applying the patch I've posted above I had no more crashes. So maybe in my case it was only due to CANCEL handling since CANCEL is not end-2-end request and dlg update functions dealt with the cloned sip msg. Haven't thought about retransmissions from this angle. Your assumption makes perfect sense and most likely retransmission maybe be another corner case.

bogdan-iancu commented 4 years ago

You suggest that the concurrent access+parsing+free over the request stored in shmem happens between (1) the 200 OK reply and (2) the CANCEL ?

And your patched eliminated this concurrency by not doing any parsing for the 200 OK ?

vitalikvoip commented 4 years ago

The problem was not in concurrency. Back then I couldn't understand why it crashes and the described case was the only one I knew to reproduce the crash. As far as I understood all parsing functions are supposed to be called only on PKG allocated sip_msg received from a wire and before it is cloned to SHM by tm module. Normally, when tm executes TMCB_REQUEST_BUILT callbacks it provides either PKG allocated sip_msg that was read before or faked_req (do_dns_failover) which is cleaned just after usage in the same process. The CANCEL request is an exception since t_cancel.c:cancel_branch() provides an SHM allocated sip_msg (t->uas.request) to callbacks what makes any parsing dangerous. When it happens, one process handles CANCEL request and parses its contact allocating some info in PKG mem and attaching it to SHM allocated sip_msg. Then after a while another process (for example timer) may try to read that data and dereference those PKG pointers and reads some random unexpected data or writes something that leads to memory corruption.

stale[bot] commented 4 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Fernandojdk commented 4 years ago

Cause I have a lot of environment running the latest 3.1 from rpm packages, I can take some coredump info to provide more info if needed.

Let me know if this will help.

vitalikvoip commented 4 years ago

I have it running almost a month without any issues while had at least one crash day before applied the patch.

Fernandojdk commented 4 years ago

I’ll try your patch to see if it will fix for me too.

vitalikvoip commented 4 years ago

I’ll try your patch to see if it will fix for me too.

You've mentioned you are using opensips 3.1. I've tested my patch only with 2.4.7 and don't if this part is still the same in 3.1. Anyhow give it a chance and I can take a look at your coredumps closely when have a spare moment.

Fernandojdk commented 4 years ago

I’m able to locate the block of code on 3.1 where i need to apply your patch. Today I’ll try to see if it fix for me on the latest 3.1.

vitalikvoip commented 4 years ago

@Fernandojdk , did you have a chance to check it with 3.1?

Fernandojdk commented 4 years ago

Hello @vitalikvoip

I applied your patch on opensips 3.1, but the bug continued when we are under high load. For 3.1 this patch did not work. Unfortunately. The frequency at which the bug happens has decreased, but still here.

liviuchircu commented 4 years ago

Hey guys,

@razvancrainea came up with the following patch for 2.4 that should address the issue (PKG-parsed contacts intermingling with a SHM-stored SIP message), would you mind giving it a try? We are undergoing some testing as well on it, as of lately.

git apply <(base64 -d <<EOF | gzip -dc
H4sIAMyQP18AA7VVa0/bSBT97PyKS1ZQx688SKEtBUEJUKQ0XSHYL93KMvbEtmrPuJ5x22jhv++9
fuRVAlRqpcSex52Zc849cx3E0ynYdhgr8LqpCIqEyW4Qe4kIu0ESupHHg4Tl0vHh9vH5VswD9gN2
9wZs1/cGvuP02WB/Ohh6AfR7vb3hsGXb9lOntEzTfPKk42OwXw6tPTDxuQ/Hxy34K+Z+UgQM2o7T
xV/gKc9NijRzovbPs5mXS5ZXL1eJp2N8yb5SlLkhyhdceb5q3o9t2ITWGz+9oArM8+ejJH36g94r
q48SYeO1NSCVQCpPxT7EPIk5w5cCErbIUKw5El2qvPCrGZ8lCRjYsqAelXHmpjIEAx9WC/5rgYYz
UK89wK4feTkYKs0OWqZGR+RMwSH0qFuHuQoMnwYnN+MxjuOqeAr6Fu5pH9UxsLODw5qmV8wi5gWY
e52OhfejK/f04+T65PTaPbeg14G30IO7O1jZodMp4WnjD+7o3YXevpz8c3l9BiKHm79HJ9j6HqtI
FOj7Bj7YwIWCUo+Yh1v/8naHKGlIocg5cbC1e2CJRPHWAdtHJdIAgcNKavUVUASVgNklsMvJ+ccN
yOA2F18YfxybvYzN1O5J8k3A6FRTa3KA6q8G3YpgRntoKo9TN0G98Rh9p8FdTtHeK9zkPMCCnSV2
GPs76NE2y/xKgksMCHeRx+XU/DhM9hs4FwUPoOB1Tpoln7YdQ37Gza1myEkYX3RkeWid4jW5dL2x
L0lFHu48KPN8RNbgAGEDpuq3bEXXMhH+FxcbOv5Lg1JesG0fJSyUn/Dx2Vkwqm5B1yDLqoiB9FIG
86wFgr/Ayy7A4zMVofRgdOv6MXhpvaLysbtv/Znqgdml6sH9NNsIfyk3KznrwOFh5TXQFnf8RhKD
dZbwltJ+BFO0YPVBge0MBAc8CrYDskMFRnvYFBaUJDC6unJlEgq+lgZ73axaKJQAxoOyhtzTo2Q5
TbxQwt0hjMYX7vn45MI9fX8yuTgbHSyXqyuWJZ5PdEQS/AqTKnm7e3XxH+5b/dd/KH1Y5dHNMkrd
1EtQDn05QY0xtzBqUYrPrq70Nl11xkURRpCyVOSzkhBn3xuei/K2SepaabtfSl19Y+rOsuz31ZV5
0Fq4ovxMbYxAFlRkFj2KRcB+Nvt1v9LauU1L0RdZRbzPtOgzDGrWapAYqMKbumg8IGKtYakaZYoK
+DRnK9VdNTtSJL6QBlWzxk3fRBys+Ah9wPJ1O5GVDMBqQ65Ts4zNLaVS/9bFauelEoxMdlr/AycT
XeKQCgAA
EOF
)

LE: I have edited the above patch, do not use the version included in the email! :)

Fernandojdk commented 4 years ago

Hello guys,

I've changed the code posted by @liviuchircu and ported to 3.1 to see if it will fix this issue.

On the next week i can tell if this fix works.

razvancrainea commented 4 years ago

We've tested this in several environments and crashes stopped to happened, hence I believe this issue is fixed. If you're still facing this problem, please re-open the ticket with further info.

Fernandojdk commented 4 years ago

Hello guys.

Just for feedback. I already adapted the above fix to work with 3.1 and until now we have no crash anymore. So i think this issue is fixed.

Thanks all 🥇

attermann commented 4 years ago

Hi @liviuchircu. I was able to apply your patch above to 2.4.7 and will be testing it in production over the next week. I also tried applying it to 2.4.8 but it failed. Is this patch already in 2.4.8?

liviuchircu commented 4 years ago

Yes it is, @razvancrainea pushed the same fix in df68e3b19