Closed attermann closed 4 years ago
I would really appreciate a response to this bug report. If more details are required to troubleshoot this I will provide whatever I can. I am also happy to work on a fix for this myself but I'd like to hear back first to know whether perhaps this has already been fixed since there was a similar report to mine.
It's worth noting that since reverting back to 2.3 (2.3.6 to be exact) we have not encountered this segfault again.
hey @attermann -- we're pretty busy these days, getting the 3.1 release ready. Expect this issue to get more attention once we hit the "feature freeze" point, within less than a couple of weeks.
Meanwhile, if you can reproduce this crash in a lab, I highly recommend you use the git bisect
feature and help us locate the commit which broke it. For example, on latest 2.4
branch, you could search for the commit which broke it using this efficient technique:
git bisect start # begin hunting the bad commit
git bisect bad # mark the current, last commit as "bad"
git checkout a28b3d4ae5c9 # go to your 2.3.x "found-to-be-good" commit
git bisect good # actually mark this commit as "good"
# at this point, git will select a half-way commit. Re-compile and re-test. Based on your results, do either:
git bisect good / git bisect bad
# at this point, git will select a half-way commit. Re-compile and re-test. Based on your results, do either:
git bisect good / git bisect bad
...
# keep going until you find the exact commit which broke things. This will take logarithmic time (say, 5-8 steps)
Cheers!
Also have this problem and can't reproduce it intentionally. Used 2.4.5 and have another issue:
May 20 12:04:18 west /usr/sbin/opensips[1745]: CRITICAL:core:io_watch_add: #012>>> [TCP_main] BUG trying to overwrite entry 738 in the hash(738, 19, 0x7fbab137ea38,1) with (738, 19, 0x7fbab1235138,1)#012#012It seems you have hit a programming bug. Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues
Updated to 2.4.7 and have these crashes. It crashes either when tm wants to clean a transaction or in dialog when it wants to get a contact from sip_msg->contact->parsed. "sip_msg->contact->parsed" looks like a valid pointer but the memory it points to looks overwritten:
#3 0x000000000050cb6d in free_contact (_c=0x7f81ba070a20) at parser/contact/parse_contact.c:107
107 parser/contact/parse_contact.c: No such file or directory.
(gdb) p _c
$1 = (contact_body_t **) 0x7f81ba070a20
(gdb) p **_c
$2 = {star = 0 '\000', contacts = 0x63203a73253a6572}
#0 0x00000000004d7fd4 in do_free_params (_p=0xa736e6f69, _shm=0) at parser/parse_param.c:453
#1 0x00000000004d8048 in free_params (_p=0xa736e6f69) at parser/parse_param.c:465
#2 0x000000000050c22f in free_contacts (_c=0x7f81c697b058) at parser/contact/contact.c:298
#3 0x000000000050cb6d in free_contact (_c=0x7f81ba070a20) at parser/contact/parse_contact.c:107
#4 0x00000000004c2f59 in clean_hdr_field (hf=0x7f81ba0709f0) at parser/hf.c:87
#5 0x00007f81b60f20b8 in clean_msg_clone (msg=0x7f81ba06e360, min=0x7f81ba06e360, max=0x7f81ba0710f0) at sip_msg.h:116
#6 0x00007f81b60f2cc0 in run_trans_callbacks_locked (type=128, trans=0x7f81ba0d8bf0, req=0x7f81ba06e360, rpl=0x7f81c697b188, code=200) at t_hooks.c:265
#7 0x00007f81b6101e0d in relay_reply (t=0x7f81ba0d8bf0, p_msg=0x7f81c697b188, branch=0, msg_status=200, cancel_bitmap=0x7ffc7517ad2c) at t_reply.c:1358
#8 0x00007f81b610307c in reply_received (p_msg=0x7f81c697b188) at t_reply.c:1608
#9 0x00000000004364a6 in forward_reply (msg=0x7f81c697b188) at forward.c:495
#10 0x0000000000474e2d in receive_msg (
buf=0x8dd860 <tcp_current_req> "SIP/2.0 200 OK\r\nCSeq: 3 INVITE\r\nCall-ID: 4fb0a15e993d8b7a-c8a2fa2d29abf7c7\r\nFrom: \"jhon.doe\" <sip:123123123@john.com>;tag=241273b736d581e8\r\nTo: <sip:333222111@john.com>;tag=60147127_c9b66bbf"..., len=4711,
rcv_info=0x7ffc7517aef0, existing_context=0x0, flags=0) at receive.c:258
#11 0x000000000054c7f2 in tcp_handle_req (req=0x8dd860 <tcp_current_req>, con=0x7f81b9041f48, _max_msg_chunks=10) at net/proto_tcp/tcp_common.h:411
#12 0x0000000000551f3a in tcp_read_req (con=0x7f81b9041f48, bytes_read=0x7ffc7517b0ac) at net/proto_tcp/proto_tcp.c:1275
#13 0x000000000053fb02 in handle_io (fm=0x7f81c690c0f0, idx=0, event_type=1) at net/net_tcp_proc.c:255
#14 0x000000000053e055 in io_wait_loop_epoll (h=0x8fe4e0 <_worker_io>, t=2, repeat=0) at net/../io_wait_loop.h:280
#15 0x0000000000540c1a in tcp_worker_proc_loop () at net/net_tcp_proc.c:401
#16 0x00000000005380bc in tcp_start_processes (chd_rank=0x80071c <chd_rank>, startup_done=0x7f81b8c8db70) at net/net_tcp.c:2063
#17 0x000000000043ce44 in main_loop () at main.c:790
#18 0x000000000043fae9 in main (argc=13, argv=0x7ffc7517b3e8) at main.c:1442
@attermann do you use dialog, rtpengine, rest_client modules?
I've briefly checked what changed and seen some contact related changes in dialog module. It's just an assumption and I have no proofs. I've rebuilt opensips with q_malloc and memdbg and sometimes it notifies about overwritten memory chunks.
@vitalikvoip Yes I do use the dialog module but only on a small percentage of total calls. I do have one core that crashed in dialog though. Backtrace below for reference.
#0 0x0000000000523d1f in fm_remove_free (n=0x7f076df73418, qm=0x7f076de7a010)
at mem/f_malloc.c:209
pf = 0x0
hash = 0
#1 fm_malloc (qm=0x7f076de7a010, size=0) at mem/f_malloc.c:438
frag = 0x7f076df73418
n = <optimized out>
hash = 0
__FUNCTION__ = "fm_malloc"
#2 0x00007f074bbe18bc in build_extra_hdr (cell=0x7f0752fd9890,
str_hdr=0x7ffce3f9e2d0, extra_hdrs=0x0) at dlg_req_within.c:337
p = <optimized out>
#3 dlg_end_dlg (dlg=dlg@entry=0x7f0752fd9890,
extra_hdrs=extra_hdrs@entry=0x0, send_byes=send_byes@entry=1)
at dlg_req_within.c:459
str_hdr = {s = 0x0, len = 0}
t = 0x10
i = <optimized out>
res = 0
__FUNCTION__ = "dlg_end_dlg"
#4 0x00007f074bbf8d4d in dlg_ontimeout (tl=0x7f0752fd98e0)
at dlg_handlers.c:2095
old_ctx = <optimized out>
new_ctx = <optimized out>
dlg = 0x7f0752fd9890
new_state = 0
old_state = 757703520
unref = 5
do_expire_actions = 1
__FUNCTION__ = "dlg_ontimeout"
#5 0x00007f074bbdcac9 in dlg_timer_routine (ticks=<optimized out>,
attr=<optimized out>) at dlg_timer.c:602
tl = 0xffffffffffffffff
ctl = 0x7f0752fd98e0
__FUNCTION__ = "dlg_timer_routine"
#6 0x00000000004ba09c in handle_timer_job () at timer.c:731
t = 0x7f074e186440
__FUNCTION__ = "handle_timer_job"
#7 0x00000000005e874a in handle_io (fm=0x7f076debfae8, idx=idx@entry=117,
event_type=event_type@entry=1) at net/net_tcp_proc.c:134
ret = 0
n = <optimized out>
con = <optimized out>
s = 302
rw = <optimized out>
resp = <optimized out>
response = {139669392015104, 1}
__FUNCTION__ = "handle_io"
#8 0x00000000005ea88b in io_wait_loop_epoll (h=<optimized out>,
t=<optimized out>, repeat=<optimized out>) at net/../io_wait_loop.h:280
ep_event = {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0, u64 = 0}}
ret = <optimized out>
n = 1
r = 117
e = <optimized out>
fd = <optimized out>
i = <optimized out>
#9 tcp_worker_proc_loop () at net/net_tcp_proc.c:396
__FUNCTION__ = "tcp_worker_proc_loop"
#10 0x00000000005f87fb in tcp_start_processes (
chd_rank=chd_rank@entry=0x998104 <chd_rank>,
startup_done=startup_done@entry=0x0) at net/net_tcp.c:1909
r = 23
reader_fd = {187, 189}
pid = <optimized out>
__FUNCTION__ = "tcp_start_processes"
#11 0x000000000041d343 in main_loop () at main.c:790
startup_done = 0x0
chd_rank = 40
rc = <optimized out>
#12 main (argc=<optimized out>, argv=<optimized out>) at main.c:1442
cfg_stream = <optimized out>
c = <optimized out>
r = <optimized out>
tmp = 0x7ffce3f9eee1 ""
tmp_len = <optimized out>
port = <optimized out>
proto = <optimized out>
protos_no = <optimized out>
options = 0x639290 "f:cCm:M:b:l:n:N:rRvdDFEVhw:t:u:g:P:G:W:o:"
ret = -1
seed = 4044006346
__FUNCTION__ = "main"
@vitalikvoip and @attermann : I did some stress tests on 2.4.7
(f8e19b62a) with "dialog" and "topology_hiding", but could not find any crash. If I remember correctly, the free_contacts()
crash resembles some kind of PKG memory corruption, so any module could be the culprit, really.
Bottom line: please provide either your opensips.cfg
or a minimal version of it that still triggers the crash (ship to liviu@opensips.org if you cannot make it public), so I can study/compare them and get closer to a diagnosis on this memory corruption. Alternatively, if you can help me reproduce the crash on my lab setup, the bug is 80%+ solved.
Cheers!
@liviuchircu Unfortunately I haven't been able to reproduce outside of production so I'm going to have to send my whole config. I trimmed a lot of commented sections and anonymized so hopefully I didn't break it. Note that most of the "external" connections are TCP, and "internal" are UDP.
opensips.cfg:
####### Global Parameters #########
#log_level=2
log_level=3
#log_level=4
log_stderror=no
log_facility=LOG_LOCAL1
children=4
/* uncomment the following line to enable debugging */
#debug_mode=yes
/* comment the next line to enable the auto discovery of local aliases
based on revers DNS on IPs */
auto_aliases=no
# CBA
check_via=no
dns=no
rev_dns=no
dns_try_ipv6=no
dns_retr_time=1
dns_retr_no=3
disable_dns_blacklist=yes
disable_dns_failover=no
dns_use_search_list=no
# CBA TCP related params
open_files_limit=16384
#tcp_children=16
tcp_children=32
tcp_max_connections=16384
# test params for TCP connection keep-alive (to resolve webrtc client port changing by setting connection lifetime to exceed registration interval)
#tcp_connection_lifetime=130
tcp_connection_lifetime=610
#tcp_keepalive=1
#tcp_keepcount=3
#tcp_keepidle=300
#tcp_keepinterval=300
# CBA performance checks
event_pkg_threshold = 90
event_shm_threshold = 90
#exec_dns_threshold = 60000
exec_dns_threshold = 100000 /* 100 milliseconds */
#exec_msg_threshold = 60000
exec_msg_threshold = 1000000 /* 1 second */
#tcp_threshold = 60000
tcp_threshold = 100000 /* 100 milliseconds */
# include local config
include_file "local.cfg"
# ------------------ module loading ----------------------------------
#mpath="/usr/lib/x86_64-linux-gnu/opensips/modules"
mpath="/usr/local/opensips/lib64/opensips/modules"
#### SIGNALING module
loadmodule "signaling.so"
#### StateLess module
loadmodule "sl.so"
#### Transaction Module
loadmodule "tm.so"
modparam("tm", "fr_timeout", 5)
#modparam("tm", "fr_inv_timeout", 30)
modparam("tm", "fr_inv_timeout", 120)
modparam("tm", "restart_fr_on_each_reply", 0)
modparam("tm", "onreply_avp_mode", 0)
#### Record Route Module
loadmodule "rr.so"
/* do not append from tag to the RR (no need for this script) */
modparam("rr", "append_fromtag", 0)
#### MAX ForWarD module
loadmodule "maxfwd.so"
#### SIP MSG OPerationS module
loadmodule "sipmsgops.so"
#### FIFO Management Interface
loadmodule "mi_fifo.so"
modparam("mi_fifo", "fifo_name", "/tmp/opensips_fifo")
modparam("mi_fifo", "fifo_mode", 0666)
#### URI module
loadmodule "uri.so"
modparam("uri", "use_uri_table", 0)
#### UDP protocol
loadmodule "proto_udp.so"
loadmodule "proto_tcp.so"
#### TEXTOPS module
loadmodule "textops.so"
#### DbText module
loadmodule "db_text.so"
modparam("db_text", "db_mode", 1)
#### nat traversal module
loadmodule "nat_traversal.so"
modparam("nat_traversal", "keepalive_state_file", "/var/run/opensips/keepalive_state")
##### RTPengine module
loadmodule "rtpengine.so"
modparam("rtpengine", "extra_id_pv", "$avp(extra_id)")
# "rtpengine_sock" is only means to specify rtpengine server in 2.3 and below
modparam("rtpengine", "rtpengine_sock", "udp:1.1.1.2:2223 udp:1.1.1.3:2223")
# db_url" for specifying rtpengine servers is only supported in 2.4 and above
#modparam("rtpengine", "db_url", "text:///etc/opensips/dbtext")
##### user location module
loadmodule "usrloc.so"
modparam("usrloc", "use_domain", 1)
modparam("usrloc", "nat_bflag", "NAT")
modparam("usrloc", "db_mode", 0)
#modparam("usrloc", "db_mode", 1)
#modparam("usrloc", "db_url", "mysql://opensips:opensipsrw@localhost/opensips")
##### registrar module
loadmodule "registrar.so"
/* uncomment the next line not to allow more than 10 contacts per AOR */
modparam("registrar", "max_contacts", 10)
##### Mid-Registrar module
loadmodule "mid_registrar.so"
modparam("mid_registrar", "mode", 2) /* 0 = contact-mirroring / 1 = contact-throttling / 2 = aor-throttling */
modparam("mid_registrar", "outgoing_expires", 7200)
#modparam("mid_registrar", "insertion_mode", 0) /* 0 = contact; 1 = path */
/* uncomment the next line not to allow more than 10 contacts per AOR */
modparam("mid_registrar", "max_contacts", 10)
# CBA set TCP connections to close only after reg expiry
modparam("mid_registrar", "tcp_persistent_flag", 10)
##### dialog module (for call duration limits)
loadmodule "dialog.so"
##### dns_cache module
loadmodule "cachedb_local.so"
modparam("cachedb_local", "cachedb_url", "local://")
loadmodule "dns_cache.so"
modparam("dns_cache", "cachedb_url", "local://")
modparam("dns_cache", "blacklist_timeout", 10) # 10 seconds
####### Routing Logic ########
# main request routing logic
route {
# perform initial message validation
if (!mf_process_maxfwd_header("10"))
{
xlog("L_NOTICE", "[$ci] sanity: rejecting due to maxfwd\n");
sl_send_reply( "483", "Too Many Hops");
exit;
}
if ($ml >= 16384)
{
xlog("L_NOTICE", "[$ci] sanity: rejecting due to message size\n");
sl_send_reply( "513", "Message too big");
exit;
};
if ($ua =~ "friendly-scanner")
{
xlog("L_NOTICE", "[$ci] blacklist: rejecting due to user-agent\n");
# do not reply for stealth
#sl_send_reply("403", "Forbidden");
exit;
};
# OPTIONS processing
if (is_method( "OPTIONS"))
{
# ReachUC clients require response to OPTIONS
# only reply on internal port(s) for stealth
#if ($Rp == 6050 || $Rp == 8070)
sl_send_reply( "200", "OK");
exit;
}
# check for sequential request within a dialog (indicated by tag in To header)
if (has_totag())
{
# sequential request within a dialog should
# take the path determined by record-routing
if (loose_route())
{
if (is_method( "INVITE|BYE|ACK|CANCEL|REFER|INFO|SUBSCRIBE|NOTIFY"))
{
# test for issue with natted uc client
# fix contact header for natted clients
#if (client_nat_test("3")) {
if (client_nat_test("9")) {
fix_contact();
}
# mark routing logic in request
append_hf("P-hint: rr-enforced\r\n");
route(sequential);
}
else
{
xlog("L_WARN", "[$ci] except_lr_method: $rm (from: $fu -> to: $tu)\n");
}
}
# sequential without record-route should be treated as initial
# (currently only applies to ACK to a failure response)
else if (is_method( "ACK"))
{
# mark routing logic in request
append_hf("P-hint: rr-absent\r\n");
route(initial);
}
else
{
#xlog("L_WARN", " [$ci] except_loose: $rm (from: $fu -> to: $tu)\n");
append_hf("P-hint: rr-absent\r\n");
route(initial);
}
exit;
}
# record routing for dialog forming requests (in case they are routed)
# remove any preloaded route headers
remove_hf( "Route");
# request is an initial request, so apply routing logic
# check for dialog-forming requests (INVITE and SUBSCRIBE) that should get record-route
if (is_method( "INVITE|SUBSCRIBE"))
{
# test for issue with natted uc client
# fix contact header for natted clients
#if (client_nat_test("3")) {
if (client_nat_test("9")) {
fix_contact();
}
record_route();
route(initial);
}
# allow initial ACK and CANCEL requests as well (outside of dialog)
else if (is_method( "ACK|CANCEL"))
{
# sequential request without to-tag? (ie, ACK to a malformed 407 response without to-tag, or CANCEL request)
route(initial);
}
# check for special-case for REGISTER request
else if (is_method( "REGISTER"))
{
# fix contact header for natted clients
#if (client_nat_test("3")) {
if (client_nat_test("9")) {
fix_contact();
}
# TODO determine if record-route is appropriate for REGISTER
record_route();
route(register);
}
# check for special-case for MESSAGE request
else if (is_method( "MESSAGE"))
{
route(message);
}
# check for non-dialog-forming requests
else if (is_method( "SUBSCRIBE|NOTIFY|INFO"))
{
route(initial);
}
else
{
xlog("L_WARN", "[$ci] unsupported: $rm (from: $fu -> to: $tu)\n");
}
}
route[sdp] {
# apply SDP manipulation
if(has_body("application/sdp")) {
# fix IPv6 media
if (search_body("[Cc]=[Ii][Nn] [Ii][Pp]6")) {
xlog("L_INFO", "[$ci] IPv6 SDP detected in initial request\n");
#fix_nated_sdp("2", "10.11.12.13");
#fix_nated_sdp("2");
# replace ipv6 IP with fake private IP to invoke IP detection logic in NMS
replace_body_all("[Cc]=[Ii][Nn] [Ii][Pp]6.*", "c=IN IP4 10.11.12.13");
}
# fix NS webrtc media
#if (search_body("[Oo]=[Nn][Ee][Tt][Ss][Aa][Pp][Ii][Ee][Nn][Ss]_[Nn][Mm][Ss].*")) {
if (false) {
#xlog("L_INFO", "[$ci] NMS SDP detected\n");
if (search_body("[Aa]=[Mm][Ss][Ii][Dd]-[Ss][Ee][Mm][Aa][Nn][Tt][Ii][Cc].*")) {
xlog("L_INFO", "[$ci] removing msid-semantic from NMS SDP\n");
replace_body_all("[Aa]=[Mm][Ss][Ii][Dd]-[Ss][Ee][Mm][Aa][Nn][Tt][Ii][Cc].*", "a=removed");
}
if (search_body("[Aa]=[Mm][Ii][Dd]:[Aa][Uu][Dd][Ii][Oo].*")) {
xlog("L_INFO", "[$ci] removing mid:audio from NMS SDP\n");
replace_body_all("[Aa]=[Mm][Ii][Dd]:[Aa][Uu][Dd][Ii][Oo].*", "a=removed");
}
}
}
}
route[transcode]
{
# check for switch to transcode call
if (is_method("INVITE|ACK|CANCEL|BYE") && ($Rp == 7080 || $Rp == 8070)) {
setflag(transcode);
xlog("L_INFO", "[$ci] *** SET transcode=true\n");
}
if (isflagset(transcode)) {
# create unique rtpengine id for this branch transaction
#$avp(extra_id) = "branch." + $T_branch_idx;
#$avp(extra_id) = $hdr(Via) + "." + $T_branch_idx;
#$avp(extra_id) = $ci + "-" + $T_branch_idx;
$avp(extra_id) = $ci + "-" + $ft + "-" + $T_branch_idx;
xlog("L_INFO", "[$ci] extra_id: $avp(extra_id)\n");
$var(rtpengine_branchflags) = "via-branch=extra";
xlog("L_INFO", "[$ci] *** SET rtpengine_branchflags=$var(rtpengine_branchflags)\n");
# codec-mask-CODEC: do not present CODEC to B-side, but use it on A-side
# codec-strip-CODEC: do not use CODEC for A-side
# transcode-CODEC: offer CODEC to B-side
# external interface, coming from client
if ($Rp == 5060 || $Rp == 7080) {
# codec-mask-opus: make the A-side receive opus codec, but B-side will use something else
# codec-strip-PCMU codec-strip-PCMA: prevent the A-side from receiving either PCMU or PCMA codec
# transcode-PCMU transcode-PCMA: offer either PCMU or PCMA codec to B-side
#$var(rtpengine_offerflags) = "codec-mask-iLBC codec-mask-opus codec-strip-PCMU codec-strip-PCMA transcode-PCMU transcode-PCMA";
$var(rtpengine_offerflags) = "via-branch=extra codec-mask-iLBC codec-mask-opus codec-strip-PCMU codec-strip-PCMA transcode-PCMU transcode-PCMA";
}
# internal interface, coming from pbx
else if ($Rp == 6050 || $Rp == 8070) {
# codec-mask-PCMU codec-mask-PCMA: make the A-side receive PCMU or PCMA codec, but B-side will use something else
# codec-strip-opus: prevent the A-side from receiving opus codec
# transcode-opus: offer opus codec to B-side
#$var(rtpengine_offerflags) = "codec-mask-PCMU codec-mask-PCMA codec-strip-iLBC codec-strip-opus transcode-iLBC transcode-opus";
$var(rtpengine_offerflags) = "via-branch=extra codec-mask-PCMU codec-mask-PCMA codec-strip-iLBC codec-strip-opus transcode-iLBC transcode-opus";
}
xlog("L_INFO", "[$ci] *** SET rtpengine_offerflags=$var(rtpengine_offerflags)\n");
}
}
route[register]
{
xlog("L_NOTICE", "[$ci] register_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, contact: $ct)\n");
if ($rd=~".*.service")
{
# handle service non-fqdn domains
xlog("L_INFO", "[$ci] service_domain: $rd\n");
# force static route to service proxy
# service calls routed to <reseller>.hpbx.example.com
$var(domain) = $rd;
# route to hostname derived from domain
$var(proxy) = "sip:" + $(var(domain){s.select,-2,.}) + ".hpbx.example.com";
}
else if ($rd=~".*.other.net")
{
# handle other domains
xlog("L_INFO", "[$ci] other_domain: $rd\n");
# force static route to other proxy
# other calls routed directly to domain
$var(proxy) = "sip:" + $rd;
}
else
{
xlog("L_NOTICE", "[$ci] register unknown_domain: $rd, REJECTING\n");
# do not reply for stealth
#sl_send_reply("403", "Forbidden");
exit;
}
# mid-registrar
#bm_start_timer("mid-reg-save");
mid_registrar_save("location");
#mid_registrar_save("location", "r");
# note call to bm_log_time belows likely disrupts $retcode checked next
#bm_log_timer("mid-reg-save");
switch ($retcode) {
case 1:
# 1 (success) - current REGISTER request must be dispatched by the script writer over to the main registrar
xlog("L_INFO", "[$ci] registrar: relaying registration for \"$ru\" to \"$var(proxy)\" [$si:$sp] $ru (from: $fu -> to: $tu)\n");
$ru = $var(proxy);
# swap internal and external interfaces
route(swap);
xlog("L_NOTICE", "[$ci] register_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, contact: $ct)\n");
# relay the request
t_on_reply("transaction");
t_relay();
break;
case 2:
# 2 (success) - current REGISTER request has been absorbed by the mid-registrar; a 200 OK reply has been sent upstream
xlog("L_INFO", "[$ci] registrar: absorbing registration for \"$ru\" [$si:$sp] (from: $fu -> to: $tu)\n");
break;
default:
# -1 (error) - generic error code; the logs should provide more help
xlog("L_WARN", "[$ci] registrar: failed to save registration for \"$ru\"! [$si:$sp] (from: $fu -> to: $tu)\n");
sl_send_reply("500", "Server Internal Error");
exit;
}
}
route[initial]
{
# maniulate sdp
route(sdp);
# external interface
if ($Rp == 5060 || $Rp == 7080)
{
xlog("L_NOTICE", "[$ci] initial_external_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
# mark routing logic in request
append_hf("P-hint: external\r\n");
if ($rd=~".*.service")
{
# handle service non-fqdn domains
xlog("L_INFO", "[$ci] service_domain: $rd\n");
# service calls routed to <reseller>.hpbx.example.com
$var(domain) = $rd;
# route to hostname derived from domain
$var(proxy) = "sip:" + $(var(domain){s.select,-2,.}) + ".hpbx.example.com";
# temporarily routing to alternate host
#$var(proxy) = "sip:ns-dev01.whitelabelvoip.net";
# honor proxy request if present
# xlog("L_INFO", "[$ci] *** X-RUC-Proxy: $hdr(X-RUC-Proxy)\n");
# if ($(hdr(X-RUC-Proxy){s.len}) > 0) {
# # set proxy
# xlog("L_INFO", "[$ci] *** Setting new proxy!\n");
# $var(proxy) = "sip:" + $hdr(X-RUC-Proxy);
# }
xlog("L_INFO", "[$ci] proxy: $var(proxy)\n");
$du = $var(proxy);
}
else if ($rd=~".*.other.net")
{
# handle direct domains
xlog("L_INFO", "[$ci] other_domain: $rd\n");
$du = "sip:" + $rd;
}
else
{
xlog("L_NOTICE", "[$ci] initial unknown_domain: $rd, REJECTING\n");
# do not reply for stealth
#sl_send_reply("403", "Forbidden");
exit;
}
# limit outgoing call duration to specified limit
xlog("L_INFO", "[$ci] *** X-RUC-Limit: $hdr(X-RUC-Limit) (len: $(hdr(X-RUC-Limit){s.len}))\n");
# if ($hdr(X-RUC-Limit)!="0") {
# if ($(hdr(X-RUC-Limit){s.len}) > 0 && $hdr(X-RUC-Limit)!="0") {
if ($hdr(X-RUC-Limit) != null && $(hdr(X-RUC-Limit){s.len}) > 0 && $hdr(X-RUC-Limit) != "0") {
# set outgoing call duration limit
xlog("L_INFO", "[$ci] *** Setting call duration limit of $hdr(X-RUC-Limit) seconds!\n");
create_dialog("B");
$DLG_timeout = $hdr(X-RUC-Limit);
}
# NOTE: moved to branch_route[initial]
# conditionally swap internal and external interfaces
#route(swap);
xlog("L_NOTICE", "[$ci] initial_external_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
}
# internal interface
else if ($Rp == 6050 || $Rp == 8070)
{
# NOTE: ACL now handled by UFW blocking of internal ports to external IPs
xlog("L_NOTICE", "[$ci] initial_internal_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
# mark routing logic in request
append_hf("P-hint: internal\r\n");
# mid-registrar
xlog("L_INFO", "[$ci] registrar: looking up \"$ru\" $rm [$si:$sp] (from: $fu -> to: $tu)\n");
if (!mid_registrar_lookup("location")) {
xlog("L_INFO", "[$ci] registrar: could not find \"$ru\" $rm [$si:$sp] (from: $fu -> to: $tu)\n");
if (is_method( "INVITE") && $rd=~".*.service")
{
# can not route .service domains without registration
xlog("L_INFO", "[$ci] registrar: rejecting request, \"$ru\" not found in lookup $rm [$si:$sp] (from: $fu -> to: $tu)\n");
t_reply("404", "Not Found");
exit;
}
# 2.3 and below self-check
else if (uri == myself) {
# 2.4 and above self-check
#else if (is_myself("$rd")) {
# uri refers to this proxy (probably from a past registration)
xlog("L_INFO", "[$ci] registrar: rejecting request, \"$ru\" refers to myself: $rm [$si:$sp] (from: $fu -> to: $tu)\n");
t_reply("404", "Not Found");
exit;
}
xlog("L_INFO", "[$ci] registrar: routing request as addressed to \"$ru\" $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
# routing as addressed instead of failing, as NOTIFY is using direct uri
#t_reply("404", "Not Found");
#exit;
}
else {
xlog("L_INFO", "[$ci] registrar: found \"$ru\", routing to contact $rm [$si:$sp] (from: $fu -> to: $tu)\n");
}
# NOTE: moved to branch_route[initial]
# conditionally swap internal and external interfaces
#route(swap);
xlog("L_NOTICE", "[$ci] initial_internal_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
}
else
{
xlog("L_NOTICE", "[$ci] unknown interface: $rm [$si:$sp] (from: $fu -> to: $tu)\n");
sl_send_reply( "403", "Forbidden");
exit;
}
/*
# NOTE: moved rtpengine setup to branch_route[]
# handle rtpengine setup for new call if early media
route(transcode);
if (isflagset(transcode)) {
if (is_method("INVITE") && has_body("application/sdp")) {
setflag(early);
xlog("L_INFO", "[$ci] *** RTPEngine INVITE OFFER\n");
rtpengine_offer("$var(rtpengine_offerflags)");
}
}
*/
# relay the request
t_on_branch("initial");
t_on_reply("transaction");
if (is_method("CANCEL")) {
# preserve the received Reason header
t_relay("8");
}
else {
t_relay();
}
}
branch_route[initial] {
xlog("L_NOTICE", "[$ci] branch_initial_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
# conditionally swap internal and external interfaces
route(swap);
# handle rtpengine setup for new call if early media
route(transcode);
if (isflagset(transcode)) {
if (is_method("INVITE") && has_body("application/sdp")) {
setflag(early);
xlog("L_INFO", "[$ci] *** RTPEngine INVITE OFFER\n");
rtpengine_offer("$var(rtpengine_offerflags)");
}
}
xlog("L_NOTICE", "[$ci] branch_initial_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
}
route[message]
{
# message received on external interface
if ($Rp == 5060 || $Rp == 7080)
{
xlog("L_NOTICE", "[$ci] message_external_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu)\n");
# mark routing logic in request
append_hf("P-hint: msg-to-hub\r\n");
# conditionally swap internal and external interfaces
route(swap);
# relay the request
t_on_reply("transaction");
t_relay();
xlog("L_NOTICE", "[$ci] message_external_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu)\n");
}
# message received on internal interface
else if ($Rp == 6050 || $Rp == 8070)
{
xlog("L_NOTICE", "[$ci] message_internal_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu)\n");
# mark routing logic in request
append_hf("P-hint: msg-from-hub\r\n");
# mid-registrar
xlog("L_INFO", "[$ci] registrar: msg looking up $ru $rm [$si:$sp] (from: $fu -> to: $tu)\n");
if (!mid_registrar_lookup("location")) {
xlog("L_INFO", "[$ci] registrar: msg_unregistered: $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
if ($rd=~".*.service")
{
# can not route .service domains without registration
xlog("L_INFO", "[$ci] registrar: msg_not_found: $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
t_reply("404", "Not Found");
exit;
}
# 2.3 and below self-check
if (uri == myself) {
# 2.4 and above self-check
#if (is_myself("$rd")) {
# uri refers to this proxy (probably from a past registration)
xlog("L_INFO", "[$ci] registrar: msg_this_server: $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
t_reply("404", "Not Found");
exit;
}
xlog("L_INFO", "[$ci] registrar: msg_as_addressed $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
# routing as addressed instead of failing, as NOTIFY is using direct uri
#t_reply("404", "Not Found");
#exit;
}
else {
xlog("L_INFO", "[$ci] registrar: msg_lookup_succeeded $rm [$si:$sp] $ru (from: $fu -> to: $tu)\n");
}
# conditionally swap internal and external interfaces
route(swap);
# relay the request
t_on_reply("transaction");
t_relay();
# route as addressed
xlog("L_NOTICE", "[$ci] message_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu)\n");
}
}
route[sequential] {
xlog("L_NOTICE", "[$ci] sequential_pre: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
# manipulate sdp
route(sdp);
route(transcode);
if (isflagset(transcode)) {
# NOTE: ACK may not be considered part of INVITE transaction in which case early flag would be out of scope
#if (!isflagset(early) && is_method("ACK") && has_body("application/sdp")) {
if (is_method("ACK") && has_body("application/sdp")) {
xlog("L_INFO", "[$ci] *** RTPEngine ACK ANSWER\n");
rtpengine_answer("$var(rtpengine_branchflags)");
}
# NOTE: not tearing down rtpengine session on CANCEL because it is
# sequential and branch is not known, and because branch will reply with
# failure status that will be associated with branch and this will
# tear down the rtpengine session for this branch
/*
else if (is_method("CANCEL")) {
# handle rtpengine teardown
xlog("L_INFO", "[$ci] *** RTPEngine CANCEL DELETE\n");
rtpengine_delete("$var(rtpengine_branchflags)");
}
*/
else if (is_method("BYE")) {
# handle rtpengine teardown
xlog("L_INFO", "[$ci] *** RTPEngine BYE DELETE\n");
# NOTE: not passing branch flags to rtpengine_delete here since branch is
# unknown to sequential request in dialog, and we assume all remaining
# rtpengine branches will be torn down as a result.
rtpengine_delete();
}
else if (is_method("INVITE") && has_body("application/sdp")) {
setflag(early);
xlog("L_INFO", "[$ci] *** RTPEngine INVITE OFFER\n");
rtpengine_offer("$var(rtpengine_offerflags)");
}
}
# swap internal and external interfaces
route(swap);
# relay the request
t_on_reply("transaction");
if (is_method("CANCEL")) {
# preserve the received Reason header
t_relay("8");
}
else {
t_relay();
}
xlog("L_NOTICE", "[$ci] sequential_post: $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
}
route[reply] {
if (is_method("INVITE") && $rs=~"(180)|(183)|(200)|(202)") {
#xlog("L_INFO", "[$ci] cross-check: Contact $ct\n");
#xlog("L_INFO", "[$ci] cross-check: To user \"$tU\" in \"$tu\"\n");
#xlog("L_INFO", "[$ci] cross-check: Contact user \"$(ct.fields(uri){uri.user})\" in \"$ct.fields(uri)\"\n");
if($(ct.fields(uri){uri.user}) != null && $tU != $(ct.fields(uri){uri.user})) {
xlog("L_WARN", "[$ci] cross-check: to-user \"$tU\" in \"$tu\" does not match contact-user \"$(ct.fields(uri){uri.user})\" in \"$ct.fields(uri)\"\n");
}
}
# swap internal and external interfaces
route(swap);
# test for issue with natted uc client
# fix contact header for natted clients
#if (client_nat_test("3")) {
if (client_nat_test("9")) {
fix_contact();
}
# maniulate sdp
route(sdp);
}
onreply_route[transaction] {
# NOTE: when default onreply_route is present in addition to transaction onreply routes, *both* will be called.
xlog("L_NOTICE", "[$ci] onreply_transaction_pre: $rs to $rm $ru <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
route(transcode);
if (isflagset(transcode)) {
if (is_method("INVITE")) {
#if (t_check_status("1[0-9][0-9]")) {
if (t_check_status("(180)|(183)|(200)|(202)") && has_body("application/sdp")) {
if (!isflagset(early)) {
xlog("L_INFO", "[$ci] *** RTPEngine OK OFFER\n");
rtpengine_offer("$var(rtpengine_offerflags)");
}
else {
xlog("L_INFO", "[$ci] *** RTPEngine OK ANSWER\n");
rtpengine_answer("$var(rtpengine_branchflags)");
}
}
else if (t_check_status("[3-9][0-9][0-9]")) {
# currently disabling delete on failure until support for forking is
# configured since this delete will tear down the active branch too.
xlog("L_INFO", "[$ci] *** RTPEngine REJECT DELETE\n");
rtpengine_delete("$var(rtpengine_branchflags)");
}
}
}
route(reply);
xlog("L_NOTICE", "[$ci] onreply_transaction_post: $rs to $rm <$pr:$si:$sp -> $Ri:$Rp | $fs -> $dd:$dp> (from: $fu, to: $tu, from-tag: $ft)\n");
}
local.cfg
# listing external udp interface first so that it is used by mid-registrar
# when relaying registrations to the main registrar
# external interfaces
listen=udp:1.1.1.1:5060
listen=tcp:1.1.1.1:5060
listen=udp:1.1.1.1:7080
listen=tcp:1.1.1.1:7080
# internal interfaces
listen=udp:1.1.1.1:6050
listen=tcp:1.1.1.1:6050
listen=udp:1.1.1.1:8070
listen=tcp:1.1.1.1:8070
route[swap] {
# if message received on external port then switch to internal port
# if message received on internal port then switch to external port
# if message is destined for udp then switch to udp
# if message is destined for tcp then switch to tcp
if ($fs != NULL) {
xlog( "L_INFO","[$ci] swap: send socket already set to $fs, not swapping $pr:$Ri:$Rp\n" );
return;
}
# swap external and internal interfaces
if ($Rp == 5060) {
# if ruri is using tcp then force tcp in send socket
if ($rP == "tcp") {
xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for tcp:1.1.1.1:6050\n" );
force_send_socket(tcp:1.1.1.1:6050);
}
else if ($rP == "udp") {
xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for udp:1.1.1.1:6050\n" );
force_send_socket(udp:1.1.1.1:6050);
}
else {
xlog( "L_INFO","[$ci] swap: unknown r-uri proto, not swapping $pr:$Ri:$Rp\n" );
}
}
else if ($Rp == 6050) {
# if ruri is using tcp then force tcp in send socket
if ($rP == "tcp") {
xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for tcp:1.1.1.1:5060\n" );
force_send_socket(tcp:1.1.1.1:5060);
}
else if ($rP == "udp") {
xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for udp:1.1.1.1:5060\n" );
force_send_socket(udp:1.1.1.1:5060);
}
else {
xlog( "L_INFO","[$ci] swap: unknown r-uri proto, not swapping $pr:$Ri:$Rp\n" );
}
}
else if ($Rp == 7080) {
# if ruri is using tcp then force tcp in send socket
if ($rP == "tcp") {
xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for tcp:1.1.1.1:8070\n" );
force_send_socket(tcp:1.1.1.1:8070);
}
else if ($rP == "udp") {
xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for udp:1.1.1.1:8070\n" );
force_send_socket(udp:1.1.1.1:8070);
}
else {
xlog( "L_INFO","[$ci] swap: unknown r-uri proto, not swapping $pr:$Ri:$Rp\n" );
}
}
else if ($Rp == 8070) {
# if ruri is using tcp then force tcp in send socket
if ($rP == "tcp") {
xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for tcp:1.1.1.1:7080\n" );
force_send_socket(tcp:1.1.1.1:7080);
}
else if ($rP == "udp") {
xlog( "L_INFO","[$ci] swap: swapping $pr:$Ri:$Rp for udp:1.1.1.1:7080\n" );
force_send_socket(udp:1.1.1.1:7080);
}
else {
xlog( "L_INFO","[$ci] swap: unknown r-uri proto, not swapping $pr:$Ri:$Rp\n" );
}
}
}
I have a feeling that I'm really close to figure it out. Have logs and coredump when contact_body_t allocated in PKG mem in one worker and then freed in another what crashed it. It has crashed because I have mem debug and q_malloc detected broken memory block. Without memory debug I suppose it may either crash or corrupt PKG memory. That pkg_malloc was called from dlg_update_contact()->parse_contact() during in-dialog CANCEL processing. Actually opensips allocated it in one worker when processed CANCEL and then tried to free when received 200OK for INVITE (the same transaction) in another worker.
To make it clear want to emphasize that in my case crash happens only when UAC sends reINVITE, received 100 Trying, and sends CANCEL immediately. While this "CANCEL" travels across the network, opensips has enough time to forward reINVITE and than it receives CANCEL from UAC and 200 OK for reINVITE from UAS a few moments later.
UAC proxy UAS
| ---reINVITE-> | |
| <-----100--- | |
| | ---reINVITE, CSeq: 5 INVITE-> |
| ---CANCEL-> | |
| | <--- 200 OK, CSeq: 5 INVITE --- |
| [ bang ] |
The patch attached fixed it for me. At least the problem I was able to reproduce with SIPP.
diff --git a/modules/dialog/dlg_handlers.c b/modules/dialog/dlg_handlers.c
index 8afca4601..4930dd54b 100644
--- a/modules/dialog/dlg_handlers.c
+++ b/modules/dialog/dlg_handlers.c
@@ -1259,6 +1259,11 @@ static void dlg_update_caller_contact(struct cell* t, int type, struct tmcb_para
return;
}
+ if (msg->msg_flags&FL_SHM_CLONE) {
+ LM_ERR("skip SHM allocated msg\n",);
+ return;
+ }
+
dlg_update_contact(dlg, msg, DLG_CALLER_LEG);
}
@@ -1275,6 +1280,11 @@ static void dlg_update_callee_contact(struct cell* t, int type, struct tmcb_para
return;
}
+ if (msg->msg_flags&FL_SHM_CLONE) {
+ LM_ERR("skip SHM allocated msg\n",);
+ return;
+ }
+
dlg_update_contact(dlg, msg, callee_idx(dlg));
}
Unfortunately, it looks like that was not the only crash related to dialog module in 2.4.7
Got a new crash:
#0 context_get_ptr (type=CONTEXT_DIALOG, ctx=0xfa, pos=0) at ../../context.h:179
#1 0x00007f476306a1be in dlg_ctx_get_ptr (dlg=0x12, pos=0) at dlg_ctx.c:65
#2 0x00007f47634eee01 in free_acc_ctx (ctx=0x7f476e3af640) at acc_logic.c:205
#3 0x00007f47634f37c8 in unref_acc_ctx (ctx=0x7f476e3af640) at acc_logic.c:1287
#4 0x00007f47691a3c54 in empty_tmcb_list (head=0x7f476d64a3d8) at t_hooks.c:53
#5 0x00007f4769188a51 in free_cell (dead_cell=0x7f476d64a368) at h_table.c:127
#6 0x00007f47691b6885 in delete_cell (p_cell=0x7f476d64a368, unlock=1) at timer.c:239
#7 0x00007f47691b750a in wait_handler (wait_tl=0x7f476d64a3e8) at timer.c:458
#8 0x00007f47691b94b8 in timer_routine (ticks=135779, set=0x1) at timer.c:1091
#9 0x00000000004a703e in handle_timer_job () at timer.c:731
#10 0x000000000054806c in handle_io (fm=0x7f47799bf0a8, idx=3, event_type=1) at net/net_udp.c:265
#11 0x0000000000546983 in io_wait_loop_epoll (h=0x9014e0 <_worker_io>, t=1, repeat=0) at net/../io_wait_loop.h:280
#12 0x000000000054896b in udp_start_processes (chd_rank=0x80371c <chd_rank>, startup_done=0x7f476bd40b70) at net/net_udp.c:389
#13 0x000000000043d0c5 in main_loop () at main.c:784
#14 0x000000000043fde7 in main (argc=13, argv=0x7ffe1e40b5f8) at main.c:1442
@vitalikvoip Curious did your patch address the crash in free_contacts() during transaction clean-up, or only the dialog-related crash?
My patch addresses crash I had in dialog module when had described call flow with CANCEL from caller and 200 OK from callee. Had another one with transaction clean up but it's still related to dialog because crashes when cleans dialog context created by acc module. I don't actually need acc at the moment, but still wanted to fix it since I've already dived deep into this part of opensips, but couldn't reproduce it manually and my production proxies stay stable since last week. Will try to analyze when it happens again.
It's been working without any issues for 5 days. @attermann if your opensips still crashes and you see in coredumps it happens around free_contacts() I think it worth trying my patch.
I suspect that there is a TMCB_RESPONSE_OUT (type 128) TM callback (registered by other module) that is parsing the Contact hdr of the request that was cloned in SHM memory. Now, if you have high load, you may have SIP retransmissions, so you may end up with two 200 reply's handled in different procs (in the same time), both trying to "parse" + "use" + "free" the contact hdr over the same sip msg structure (the cloned, shared request that is saved into transaction). Now, the TMCB_RESPONSE_OUT callbacks are executed under lock, so maybe there is some part of the code (outside that callback) that tries to parse the Contact....and overlaps with the callback There are a couple of modules using the TMCB_RESPONSE_OUT callback and potentially parsing the Contact hdr, so let me try to narrow down the possibilities...
After applying the patch I've posted above I had no more crashes. So maybe in my case it was only due to CANCEL handling since CANCEL is not end-2-end request and dlg update functions dealt with the cloned sip msg. Haven't thought about retransmissions from this angle. Your assumption makes perfect sense and most likely retransmission maybe be another corner case.
You suggest that the concurrent access+parsing+free over the request stored in shmem happens between (1) the 200 OK reply and (2) the CANCEL ?
And your patched eliminated this concurrency by not doing any parsing for the 200 OK ?
The problem was not in concurrency. Back then I couldn't understand why it crashes and the described case was the only one I knew to reproduce the crash. As far as I understood all parsing functions are supposed to be called only on PKG allocated sip_msg received from a wire and before it is cloned to SHM by tm module. Normally, when tm executes TMCB_REQUEST_BUILT callbacks it provides either PKG allocated sip_msg that was read before or faked_req (do_dns_failover) which is cleaned just after usage in the same process. The CANCEL request is an exception since t_cancel.c:cancel_branch() provides an SHM allocated sip_msg (t->uas.request) to callbacks what makes any parsing dangerous. When it happens, one process handles CANCEL request and parses its contact allocating some info in PKG mem and attaching it to SHM allocated sip_msg. Then after a while another process (for example timer) may try to read that data and dereference those PKG pointers and reads some random unexpected data or writes something that leads to memory corruption.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Cause I have a lot of environment running the latest 3.1 from rpm packages, I can take some coredump info to provide more info if needed.
Let me know if this will help.
I have it running almost a month without any issues while had at least one crash day before applied the patch.
I’ll try your patch to see if it will fix for me too.
I’ll try your patch to see if it will fix for me too.
You've mentioned you are using opensips 3.1. I've tested my patch only with 2.4.7 and don't if this part is still the same in 3.1. Anyhow give it a chance and I can take a look at your coredumps closely when have a spare moment.
I’m able to locate the block of code on 3.1 where i need to apply your patch. Today I’ll try to see if it fix for me on the latest 3.1.
@Fernandojdk , did you have a chance to check it with 3.1?
Hello @vitalikvoip
I applied your patch on opensips 3.1, but the bug continued when we are under high load. For 3.1 this patch did not work. Unfortunately. The frequency at which the bug happens has decreased, but still here.
Hey guys,
@razvancrainea came up with the following patch for 2.4
that should address the issue (PKG-parsed contacts intermingling with a SHM-stored SIP message), would you mind giving it a try? We are undergoing some testing as well on it, as of lately.
git apply <(base64 -d <<EOF | gzip -dc
H4sIAMyQP18AA7VVa0/bSBT97PyKS1ZQx688SKEtBUEJUKQ0XSHYL93KMvbEtmrPuJ5x22jhv++9
fuRVAlRqpcSex52Zc849cx3E0ynYdhgr8LqpCIqEyW4Qe4kIu0ESupHHg4Tl0vHh9vH5VswD9gN2
9wZs1/cGvuP02WB/Ohh6AfR7vb3hsGXb9lOntEzTfPKk42OwXw6tPTDxuQ/Hxy34K+Z+UgQM2o7T
xV/gKc9NijRzovbPs5mXS5ZXL1eJp2N8yb5SlLkhyhdceb5q3o9t2ITWGz+9oArM8+ejJH36g94r
q48SYeO1NSCVQCpPxT7EPIk5w5cCErbIUKw5El2qvPCrGZ8lCRjYsqAelXHmpjIEAx9WC/5rgYYz
UK89wK4feTkYKs0OWqZGR+RMwSH0qFuHuQoMnwYnN+MxjuOqeAr6Fu5pH9UxsLODw5qmV8wi5gWY
e52OhfejK/f04+T65PTaPbeg14G30IO7O1jZodMp4WnjD+7o3YXevpz8c3l9BiKHm79HJ9j6HqtI
FOj7Bj7YwIWCUo+Yh1v/8naHKGlIocg5cbC1e2CJRPHWAdtHJdIAgcNKavUVUASVgNklsMvJ+ccN
yOA2F18YfxybvYzN1O5J8k3A6FRTa3KA6q8G3YpgRntoKo9TN0G98Rh9p8FdTtHeK9zkPMCCnSV2
GPs76NE2y/xKgksMCHeRx+XU/DhM9hs4FwUPoOB1Tpoln7YdQ37Gza1myEkYX3RkeWid4jW5dL2x
L0lFHu48KPN8RNbgAGEDpuq3bEXXMhH+FxcbOv5Lg1JesG0fJSyUn/Dx2Vkwqm5B1yDLqoiB9FIG
86wFgr/Ayy7A4zMVofRgdOv6MXhpvaLysbtv/Znqgdml6sH9NNsIfyk3KznrwOFh5TXQFnf8RhKD
dZbwltJ+BFO0YPVBge0MBAc8CrYDskMFRnvYFBaUJDC6unJlEgq+lgZ73axaKJQAxoOyhtzTo2Q5
TbxQwt0hjMYX7vn45MI9fX8yuTgbHSyXqyuWJZ5PdEQS/AqTKnm7e3XxH+5b/dd/KH1Y5dHNMkrd
1EtQDn05QY0xtzBqUYrPrq70Nl11xkURRpCyVOSzkhBn3xuei/K2SepaabtfSl19Y+rOsuz31ZV5
0Fq4ovxMbYxAFlRkFj2KRcB+Nvt1v9LauU1L0RdZRbzPtOgzDGrWapAYqMKbumg8IGKtYakaZYoK
+DRnK9VdNTtSJL6QBlWzxk3fRBys+Ah9wPJ1O5GVDMBqQ65Ts4zNLaVS/9bFauelEoxMdlr/AycT
XeKQCgAA
EOF
)
LE: I have edited the above patch, do not use the version included in the email! :)
Hello guys,
I've changed the code posted by @liviuchircu and ported to 3.1 to see if it will fix this issue.
On the next week i can tell if this fix works.
We've tested this in several environments and crashes stopped to happened, hence I believe this issue is fixed. If you're still facing this problem, please re-open the ticket with further info.
Hello guys.
Just for feedback. I already adapted the above fix to work with 3.1 and until now we have no crash anymore. So i think this issue is fixed.
Thanks all 🥇
Hi @liviuchircu. I was able to apply your patch above to 2.4.7 and will be testing it in production over the next week. I also tried applying it to 2.4.8 but it failed. Is this patch already in 2.4.8?
Yes it is, @razvancrainea pushed the same fix in df68e3b19
OpenSIPS version you are running
Describe the bug
Opensips crashes randomly under load. Sometime multiple times a day, and other times a day or more between crashes.
Some backtraces show slight variations in the call stack but always eventually through run_trans_callbacks_locked() and segfaults in free_contacts ().
To Reproduce
Unable to reproduce. Happens randomly in production under load.
Expected behavior
No segfault!
Relevant System Logs
OS/environment information
Additional context
Seems to be the same issue as reported mid-March in OpenSIPS-Users "Re: segfault at the opensips 2.4.5 (Liviu Chircu) (Johan De Clercq)".