OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org
Other
1.28k stars 581 forks source link

[CRASH] segfault 2.4.7 every ~ 30-40 min if active calls is about 170 and high #2035

Closed olegans1972 closed 4 years ago

olegans1972 commented 4 years ago

OpenSIPS version you are running

version: opensips 2.4.7 (x86_64/linux) flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535 poll method support: poll, epoll, sigio_rt, select. git revision: 597f81b main.c compiled on 10:10:43 Mar 13 2020 with gcc 4.8.5

Crash Core Dump

Describe the traffic that generated the bug

To Reproduce

Relevant System Logs

OS/environment information

Additional context

liviuchircu commented 4 years ago

Please follow this guide and obtain a proper corefile and gdb backtrace. Alternatively, please provide some steps to reproduce this issue and an example opensips.cfg.

There is literally 0 information to work with right now.

olegans1972 commented 4 years ago

Hi, I saw this guide. I didn't do only one thing "ulimit -c unlimited" Could you explain where I can insert this command if I do not have (init.d) at all? I have Centos7

liviuchircu commented 4 years ago

Just make sure you call ulimit -c unlimited before the opensips start command (e.g. opensips -f /etc/opensips.cfg -m 256 -M 32). The initscript would do this exact same thing.

l2dy commented 4 years ago

Hi, I saw this guide. I didn't do only one thing "ulimit -c unlimited" Could you explain where I can insert this command if I do not have (init.d) at all? I have Centos7

Try this: https://unix.stackexchange.com/a/345596.

olegans1972 commented 4 years ago

Hi, I saw this guide. I didn't do only one thing "ulimit -c unlimited" Could you explain where I can insert this command if I do not have (init.d) at all? I have Centos7

Try this: https://unix.stackexchange.com/a/345596.

Thank you!

olegans1972 commented 4 years ago

opensips.txt It's my config. As soon as the core file is ready, I'll put it

olegans1972 commented 4 years ago

So, I have 14 core.files. Do you need back traсe earch of them?

First two files:

core.20429

(gdb) bt full

0 get_httpd_cb (url=url@entry=0x2234fa4 "/json/get_statistics") at httpd_proc.c:177

    url_len = 20
    index = 1
    cb = 0x7f471a75a590
    http_root = 0x0
    __FUNCTION__ = "get_httpd_cb"

1 0x00007f4712c4404a in answer_to_connection (cls=0x0, connection=0x2234e20, url=0x2234fa4 "/json/get_statistics", method=, version=0x2234fc4 "HTTP/1.1", upload_data=0x0,

upload_data_size=0x7ffe7d373b90, con_cls=0x2234e78) at httpd_proc.c:647
    page = {s = 0x0, len = 0}
    response = <optimized out>
    ret = <optimized out>
    cb = 0x0
    normalised_url = <optimized out>
    pr = 0x0
    kv = <optimized out>
    p = <optimized out>
    ret_code = 200
    sv_sockfd = <optimized out>
    addrlen = 16
    cl_socket = 0x220c7d0
    __FUNCTION__ = "answer_to_connection"

2 0x00007f4712a2d549 in call_connection_handler () from /lib64/libmicrohttpd.so.10

No symbol table info available.

3 0x00007f4712a2eab0 in MHD_connection_handle_idle () from /lib64/libmicrohttpd.so.10

No symbol table info available.

4 0x00007f4712a33f49 in MHD_run_from_select () from /lib64/libmicrohttpd.so.10

No symbol table info available.

5 0x00007f4712a341e9 in MHD_select () from /lib64/libmicrohttpd.so.10

No symbol table info available.

6 0x00007f4712a33d8b in MHD_run () from /lib64/libmicrohttpd.so.10

No symbol table info available.

7 0x00007f4712c45008 in httpd_proc (rank=) at httpd_proc.c:808

    status = <optimized out>
    rs = {__fds_bits = {1024, 0 <repeats 15 times>}}
    ws = {__fds_bits = {0 <repeats 16 times>}}
    es = {__fds_bits = {0 <repeats 16 times>}}
    max = 10
    cb = <optimized out>
    __FUNCTION__ = "httpd_proc"
    tv = {tv_sec = 0, tv_usec = 868327}
    saddr_in = {sin_family = 2, sin_port = 47138, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}

8 0x00000000004a03d8 in start_module_procs ()

No symbol table info available.

9 0x000000000041b5d4 in main ()

No symbol table info available.

core.20432

(gdb) bt full

0 0x000000000041445b in utimer_ticker ()

No symbol table info available.

1 0x00000000004778dd in start_timer_processes ()

No symbol table info available.

2 0x000000000041b81f in main ()

No symbol table info available. (gdb)

olegans1972 commented 4 years ago

core.20433

(gdb) bt full

0 0x0000000000436f61 in receive_msg ()

No symbol table info available.

1 0x0000000000536bb0 in udp_read_req ()

No symbol table info available.

2 0x00000000005218ca in udp_start_processes ()

No symbol table info available.

3 0x000000000041b8a5 in main ()

No symbol table info available. (gdb)

liviuchircu commented 4 years ago

In core.20429, could you please provide the output of the following commands:

(gdb) frame 0
(gdb) print *cb
(gdb) print cb->http_root

Thanks!

olegans1972 commented 4 years ago

(gdb) frame 0

0 get_httpd_cb (url=url@entry=0x2234fa4 "/json/get_statistics") at httpd_proc.c:177

177 in httpd_proc.c (gdb)

0 get_httpd_cb (url=url@entry=0x2234fa4 "/json/get_statistics") at httpd_proc.c:177

177 in httpd_proc.c (gdb) print cb $1 = {module = 0x1100000000000000 <Address 0x1100000000000000 out of bounds>, http_root = 0x0, callback = 0xf800000000000000, flush_data_callback = 0x7f472a7667, init_proc_callback = 0x900000000000000, type = HTTPD_STD_CNT_TYPE, next = 0x500007f471adb61} (gdb) print cb->http_root $2 = (str ) 0x0 (gdb)

liviuchircu commented 4 years ago
olegans1972 commented 4 years ago

var/log/opensips.log:

Mar 16 15:21:38 kv-spx-1 /usr/sbin/opensips[20450]: failure_route[VMS_FAILOVER],1584372098620709|F693FDD02FD148ED13ED3997@0770ffffffff|Rcv|INVITE|sctp|10.93.137.170:54949|10.161.20.225:5060|sip.from=9647748752;sip.to=9282000026, Failed FS detected Mar 16 15:21:38 kv-spx-1 /usr/sbin/opensips[20451]: failure_route[VMS_FAILOVER],1584372098620713|F25893B4B2B225F7EA59B097@0c70ffffffff|Rcv|INVITE|sctp|10.5.145.170:59617|10.161.20.225:5060|sip.from=9884568831;sip.to=9282000026, F ailed FS detected Mar 16 15:21:38 kv-spx-1 /usr/sbin/opensips[20450]: failure_route[VMS_FAILOVER],1584372098621038|7FFFAB8B693F211E0BFECB17@1670ffffffff|Rcv|INVITE|sctp|10.161.30.170:44302|10.161.20.225:5060|sip.from=+79268737653;sip.to=9282000026 , Failed FS detected Mar 16 15:21:38 kv-spx-1 /usr/sbin/opensips[20450]: failure_route[VMS_FAILOVER],1584372098621140|D2D0048BA26D159FA2EB0A17@0670ffffffff|Rcv|INVITE|sctp|10.9.212.170:55059|10.161.20.225:5060|sip.from=4951346605;sip.to=9282000026, F ailed FS detected Mar 16 15:21:38 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1803740 ms), it may overlap.. Mar 16 15:21:38 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1803740 ms), it may overlap.. Mar 16 15:21:38 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1803740 ms), it may overlap.. Mar 16 15:21:38 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803740 ms (now 1803840 ms), it may overlap.. Mar 16 15:21:38 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1803840 ms), it may overlap.. Mar 16 15:21:38 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1803840 ms), it may overlap.. Mar 16 15:21:38 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1803840 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803740 ms (now 1803940 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1803940 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1803940 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1803940 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803740 ms (now 1804040 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1804040 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1804040 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1804040 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803740 ms (now 1804140 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1804140 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1804140 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 1804140 ms), it may overlap.. Mar 16 15:21:39 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803740 ms (now 1804240 ms), it may overlap.. . . . Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 2074200 ms), it may overlap.. Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 2074200 ms), it may overlap.. Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803740 ms (now 2074300 ms), it may overlap.. Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 2074300 ms), it may overlap.. Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 2074300 ms), it may overlap.. Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20432]: WARNING:core:utimer_ticker: utimer task already scheduled for 1803640 ms (now 2074300 ms), it may overlap.. Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20441]: ERROR:proto_sctp:proto_sctp_send: sctp_sendmsg(sock,0x7f472a766798,361,0x7f471ab05070,16,0,0,0,0,0): Broken pipe(32) Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20441]: ERROR:tm:msg_send: send() to 10.93.137.170:5060 for proto sctp/4 failed Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20433]: WARNING:core:handle_timer_job: timer job has a 269770000 us delay in execution Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20434]: WARNING:core:handle_timer_job: utimer job has a 270560000 us delay in execution Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20443]: WARNING:core:handle_timer_job: timer job has a 269770000 us delay in execution Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20451]: failure_route[VMS_FAILOVER],1584372369389815|033B795684DB42293F3FB117@0970ffffffff|Rcv|INVITE|sctp|10.6.103.242:35993|10.161.20.225:5060|sip.from=9188183024;sip.to=9282000026, F ailed FS detected Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20451]: failure_route[VMS_FAILOVER],1584372369390667|88A31C6E6E23812C301CD197@1870ffffffff|Rcv|INVITE|sctp|10.5.145.170:59617|10.161.20.225:5060|sip.from=9640228920;sip.to=9282000026, F ailed FS detected Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20443]: WARNING:core:handle_timer_job: timer job has a 269770000 us delay in execution Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20436]: WARNING:core:handle_timer_job: timer job has a 269770000 us delay in execution Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20439]: WARNING:core:handle_timer_job: timer job has a 269770000 us delay in execution Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20441]: ERROR:rest_client:resume_async_http_req: connected, but transfer timed out (20s) Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20436]: CRITICAL:tm:set_timer: set_timer for 1 list called on a "detached" timer -- ignoring: 0x7f471ad7ab58 Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20441]: 1584372369392056|DD351F552825E693A0331217@1070ffffffff|Rcv|HTTPResp|tcp|127.0.0.1:28080|| -3 Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20434]: CRITICAL:tm:set_timer: set_timer for 1 list called on a "detached" timer -- ignoring: 0x7f471ad7ab58 Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20451]: failure_route[VMS_FAILOVER],1584372369392346|F4088A55FA84831F9CA0A397@0670ffffffff|Rcv|INVITE|sctp|10.68.248.170:52636|10.161.20.225:5060|sip.from=9654074694;sip.to=9282000026, Failed FS detected Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20441]: ACC: call missed: timestamp=1584372369;created=1584372094;setuptime=275;method=INVITE;from_tag=Bg17Z1g7A.Cdj.a5;to_tag=;call_id=DD351F552825E693A0331217@1070ffffffff;code=500;reason=Server Internal Error;ip.src=10.95.137.226;ip.dst=;ip.sdp=10.95.135.74;sip.from=9899274319;sip.to=9282000026;isup.reReson=6;isup.rePN=9287401139;isup.4xx_fs.cause=;isup.bye_mss.cause=;sip.bye_fs.reason=;spx.responce=;spx.req={"X-Orig-CdPN":"9202391120","X-Orig-CdPN-NAI":3,"X-Orig-Re-Reason":0,"X-Re-Reason":6,"X-RePN":"9202391120","X-RePN-NAI":3,"X-CdPN":"9282000026#","X-CdPN-NAI":3,"X-CgPN":"9261261271","X-CgPN-NAI":3,"X-CgPN-APRI":0,"X-GN":"79994939579","X-GN-NAI":4} Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20434]: CRITICAL:tm:set_timer: set_timer for 1 list called on a "detached" timer -- ignoring: 0x7f471a885038 Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20451]: failure_route[VMS_FAILOVER],1584372369392608|361B10D577101C4917313B17@0870ffffffff|Rcv|INVITE|sctp|10.93.137.170:54949|10.161.20.225:5060|sip.from=+79183905630;sip.to=9282000026, Failed FS detected Mar 16 15:26:09 kv-spx-1 /usr/sbin/opensips[20433]: CRITICAL:tm:set_timer: set_timer for 1 list called on a "detached" timer -- ignoring: 0x7f471a885038

olegans1972 commented 4 years ago

We use 256 MB shared memory

olegans1972 commented 4 years ago

real_used_size_kv_spx_1

olegans1972 commented 4 years ago

This is real_used_size before and after alarm. The alarm was at (15:26(UTC) + 3)

stale[bot] commented 4 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

olegans1972 commented 4 years ago

Which update? I attached all the information that was requested from me.

stale[bot] commented 4 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

olegans1972 commented 4 years ago

I'm waiting for fix.

olegans1972 commented 4 years ago

Hello @liviuchircu! Tell me please! When can we expect a new release 2.4.8? Is it possible to make some patch for 2.4.7 on this problem?

liviuchircu commented 4 years ago

hey, @olegans1972 ! This is a generic shared memory corruption problem, since your monitoring shows you have enough SHM (the crash takes place below 10% usage). To troubleshoot it, I need to understand what kind of code is called and how the corruption can happen. Please provide the following:

Many thanks and apologies for the dead time. We are heavily pushing with 3.1 development these days and I didn't prioritize my GitHub issues so much. Getting back to them now :)

olegans1972 commented 4 years ago

Hey @liviuchircu ! I attached my opensips.cfg when I opened this issue) it was 16.03

opensips.txt It's my config. As soon as the core file is ready, I'll put it

olegans1972 commented 4 years ago

HI @liviuchircu ! It is very difficult to catch, to reproduce this situation. I tried to run a typical script on a test system. The call is as follows: SippUAC -> Opensips <---> HttpServerRequest (Responce) -> SippUAS. The script ran with the following parameters 25,000 calls 100 of simultaneous calls and 100 calls per seconds. One call lasts 2 seconds. I see how shared memory grows. If you put more than 30,000 calls, then segafault occurs because shmem memory runs out

It seemed strange to me that 10 minutes after the end of the test, the memory in top is still large. Perhaps my problem is that the memory is gradually accumulating. It is strange that on the prod system I do not see an increase in memory consumption. But on the test this happens. I started my test the RES at TOP was 2244. After test it was 185432, After 30 minutes The RES was almost the same. This is my sipp UAC stress_uac_opensips_no_rtp_cf_1_1_isup.txt

olegans1972 commented 4 years ago

Hello @liviuchircu!

I checked again, the memory does not leak. There was an error in the SIPP script

liviuchircu commented 4 years ago

Cheers, @olegans1972 - do let me know if you find a way to reproduce the crash and I will resume the work here. Otherwise, the opensips.txt in-depth analysis will have to wait for the beta testing phase of 3.1, when I will be able to dedicate a good half of day's work for it... unless someone else chimes in and offers to help in the meantime, of course :)

olegans1972 commented 4 years ago

Hi! @liviuchircu! I can’t reproduce the problem on the lab We decided to switch to version 3.0.2. And try to solve all the problems on the new version. I have already opened a task https://github.com/OpenSIPS/opensips/issues/2105

This problem is not related to segfault, but as it seems to us, it could have contributed to it.

liviuchircu commented 4 years ago

This is interesting! So you're saying that rest_client works 100% fine on 2.4.x, with no async(rest_post()) occasional timeouts?

olegans1972 commented 4 years ago

Sometimes I saw the same problem. But at the 2.4.7 we use default timeout 20 sec. Probably the buffer could fill up and this led to segfault so now we tried to limit it with a timer

stale[bot] commented 4 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

olegans1972 commented 4 years ago

We have switched on all our servers to version 3.0.2 and continue observations there

stale[bot] commented 4 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

stale[bot] commented 4 years ago

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.