OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org
Other
1.28k stars 580 forks source link

[CRASH] PgSQL driver with remote database connection #2114

Closed volga629-1 closed 4 years ago

volga629-1 commented 4 years ago

Hello Everyone, Postgresql with dispatcher and dispatcher segfault. Can't start opensips properly.

May 16 04:18:54 aitossbc01 systemd[1]: systemd-coredump@2605-247697-0.service: Consumed 1.870s CPU time.
May 16 04:18:55 aitossbc01 kernel: opensips[247424]: segfault at 0 ip 00007f5faa41c487 sp 00007ffff2860da0 error 4 in dispatcher.so[7f5faa40b000+1a000]
May 16 04:18:55 aitossbc01 kernel: Code: 00 48 c7 44 24 70 00 00 00 00 48 c7 84 24 90 00 00 00 04 00 00 00 48 c7 44 24 40 00 00 00 00 0f 1f 40 00 49 8b 85 c8 00 00 00 <4c> 8b 20 89 e8 41 87 04 24 bb 00 04 00 00 85 c0 75 14 eb 25 0f 1f
opensips -V
version: opensips 3.1.0-dev (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, CC_O0, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
main.c compiled on 00:00:00 Jan 21 2020 with gcc 9
ccpp-2020-05-15-16:34:11.84668-218623]# gdb coredump
GNU gdb (GDB) Fedora 8.3.50.20190824-30.fc31
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
[New LWP 218623]
[New LWP 218751]
[New LWP 218763]
Reading symbols from /usr/sbin/opensips...
Reading symbols from /usr/lib/debug/usr/sbin/opensips-3.1.0-4.fc31.x86_64.debug...

warning: Ignoring non-absolute filename: <linux-vdso.so.1>
Missing separate debuginfo for linux-vdso.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/31/089c6abd0e7673e62e23adb494c06a60e4884c
Missing separate debuginfo for /usr/lib64/opensips/modules/tls_mgm.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/07/ef0046b7e8a9fc1baae67d0071a1af6fc72b6d
Missing separate debuginfo for /usr/lib64/opensips/modules/cachedb_cassandra.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/48/f5c8e99d1fee76bc21b280f6f81108e17960c1
Missing separate debuginfo for /lib64/libstdc++.so.6
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/14/55764bf748129dd38a0d7ed9e3a8f4c0261c65
Missing separate debuginfo for /usr/lib64/opensips/modules/httpd.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/c0/c34180a55fddf66581287a7bc4ba6e42ce13b5
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /usr/lib64/opensips/modules/tls_mgm.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/07/ef0046b7e8a9fc1baae67d0071a1af6fc72b6d.debug
Missing separate debuginfo for /usr/lib64/opensips/modules/cachedb_cassandra.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/48/f5c8e99d1fee76bc21b280f6f81108e17960c1.debug
Missing separate debuginfo for /lib64/libstdc++.so.6
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/14/55764bf748129dd38a0d7ed9e3a8f4c0261c65.debug
Missing separate debuginfo for /usr/lib64/opensips/modules/httpd.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/c0/c34180a55fddf66581287a7bc4ba6e42ce13b5.debug
Core was generated by `/usr/sbin/opensips -P /run/opensips/opensips.pid -f /etc/opensips/opensips.cfg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f62c368bc6c in ds_update_weights (ticks=<optimized out>, param=<optimized out>) at dispatch.c:2403
2403            lock_start_write(part->lock);
[Current thread is 1 (Thread 0x7f62f8496b80 (LWP 218623))]
Missing separate debuginfos, use: dnf debuginfo-install libunistring-0.9.10-6.fc31.x86_64
(gdb) bt
#0  0x00007f62c368bc6c in ds_update_weights (ticks=<optimized out>, param=<optimized out>) at dispatch.c:2403
#1  0x000055a0aa8ed27a in handle_timer_job () at timer.c:868
#2  0x000055a0aa9dd325 in handle_io (idx=<optimized out>, event_type=<optimized out>, fm=<optimized out>) at net/net_udp.c:276
#3  io_wait_loop_epoll (repeat=0, t=1, h=<optimized out>) at net/../io_wait_loop.h:308
#4  0x000055a0aa9e2605 in udp_start_processes (chd_rank=<optimized out>, startup_done=<optimized out>) at net/net_udp.c:497
#5  0x000055a0aa884589 in main_loop () at main.c:798
#6  main (argc=<optimized out>, argv=<optimized out>) at main.c:1483
(gdb) bt full
#0  0x00007f62c368bc6c in ds_update_weights (ticks=<optimized out>, param=<optimized out>) at dispatch.c:2403
        part = 0x7f62c52ba278
        sp = <optimized out>
#1  0x000055a0aa8ed27a in handle_timer_job () at timer.c:868
        t = 0x7f62c52ba6d0
        l = <optimized out>
        __FUNCTION__ = "handle_timer_job"
#2  0x000055a0aa9dd325 in handle_io (idx=<optimized out>, event_type=<optimized out>, fm=<optimized out>) at net/net_udp.c:276
        n = <optimized out>
        read = 32610
        n = <optimized out>
        read = <optimized out>
        __FUNCTION__ = "handle_io"
#3  io_wait_loop_epoll (repeat=0, t=1, h=<optimized out>) at net/../io_wait_loop.h:308
        ret = <optimized out>
        n = <optimized out>
        r = 3
        i = <optimized out>
        e = 0x7f62f4d71530
        ep_event = {events = 2861655080, data = {ptr = 0xc52c650000005500, fd = 21760, u32 = 21760, u64 = 14207841975121499392}}
        fd = <optimized out>
        curr_time = <optimized out>
        __FUNCTION__ = "io_wait_loop_epoll"
#4  0x000055a0aa9e2605 in udp_start_processes (chd_rank=<optimized out>, startup_done=<optimized out>) at net/net_udp.c:497
        si = <optimized out>
        p_id = <optimized out>
        i = <optimized out>
        p = <optimized out>
        __FUNCTION__ = "udp_start_processes"
#5  0x000055a0aa884589 in main_loop () at main.c:798
        startup_done = 0x0
        last_check = 0
        rc = <optimized out>
        chd_rank = 3
        startup_done = <optimized out>
        last_check = <optimized out>
        rc = <optimized out>
        __FUNCTION__ = "main_loop"
#6  main (argc=<optimized out>, argv=<optimized out>) at main.c:1483
        c = <optimized out>
        r = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging-- 
        tmp = 0x7ffc154d1eb9 ""
        tmp_len = <optimized out>
        port = <optimized out>
        proto = <optimized out>
        protos_no = <optimized out>
        options = 0x55a0aaa2aec8 "f:cCm:M:b:l:n:N:rRvdDFEVhw:t:u:g:p:P:G:W:o:a:k:s:"
        seed = 614515474
        rfd = <optimized out>
        __FUNCTION__ = "main"
volga629-1 commented 4 years ago

That database connection error

May 15 17:45:35 aitossbc01 /usr/sbin/opensips[20073]: ERROR:db_postgres:db_postgres_new_connection: timeout expired
May 15 17:45:35 aitossbc01 /usr/sbin/opensips[20073]: ERROR:db_postgres:db_postgres_new_connection: cleaning up 0x7f5e8cfbe148=pkg_free()
May 15 17:45:35 aitossbc01 /usr/sbin/opensips[20073]: ERROR:core:db_do_init: could not add connection to the pool
volga629-1 commented 4 years ago

From db side I see

May 16 04:14:58 aitosdb01 pgpool[44518]: [156-1] 2020-05-16 04:14:58: pid 44518: LOG:  Terminate message from frontend.
May 16 04:14:58 aitosdb01 pgpool[44527]: [122-1] 2020-05-16 04:14:58: pid 44527: LOG:  new connection received
May 16 04:14:58 aitosdb01 pgpool[44527]: [122-2] 2020-05-16 04:14:58: pid 44527: DETAIL:  connecting host=10.100.101.4 port=40854
May 16 04:14:58 aitosdb01 pgpool[44527]: [123-1] 2020-05-16 04:14:58: pid 44527: LOG:  Query message from frontend.
May 16 04:14:58 aitosdb01 pgpool[44527]: [123-2] 2020-05-16 04:14:58: pid 44527: DETAIL:  query: "select table_version from version where table_name='dispatcher'"
May 16 04:14:58 aitosdb01 pgpool[44527]: [124-1] 2020-05-16 04:14:58: pid 44527: LOG:  Query message from frontend.
May 16 04:14:58 aitosdb01 pgpool[44527]: [124-2] 2020-05-16 04:14:58: pid 44527: DETAIL:  query: "select setid,destination,socket,weight,attrs,priority,description,state from dispatcher "
May 16 04:14:58 aitosdb01 pgpool[44527]: [125-1] 2020-05-16 04:14:58: pid 44527: LOG:  Terminate message from frontend.
May 16 04:14:58 aitosdb01 pgpool[44513]: [148-1] 2020-05-16 04:14:58: pid 44513: LOG:  new connection received
volga629-1 commented 4 years ago

After looking up more Seems like crash related to commit

https://github.com/OpenSIPS/opensips/commit/43c7d916d8a807d00021818f1a120284cb08b50f

ticket

https://github.com/OpenSIPS/opensips/issues/1911

volga629-1 commented 4 years ago
[root@aitossbc01 ccpp-2020-05-17-07:38:51.359396-505094]# gdb coredump
GNU gdb (GDB) Fedora 8.3.50.20190824-30.fc31
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
[New LWP 505094]
[New LWP 505290]
[New LWP 505274]
Reading symbols from /usr/sbin/opensips...
Reading symbols from /usr/lib/debug/usr/sbin/opensips-3.1.0-4.fc31.x86_64.debug...

warning: Ignoring non-absolute filename: <linux-vdso.so.1>
Missing separate debuginfo for linux-vdso.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/31/089c6abd0e7673e62e23adb494c06a60e4884c
Missing separate debuginfo for /usr/lib64/opensips/modules/tls_mgm.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/07/ef0046b7e8a9fc1baae67d0071a1af6fc72b6d
Missing separate debuginfo for /usr/lib64/opensips/modules/cachedb_cassandra.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/48/f5c8e99d1fee76bc21b280f6f81108e17960c1
Missing separate debuginfo for /lib64/libstdc++.so.6
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/14/55764bf748129dd38a0d7ed9e3a8f4c0261c65
Missing separate debuginfo for /usr/lib64/opensips/modules/httpd.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/c0/c34180a55fddf66581287a7bc4ba6e42ce13b5
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /usr/lib64/opensips/modules/tls_mgm.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/07/ef0046b7e8a9fc1baae67d0071a1af6fc72b6d.debug
Missing separate debuginfo for /usr/lib64/opensips/modules/cachedb_cassandra.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/48/f5c8e99d1fee76bc21b280f6f81108e17960c1.debug
Missing separate debuginfo for /lib64/libstdc++.so.6
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/14/55764bf748129dd38a0d7ed9e3a8f4c0261c65.debug
Missing separate debuginfo for /usr/lib64/opensips/modules/httpd.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/c0/c34180a55fddf66581287a7bc4ba6e42ce13b5.debug
Core was generated by `/usr/sbin/opensips -P /run/opensips/opensips.pid -f /etc/opensips/opensips.cfg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f8697ae8487 in ds_flusher_routine (ticks=<optimized out>, param=<optimized out>) at dispatch.c:747
747         lock_start_read( partition->lock );
[Current thread is 1 (Thread 0x7f86cc900b80 (LWP 505094))]
Missing separate debuginfos, use: dnf debuginfo-install libunistring-0.9.10-6.fc31.x86_64
(gdb) bt 
#0  0x00007f8697ae8487 in ds_flusher_routine (ticks=<optimized out>, param=<optimized out>) at dispatch.c:747
#1  0x000055a91eb1327a in handle_timer_job () at timer.c:868
#2  0x000055a91ec03325 in handle_io (idx=<optimized out>, event_type=<optimized out>, fm=<optimized out>) at net/net_udp.c:276
#3  io_wait_loop_epoll (repeat=0, t=1, h=<optimized out>) at net/../io_wait_loop.h:308
#4  0x000055a91ec08605 in udp_start_processes (chd_rank=<optimized out>, startup_done=<optimized out>) at net/net_udp.c:497
#5  0x000055a91eaaa589 in main_loop () at main.c:798
#6  main (argc=<optimized out>, argv=<optimized out>) at main.c:1483
(gdb) bt full 
#0  0x00007f8697ae8487 in ds_flusher_routine (ticks=<optimized out>, param=<optimized out>) at dispatch.c:747
        key_cmp = {0x7fffa362f350, 0x55a91ecc4a48}
        val_cmp = {{type = DB_INT, nul = 0, free = -1741502919, val = {int_val = 546014832, bigint_val = 13430916720, double_val = 6.6357545435067967e-314, 
              time_val = 13430916720, string_val = 0x3208b8670 <error: Cannot access memory at address 0x3208b8670>, str_val = {
                s = 0x3208b8670 <error: Cannot access memory at address 0x3208b8670>, len = 0}, blob_val = {
                s = 0x3208b8670 <error: Cannot access memory at address 0x3208b8670>, len = 0}, bitmap_val = 546014832}}, {type = DB_STR, nul = 0, free = 904, val = {
              int_val = 30300000, bigint_val = 30300000, double_val = 1.497018906898977e-316, time_val = 30300000, 
              string_val = 0x1ce5760 <error: Cannot access memory at address 0x1ce5760>, str_val = {s = 0x1ce5760 <error: Cannot access memory at address 0x1ce5760>, 
                len = 122630}, blob_val = {s = 0x1ce5760 <error: Cannot access memory at address 0x1ce5760>, len = 122630}, bitmap_val = 30300000}}}
        key_set = 0x7f8699574b10
        val_set = {type = DB_INT, nul = 0, free = -1722332400, val = {int_val = 0, bigint_val = 0, double_val = 0, time_val = 0, string_val = 0x0, str_val = {s = 0x0, 
              len = 7}, blob_val = {s = 0x0, len = 7}, bitmap_val = 0}}
        list = <optimized out>
        j = <optimized out>
        partition = 0x7f8699724278
        __func__ = "ds_flusher_routine"
#1  0x000055a91eb1327a in handle_timer_job () at timer.c:868
        t = 0x7f86997246d0
        l = <optimized out>
        __FUNCTION__ = "handle_timer_job"
#2  0x000055a91ec03325 in handle_io (idx=<optimized out>, event_type=<optimized out>, fm=<optimized out>) at net/net_udp.c:276
        n = <optimized out>
        read = 32646
        n = <optimized out>
        read = <optimized out>
        __FUNCTION__ = "handle_io"
#3  io_wait_loop_epoll (repeat=0, t=1, h=<optimized out>) at net/../io_wait_loop.h:308
        ret = <optimized out>
        n = <optimized out>
        r = 3
        i = <optimized out>
        e = 0x7f86c91da8c0
        ep_event = {events = 515097640, data = {ptr = 0x99743b0000005500, fd = 21760, u32 = 21760, u64 = 11057527856287470848}}
        fd = <optimized out>
        curr_time = <optimized out>
        __FUNCTION__ = "io_wait_loop_epoll"
#4  0x000055a91ec08605 in udp_start_processes (chd_rank=<optimized out>, startup_done=<optimized out>) at net/net_udp.c:497
        si = <optimized out>
        p_id = <optimized out>
        i = <optimized out>
        p = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging--  
        __FUNCTION__ = "udp_start_processes"
#5  0x000055a91eaaa589 in main_loop () at main.c:798
        startup_done = 0x0
        last_check = 0
        rc = <optimized out>
        chd_rank = 20
        startup_done = <optimized out>
        last_check = <optimized out>
        rc = <optimized out>
        __FUNCTION__ = "main_loop"
#6  main (argc=<optimized out>, argv=<optimized out>) at main.c:1483
        c = <optimized out>
        r = <optimized out>
        tmp = 0x7fffa3630eb8 ""
        tmp_len = <optimized out>
        port = <optimized out>
        proto = <optimized out>
        protos_no = <optimized out>
        options = 0x55a91ec50ec8 "f:cCm:M:b:l:n:N:rRvdDFEVhw:t:u:g:p:P:G:W:o:a:k:s:"
        seed = 1064461569
        rfd = <optimized out>
        __FUNCTION__ = "main"
(gdb) 
volga629-1 commented 4 years ago

Doing more testing and seems that opensips modules can't connect properly with Pgpool II. If I connect directly to database port opensips goes in loop on start up. I tried enabled debug, but I don't see what error causing goes into the loop

May 18 07:25:24 aitossbc01 /usr/sbin/opensips[747946]: ERROR:core:db_do_init: could not add connection to the pool
May 18 07:25:24 aitossbc01 /usr/sbin/opensips[747938]: ERROR:core:db_do_init: could not add connection to the pool
May 18 07:25:24 aitossbc01 /usr/sbin/opensips[747946]: ERROR:clusterer:child_init: cannot initialize database connection
May 18 07:25:24 aitossbc01 /usr/sbin/opensips[747938]: ERROR:clusterer:child_init: cannot initialize database connection
May 18 07:25:24 aitossbc01 /usr/sbin/opensips[747946]: ERROR:core:init_mod_child: failed to initializing module clusterer, rank 161
May 18 07:25:24 aitossbc01 /usr/sbin/opensips[747938]: ERROR:core:init_mod_child: failed to initializing module clusterer, rank 155
May 18 07:25:24 aitossbc01 /usr/sbin/opensips[747946]: ERROR:core:send_status_code: write(-1) failed (-1): 32, Broken pipe
May 18 07:25:24 aitossbc01 /usr/sbin/opensips[747938]: ERROR:core:tcp_start_processes: init_children failed
May 18 07:25:24 aitossbc01 /usr/sbin/opensips[747938]: ERROR:core:send_status_code: write(-1) failed (-1): 32, Broken pipe
May 18 07:25:24 aitossbc01 /usr/sbin/opensips[747946]: ERROR:core:start_timer_extra_processes: failed to send -1 status code
May 18 07:25:24 aitossbc01 /usr/sbin/opensips[747938]: ERROR:core:tcp_start_processes: failed to send -1 status code
May 18 07:25:48 aitossbc01 /usr/sbin/opensips[747830]: CRITICAL:core:sig_usr: segfault in process pid: 747830, id: 81

RPM use

Name         : libpq-devel
Version      : 12.2
Release      : 1.fc31
volga629-1 commented 4 years ago

@bogdan-iancu @liviuchircu @razvancrainea Opensips team I know you are working on new release 3.1, but I really need you attention to this ticket, because it absolutely blocked my work.

Thank you.

volga629-1 commented 4 years ago

I tested other deployment and seems like postgresql driver can segfault other modules. That happened with remote database when on loop back everything seems ok.

May 20 05:29:14 dev1-opensips kernel: opensips[114266]: segfault at 3 ip 00007f3b92e67029 sp 00007ffe23824908 error 6 in libcrypto.so.1.1.1d[7f3b92cc6000+1a2000]
May 20 05:29:14 dev1-opensips kernel: Code: 00 48 31 c0 48 83 fe 0f 73 17 48 83 fe 00 74 0c 88 07 48 83 ee 01 48 8d 7f 01 75 f4 f3 c3 0f 1f 00 48 f7 c7 07 00 00 00 74 0c <88
> 07 48 8d 76 ff 48 8d 7f 01 eb eb 48 89 07 48 8d 76 f8 48 f7 c6
May 20 05:29:14 dev1-opensips audit[114266]: ANOM_ABEND auid=4294967295 uid=387 gid=386 ses=4294967295 pid=114266 comm="opensips" exe="/usr/sbin/opensips" sig=11 res=1
May 20 05:31:06 dev1-opensips abrt-notification[114549]: Process 26182 (opensips) crashed in OPENSSL_cleanse()

Posgresql log

2020-05-20 03:42:17.066 EDT [367697] LOG:  could not receive data from client: Connection reset by peer
2020-05-20 03:42:17.101 EDT [367693] LOG:  could not receive data from client: Connection reset by peer
2020-05-20 03:42:17.103 EDT [367688] LOG:  could not receive data from client: Connection reset by peer
2020-05-20 03:42:17.107 EDT [367700] LOG:  could not receive data from client: Connection reset by peer
volga629-1 commented 4 years ago

Here Some interesting log

May 20 04:19:41 dev1-opensips /usr/sbin/opensips[107828]: ERROR:core:send_status_code: write(0) failed (-1): 32, Broken pipe
May 20 04:19:41 dev1-opensips /usr/sbin/opensips[107828]: ERROR:core:start_timer_extra_processes: failed to send 0 status code
May 20 04:21:12 dev1-opensips opensips[107960]: CRITICAL:core:qm_free_dbg: bad pointer 0xa000000107 (out of memory block!) - aborting
May 20 04:24:23 dev1-opensips /usr/sbin/opensips[108298]: ERROR:db_postgres:db_postgres_new_connection: timeout expired
May 20 04:24:23 dev1-opensips /usr/sbin/opensips[108298]: ERROR:db_postgres:db_postgres_new_connection: cleaning up 0x7fd9f3638de0=pkg_free()
May 20 04:24:23 dev1-opensips /usr/sbin/opensips[108299]: ERROR:db_postgres:db_postgres_new_connection: timeout expired
stale[bot] commented 4 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

volga629-1 commented 4 years ago

work in progress

volga629-1 commented 4 years ago

@liviuchircu Can you please let me know what possible to do about this issue. It really affect my work.

liviuchircu commented 4 years ago

@volga629-1 I'm doing my best to investigate it during this week.

johandeclercqdemocon commented 4 years ago

Volga, I couldn't reproduce on debian 10 using today's head (06/19/2020).

volga629-1 commented 4 years ago

Here are most recent segfault

mymbacktrace.txt

liviuchircu commented 4 years ago

@volga629-1 sorry for not leaving an update on the 12th -- I took a week off afterwards. During my tests, I couldn't get it to crash when connecting to either local or remote Postgres. Can you please try and bypass pgpool (connect directly to Postgres) and let me know if it works for you, first of all? I suspect the problem is around pgpool.

BTW: regarding your last backtraces, are you somehow running a killall opensips command? This is NOT the correct way to stop OpenSIPS 2.4+ instances. Either do a graceful shutdown with kill <attendant-pid>, or just abruptly stop all processes with killall -9 opensips. Anything in-between will cause problems and weird crashes such as the ones above.

volga629-1 commented 4 years ago

that kill because pid is not responding and systemd will use -, normally in service file we use pkill -u opensips. Regard direct connection I tested and directly top postgresql it failing with segfault

[root@aitossbc01 ccpp-2020-06-20-20:21:59.807009-325989]# gdb coredump
GNU gdb (GDB) Fedora 8.3.50.20190824-30.fc31
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
[New LWP 325989]
Reading symbols from /usr/sbin/opensips...
Reading symbols from /usr/lib/debug/usr/sbin/opensips-3.1.0-6.fc31.x86_64.debug...

warning: Ignoring non-absolute filename: <linux-vdso.so.1>
Missing separate debuginfo for linux-vdso.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/31/089c6abd0e7673e62e23adb494c06a60e4884c
Missing separate debuginfo for /usr/lib64/opensips/modules/tls_mgm.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/2e/c18793aada33cf4271049798c9dfc9f9ee445c
Missing separate debuginfo for /usr/lib64/opensips/modules/cachedb_cassandra.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/c8/c6041c0cceaa482d627ae38445c75a4ba61192
Missing separate debuginfo for /lib64/libstdc++.so.6
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/14/55764bf748129dd38a0d7ed9e3a8f4c0261c65
Missing separate debuginfo for /usr/lib64/opensips/modules/httpd.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/04/799f2c957a47856b5c3c39704d95a065ecd2cf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /usr/lib64/opensips/modules/tls_mgm.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/2e/c18793aada33cf4271049798c9dfc9f9ee445c.debug
Missing separate debuginfo for /usr/lib64/opensips/modules/cachedb_cassandra.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/c8/c6041c0cceaa482d627ae38445c75a4ba61192.debug
Missing separate debuginfo for /lib64/libstdc++.so.6
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/14/55764bf748129dd38a0d7ed9e3a8f4c0261c65.debug
Missing separate debuginfo for /usr/lib64/opensips/modules/httpd.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/04/799f2c957a47856b5c3c39704d95a065ecd2cf.debug
Core was generated by `/usr/sbin/opensips -P /run/opensips/opensips.pid -f /etc/opensips/opensips.cfg'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50    return ret;
Missing separate debuginfos, use: dnf debuginfo-install libunistring-0.9.10-6.fc31.x86_64
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f0e36c388d9 in __GI_abort () at abort.c:79
#2  0x00007f0e36c934af in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f0e36da2f4b "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f0e36c9aa9c in malloc_printerr (str=str@entry=0x7f0e36da4fc0 "malloc(): invalid size (unsorted)") at malloc.c:5339
#4  0x00007f0e36c9da14 in _int_malloc (av=av@entry=0x7f0e36dd59e0 <main_arena>, bytes=bytes@entry=8192) at malloc.c:3736
#5  0x00007f0e36c9ff55 in __libc_calloc (n=n@entry=1, elem_size=elem_size@entry=8192) at malloc.c:3428
#6  0x00007f0e36c91ed8 in __GI___open_memstream (bufloc=bufloc@entry=0x7ffc2757c120, sizeloc=sizeloc@entry=0x7ffc2757c128) at memstream.c:83
#7  0x00007f0e36d0e9e5 in __vsyslog_internal (pri=134, fmt=0x564fec960bd0 "INFO:core:%s: signal %d received\n", ap=0x7ffc2757c210, mode_flags=2) at ../misc/syslog.c:181
#8  0x00007f0e36d0f033 in __syslog_chk (pri=<optimized out>, flag=<optimized out>, fmt=<optimized out>) at ../misc/syslog.c:136
#9  0x0000564fec8416f9 in syslog (__fmt=0x564fec960bd0 "INFO:core:%s: signal %d received\n", __pri=<optimized out>) at /usr/include/bits/syslog.h:31
#10 sig_usr (signo=<optimized out>) at main.c:683
#11 <signal handler called>
#12 _int_malloc (av=av@entry=0x7f0e36dd59e0 <main_arena>, bytes=bytes@entry=48) at malloc.c:3766
#13 0x00007f0e36c9f1c4 in __GI___libc_malloc (bytes=bytes@entry=48) at malloc.c:3058
#14 0x0000564fec8772e2 in dbg_ht_update (htable=htable@entry=0x7ffc2757ca50, file=0x7f0e23793000 "node_info.c", func=0x7f0e23794010 <__FUNCTION__.11705> "load_db_info", 
    line=383, size=8) at mem/mem_dbg_hash.c:44
#15 0x0000564fec87a765 in qm_status_dbg (qm=0x7f0e34c10010) at mem/q_malloc_dyn.h:438
#16 0x0000564fec840bfd in rpc_process_terminate (sender_id=<optimized out>, code=0x0) at main.c:450
#17 0x0000564fec80159b in ipc_handle_job (fd=<optimized out>) at ipc.c:257
#18 0x0000564fec92d4e7 in handle_io (event_type=<optimized out>, idx=<optimized out>, fm=<optimized out>) at net/net_tcp.c:1539
#19 io_wait_loop_epoll (h=<optimized out>, t=<optimized out>, repeat=<optimized out>) at net/../io_wait_loop.h:311
#20 tcp_main_server () at net/net_tcp.c:1703
#21 0x0000564fec931911 in tcp_start_listener () at net/net_tcp.c:2122
#22 0x0000564fec7ae4ac in main_loop () at main.c:819
#23 main (argc=<optimized out>, argv=<optimized out>) at main.c:1488
(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {16384, 139699025443368, 11, 139699025340792, 139698106269696, 139698717485424, 140720968548896, 140720968548912, 139698717485406, 139699025443368, 
            0, 139699025340792, 139698106269696, 139698717671172, 140720968548960, 140720968548992}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007f0e36c388d9 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x200000002, sa_sigaction = 0x200000002}, sa_mask = {__val = {2147483648, 140720308486144, 139699025443368, 11, 
              140720968549408, 0, 139698717485424, 140720968549328, 140720968549344, 139698717485406, 139699025443368, 0, 139699025340792, 140720968549664, 
              139698717671172, 140720968549392}}, sa_flags = 660063280, sa_restorer = 0x7f0e24713eff}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007f0e36c934af in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f0e36da2f4b "%s\n") at ../sysdeps/posix/libc_fatal.c:181
        ap = {{gp_offset = 24, fp_offset = 4294967295, overflow_arg_area = 0x7ffc2757bfc0, reg_save_area = 0x7ffc2757bf50}}
        fd = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#3  0x00007f0e36c9aa9c in malloc_printerr (str=str@entry=0x7f0e36da4fc0 "malloc(): invalid size (unsorted)") at malloc.c:5339
No locals.
#4  0x00007f0e36c9da14 in _int_malloc (av=av@entry=0x7f0e36dd59e0 <main_arena>, bytes=bytes@entry=8192) at malloc.c:3736
        next = <optimized out>
        iters = 0
        nb = <optimized out>
        idx = 107
        bin = <optimized out>
        victim = <optimized out>
        size = <optimized out>
        victim_index = <optimized out>
        remainder = <optimized out>
        remainder_size = <optimized out>
        block = <optimized out>
        bit = <optimized out>
        map = <optimized out>
        fwd = <optimized out>
        bck = <optimized out>
        tcache_unsorted_count = 0
        tcache_nb = 0
        tc_idx = 511
        return_cached = 0
--Type <RET> for more, q to quit, c to continue without paging--c
        __PRETTY_FUNCTION__ = "_int_malloc"
#5  0x00007f0e36c9ff55 in __libc_calloc (n=n@entry=1, elem_size=elem_size@entry=8192) at malloc.c:3428
        av = <optimized out>
        oldtop = 0x564fed6d15f0
        p = <optimized out>
        sz = 8192
        csz = <optimized out>
        oldtopsize = 84496
        mem = <optimized out>
        clearsize = <optimized out>
        nclears = <optimized out>
        d = <optimized out>
        bytes = 8192
        hook = <optimized out>
        __PRETTY_FUNCTION__ = "__libc_calloc"
#6  0x00007f0e36c91ed8 in __GI___open_memstream (bufloc=bufloc@entry=0x7ffc2757c120, sizeloc=sizeloc@entry=0x7ffc2757c128) at memstream.c:83
        new_f = 0x564fed6c9750
        buf = <optimized out>
#7  0x00007f0e36d0e9e5 in __vsyslog_internal (pri=134, fmt=0x564fec960bd0 "INFO:core:%s: signal %d received\n", ap=0x7ffc2757c210, mode_flags=2) at ../misc/syslog.c:181
        now_tm = {tm_sec = 0, tm_min = 0, tm_hour = 10, tm_mday = 2, tm_mon = 987, tm_year = 0, tm_wday = 0, tm_yday = 0, tm_isdst = 0, tm_gmtoff = 0, tm_zone = 0xffffffffffffffff <error: Cannot access memory at address 0xffffffffffffffff>}
        now = 140720968550432
        fd = <optimized out>
        f = <optimized out>
        buf = 0x0
        bufsize = 0
        msgoff = <optimized out>
        saved_errno = 9
        failbuf = "6\000\000\000\000\000\000\000(z\311\066\016\177\000\000\000\000\000\000\000\000\000\000h\r\000\000"
        clarg = <optimized out>
#8  0x00007f0e36d0f033 in __syslog_chk (pri=<optimized out>, flag=<optimized out>, fmt=<optimized out>) at ../misc/syslog.c:136
        ap = {{gp_offset = 24, fp_offset = 48, overflow_arg_area = 0x7ffc2757c2f0, reg_save_area = 0x7ffc2757c230}}
#9  0x0000564fec8416f9 in syslog (__fmt=0x564fec960bd0 "INFO:core:%s: signal %d received\n", __pri=<optimized out>) at /usr/include/bits/syslog.h:31
No locals.
#10 sig_usr (signo=<optimized out>) at main.c:683
        status = 0
        pid = <optimized out>
        __FUNCTION__ = "sig_usr"
#11 <signal handler called>
No locals.
#12 _int_malloc (av=av@entry=0x7f0e36dd59e0 <main_arena>, bytes=bytes@entry=48) at malloc.c:3766
        p = <optimized out>
        next = 0x564fed67d790
        iters = <optimized out>
        nb = <optimized out>
        idx = 4
        bin = <optimized out>
        victim = 0x564fed675fc0
        size = 30672
        victim_index = <optimized out>
        remainder = 0x564fed676000
        remainder_size = 30608
        block = <optimized out>
        bit = <optimized out>
        map = <optimized out>
        fwd = <optimized out>
        bck = <optimized out>
        tcache_unsorted_count = 0
        tcache_nb = 64
        tc_idx = 2
        return_cached = 0
        __PRETTY_FUNCTION__ = "_int_malloc"
#13 0x00007f0e36c9f1c4 in __GI___libc_malloc (bytes=bytes@entry=48) at malloc.c:3058
        ar_ptr = <optimized out>
        victim = <optimized out>
        hook = <optimized out>
        tbytes = <optimized out>
        tc_idx = <optimized out>
        __PRETTY_FUNCTION__ = "__libc_malloc"
#14 0x0000564fec8772e2 in dbg_ht_update (htable=htable@entry=0x7ffc2757ca50, file=0x7f0e23793000 "node_info.c", func=0x7f0e23794010 <__FUNCTION__.11705> "load_db_info", line=383, size=8) at mem/mem_dbg_hash.c:44
        hash_code = <optimized out>
        entry = 0x0
        new = <optimized out>
#15 0x0000564fec87a765 in qm_status_dbg (qm=0x7f0e34c10010) at mem/q_malloc_dyn.h:438
        f = 0x7f0e34cac900
        i = <optimized out>
        j = <optimized out>
        h = <optimized out>
        unused = <optimized out>
        allocd = {0x0 <repeats 19 times>, 0x564fed680220, 0x0 <repeats 53 times>, 0x564fed67ffa0, 0x0 <repeats 56 times>, 0x564fed675a50, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x564fed67ffe0, 0x0 <repeats 14 times>, 0x564fed675f90, 0x0, 0x564fed67fe20, 0x0 <repeats 40 times>, 0x564fed675990, 0x0 <repeats 23 times>, 0x564fed675c50, 0x0 <repeats 38 times>, 0x564fed680160, 0x564fed6759d0, 0x0 <repeats 12 times>, 0x564fed67fe60, 0x564fed675c90, 0x0 <repeats 46 times>, 0x564fed67fea0, 0x564fed680060, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x564fed675cd0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x564fed675a90, 0x564fed675b50, 0x0 <repeats 60 times>, 0x564fed67ff20, 0x0 <repeats 47 times>, 0x564fed680120, 0x0, 0x564fed675bd0, 0x0 <repeats 583 times>, 0x564fed675d50, 0x564fed6801e0, 0x0 <repeats 54 times>, 0x564fed67ff60, 0x0 <repeats 14 times>, 0x564fed680020, 0x0 <repeats 46 times>, 0x564fed680260, 0x0, 0x0...}
        it = <optimized out>
        __FUNCTION__ = "qm_status_dbg"
#16 0x0000564fec840bfd in rpc_process_terminate (sender_id=<optimized out>, code=0x0) at main.c:450
        __FUNCTION__ = "rpc_process_terminate"
#17 0x0000564fec80159b in ipc_handle_job (fd=<optimized out>) at ipc.c:257
        job = {snd_proc = 0, handler_type = 0, payload1 = 0x564fec840b50 <rpc_process_terminate>, payload2 = 0x0}
        n = <optimized out>
        __FUNCTION__ = "ipc_handle_job"
#18 0x0000564fec92d4e7 in handle_io (event_type=<optimized out>, idx=<optimized out>, fm=<optimized out>) at net/net_tcp.c:1539
        ret = <optimized out>
        ret = <optimized out>
        __FUNCTION__ = "handle_io"
#19 io_wait_loop_epoll (h=<optimized out>, t=<optimized out>, repeat=<optimized out>) at net/../io_wait_loop.h:311
        ep_event = {events = 885688832, data = {ptr = 0x34ca891800007f0e, fd = 32526, u32 = 32526, u64 = 3804003571440058126}}
        curr_time = <optimized out>
        ret = <optimized out>
        n = <optimized out>
        r = 3
        e = 0x7f0e34ce9518
        fd = <optimized out>
        i = <optimized out>
        ret = <optimized out>
        n = <optimized out>
        r = <optimized out>
        i = <optimized out>
        e = <optimized out>
        ep_event = <optimized out>
        fd = <optimized out>
        curr_time = <optimized out>
        __FUNCTION__ = "io_wait_loop_epoll"
#20 tcp_main_server () at net/net_tcp.c:1703
        last_sec = 0
        flags = <optimized out>
        si = <optimized out>
        n = <optimized out>
        __FUNCTION__ = "tcp_main_server"
#21 0x0000564fec931911 in tcp_start_listener () at net/net_tcp.c:2122
        p_id = <optimized out>
        __FUNCTION__ = "tcp_start_listener"
#22 0x0000564fec7ae4ac in main_loop () at main.c:819
        startup_done = 0x0
        last_check = 0
        rc = <optimized out>
        chd_rank = 113
        startup_done = <optimized out>
        last_check = <optimized out>
        rc = <optimized out>
        __FUNCTION__ = "main_loop"
#23 main (argc=<optimized out>, argv=<optimized out>) at main.c:1488
        c = <optimized out>
        r = <optimized out>
        tmp = 0x7ffc27580eb9 ""
        tmp_len = <optimized out>
        port = <optimized out>
        proto = <optimized out>
        protos_no = <optimized out>
        options = 0x564fec9614e8 "f:cCm:M:b:l:n:N:rRvdDFEVhw:t:u:g:p:P:G:W:o:a:k:s:"
        seed = 3409661352
        rfd = 3
        __FUNCTION__ = "main"
(gdb) 
liviuchircu commented 4 years ago

This is another corefile due to the unsafe killall opensips command, so it's no good. Please make each process write its core to a separate file with this command:

echo 1 >/proc/sys/kernel/core_uses_pid
volga629-1 commented 4 years ago

Here are debug of start up The opensips pid is exit 255.

[root@aitossbc01 ~]# systemctl status opensips
● opensips.service - OpenSIPS is a very fast and flexible SIP (RFC3261) server
   Loaded: loaded (/usr/lib/systemd/system/opensips.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2020-06-22 08:03:21 CDT; 12min ago
     Docs: man:opensips
  Process: 968648 ExecStartPre=/usr/bin/touch /run/opensips/keepalive_state (code=exited, status=0/SUCCESS)
  Process: 968649 ExecStartPre=/usr/bin/chown opensips:opensips /run/opensips/keepalive_state (code=exited, status=0/SUCCESS)
  Process: 968650 ExecStartPre=/usr/bin/chmod 660 /run/opensips/keepalive_state (code=exited, status=0/SUCCESS)
  Process: 968651 ExecStart=/usr/sbin/opensips -P /run/opensips/opensips.pid -f /etc/opensips/opensips.cfg -m $S_MEMORY -M $P_MEMORY $OPTIONS (code=killed, signal=TERM)
      CPU: 3.301s

debug-startup.txt

liviuchircu commented 4 years ago

@volga629-1 this issue seems more and more specific to your setup / fine tuning. First, notice how your OpenSIPS successfully connects to Postgres during mod_init() and manages to pull some clustering data:

un 22 08:04:37 [969029] DBG:db_postgres:db_postgres_new_connection: opening connection: postgres://xxxx:xxxx@10.100.101.7:15432/opensips
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_val2str: PQescapeStringConn: in: 9 chars, out: 9 chars
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_submit_query: 0x7f68df41e298 PQsendQuery(select table_version from version where table_name='clusterer')
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_store_result: 0x7f68df41e298 PQresultStatus(PGRES_TUPLES_OK) PQgetResult(0x5643ea11a810)
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_get_columns: 1 columns returned from the query
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_get_columns: RES_NAMES(0x7f68df41e8b8)[0]=[table_version]
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_get_columns: use DB_INT result type
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_convert_rows: allocate for 1 columns 8 bytes in row buffer at 0x7f68df41e930
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_convert_rows: PQgetvalue(0x7f68df41e298,0,0)=[4]
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_convert_rows: allocated 1 bytes for row_buf[0] at 0x7f68df41ea28
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_convert_rows: [0][0] Column[table_version]=[4]
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_str2val: converting INT [4]
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_convert_rows: freeing row_buf[0] at 0x7f68df41ea28
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_convert_rows: freeing row buffer at 0x7f68df41e930
Jun 22 08:04:37 [969029] DBG:db_postgres:free_query: PQclear(0x5643ea11a810) result set
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_submit_query: 0x7f68df41e298 PQsendQuery(select cluster_id from clusterer where node_id=1)
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_store_result: 0x7f68df41e298 PQresultStatus(PGRES_TUPLES_OK) PQgetResult(0x5643ea11a810)
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_get_columns: 1 columns returned from the query
Jun 22 08:04:37 [969029] DBG:db_postgres:db_postgres_get_columns: RES_NAMES(0x7f68df41e8b8)[0]=[cluster_id]

... but, after forking 110+ worker processes, according to these logs:

Jun 22 08:04:42 [969029] INFO:core:shutdown_opensips: process 1(969030) [MI FIFO] terminated, still waiting for 115 more
Jun 22 08:04:42 [969029] INFO:core:shutdown_opensips: process 2(969031) [HTTPD 10.100.101.4:8092] terminated, still waiting for 114 more
Jun 22 08:04:42 [969029] INFO:core:shutdown_opensips: process 3(969032) [time_keeper] terminated, still waiting for 113 more

... you seem to hit some limitation around the Postgres configuration or the firewalling around it, since after 5 seconds (default Postgres timeout), some of those 100+ connect attempts start to fail:

Jun 22 08:04:37 [969091] DBG:db_postgres:db_postgres_new_connection: 0x7f68df464e30=pkg_malloc(96)
Jun 22 08:04:37 [969091] DBG:db_postgres:db_postgres_new_connection: opening connection: postgres://xxxx:xxxx@10.100.101.7:15432/opensips
...
Jun 22 08:04:42 [969091] ERROR:db_postgres:db_postgres_new_connection: timeout expired
Jun 22 08:04:42 [969091] ERROR:db_postgres:db_postgres_new_connection: cleaning up 0x7f68df464e30=pkg_free()
Jun 22 08:04:42 [969091] ERROR:core:db_do_init: could not add connection to the pool
Jun 22 08:04:42 [969091] ERROR:clusterer:child_init: cannot initialize database connection
Jun 22 08:04:42 [969091] ERROR:core:init_mod_child: failed to initializing module clusterer, rank 55
Jun 22 08:04:42 [969091] DBG:core:send_status_code: sent code -1 (1 byte)
Jun 22 08:04:42 [969029] DBG:core:handle_sigs: OpenSIPS exit status = 65280
Jun 22 08:04:42 [969029] INFO:core:handle_sigs: child process 969091 exited normally, status=255

Some advice:

volga629-1 commented 4 years ago

Hello @liviuchircu
Based on you recommendations I checked all firewalls and all sysctl on path to database, make sure no TCP tweaks are set to some critical values. The only help full here are is udp_worker or tcp_worker, but the question how many database connection is open each worker. Seems like is excessive. Cluster of 3 nodes and PgSQL over 150 open connections.

liviuchircu commented 4 years ago

Well, you are forking 115 processes on each cluster node, so 150 open connections seems reasonable, given that the theoretical maximum is 115 x 3 ~= 350.

Maybe try to use the listen = udp:127.0.0.1:5060 use_workers 5 syntax (limit forked workers per listening interface) and reduce the number of forked workers which are not really useful?

volga629-1 commented 4 years ago

The all above solution is reducing capacity. I was be able to start only with reduced amount of tcp/udp_workers. PgSQL max connections is already set to allow pretty decent amount.

liviuchircu commented 4 years ago

Then there is nothing to fix here, as I do not consider the "1 conn per worker" design to be an OpenSIPS bug/limitation, nor could we ever backport a patch that shifts it to a "DB workers vs. SIP workers" design. Each design has its trade-offs. With the current one, you get minimal sharing overhead and maximal performance (each process is doing the DB work individually, with no IPC), at the expense of some extra DB connections.