OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org

[BUG] Possible dialog memory leak on combination $DLG_timeout $DLG_delay_delete #3370

Open volga629-1 opened 7 months ago

volga629-1 commented 7 months ago

Version

 opensips -V
version: opensips 3.4.0 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: f3e0d5333
main.c compiled on 06:17:39 Aug  9 2023 with gcc 12

Issue

Combining $DLG_del_delay and $DLG_timeout leaves dialogs in state 5 that are never removed, which eventually leads to out-of-memory errors.

This VM has 8 GB of shared memory and 12 GB of physical RAM (not over-provisioned).

Apr 21 14:09:02 sbc5 /usr/sbin/opensips[7126]: ERROR:core:hp_shm_malloc_dbg: not enough free shm memory (3406256 bytes left, need 6568), please increase the "-m" command line parameter!
Apr 21 14:09:02 sbc5 /usr/sbin/opensips[7126]: ERROR:tm:sip_msg_cloner: no more share memory
Apr 21 14:09:02 sbc5 /usr/sbin/opensips[7126]: ERROR:tm:new_t: out of mem

Shared memory stats over 24 h

# sbc 5 21 Apr 2024
(opensips-cli): mi get_statistics all
{
    "shmem:total_size": 8589934592,
    "shmem:max_used_size": 278452104,
    "shmem:free_size": 8330746920,
    "shmem:used_size": 212396184,
    "shmem:real_used_size": 259187672,
    "shmem:fragments": 956846,

# sbc 5 22 Apr 2024
(opensips-cli): mi get_statistics all
{
    "shmem:total_size": 8589934592,
    "shmem:max_used_size": 783103000,
    "shmem:free_size": 7813004096,
    "shmem:used_size": 642682864,
    "shmem:real_used_size": 776930496,
    "shmem:fragments": 2774774,
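
To put the two snapshots side by side, here is a minimal Python sketch that computes the growth between them. The figures are copied from the statistics above. The variable names, the assumption that the snapshots are exactly 24 hours apart, and the assumption that usage keeps growing linearly are mine, not something the reporter stated.

```python
# Figures copied from the two "mi get_statistics all" snapshots above.
before = {"used_size": 212396184, "real_used_size": 259187672, "fragments": 956846}
after = {"used_size": 642682864, "real_used_size": 776930496, "fragments": 2774774}
free_after = 7813004096  # shmem:free_size in the second snapshot

MIB = 1024 * 1024

# Per-statistic growth between the two snapshots.
delta = {key: after[key] - before[key] for key in before}

# Rough projection only: assumes the snapshots are 24 h apart and that
# real_used_size keeps growing at the same rate.
days_until_exhausted = free_after / delta["real_used_size"]

print(f"used_size grew by {delta['used_size'] / MIB:.1f} MiB")
print(f"real_used_size grew by {delta['real_used_size'] / MIB:.1f} MiB")
print(f"fragments grew by {delta['fragments']}")
print(f"free shm would last about {days_until_exhausted:.1f} more days")
```

Under those assumptions the pool fills in a matter of weeks, not hours, so a slow per-dialog leak would fit these numbers better than a single large allocation.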

Code


        # CSTA INVITE
        if(!has_totag() && is_method("INVITE") && has_body("application/csta+xml")) {

                ##xlog("[REQ_ROUTE] [$rm] [$cfg_line] CSTA request from => [$si] enabling debug\n");
                # Send BYE on dialog timeout
                create_dialog("B");
                # delay ( late BYE )
                $DLG_del_delay = 180;
                $DLG_timeout = 120;

SHM dump opensips-SHM-dump.txt

github-actions[bot] commented 6 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

volga629-1 commented 6 months ago

in progress

bogdan-iancu commented 6 months ago

Do you have a minimal working cfg that reproduces the issue (i.e. how the two options are combined so that dialogs get stuck in state 5)?

volga629-1 commented 6 months ago

Hello Bogdan, this is the cfg for a regular SIP INVITE.

        # Regular INVITE without Alert Info header
        if(!has_totag() && is_method("INVITE") && !has_body("application/csta+xml")) {
                # Create dialog
                create_dialog("B");

                $DLG_timeout = 120;

                # Dialog delete delay ( late BYE )
                $DLG_del_delay = 1800;

bogdan-iancu commented 6 months ago

And how does the call terminate: via timeout or via BYE?

volga629-1 commented 6 months ago

Normally completed with BYE.

luislza commented 2 months ago

Hi,

We're seeing the same behaviour in an active / passive cluster (version 3.4.8) with active/backup sharing tags (no DB) and have narrowed the cause down to one scenario (in our case).

It seems that dialogs that are CANCELED with a response of 487 (no BYE) before answer hang around on the active server until restart - they're shown in the output of dlg_list on the active server.

These dialogs replicate correctly to the passive node and correctly disappear from the output of dlg_list on the passive node.

The odd part in the output of dlg_list on the active node seems to be that they all have the following in common:

        "state": 5,
        "timestart": 0,
        "timeout": 0

No timestart and no timeout.
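
As a way of spotting these entries without reading the whole listing, here is a minimal Python sketch that filters dialog records for the pattern described above. It assumes the dlg_list output has already been saved as JSON and reduced to a list of dialog objects. Only the `state`, `timestart` and `timeout` fields come from the output quoted here; the `callid` field, the sample values and the function name are hypothetical.

```python
import json

def find_stuck_dialogs(dialogs):
    """Return dialogs in state 5 that have neither a timestart nor a timeout."""
    return [
        d for d in dialogs
        if d.get("state") == 5 and d.get("timestart") == 0 and d.get("timeout") == 0
    ]

# Hypothetical sample data; real records carry many more fields.
sample = json.loads("""
[
  {"callid": "example-a", "state": 5, "timestart": 0, "timeout": 0},
  {"callid": "example-b", "state": 4, "timestart": 1713700000, "timeout": 1713700120},
  {"callid": "example-c", "state": 5, "timestart": 1713700000, "timeout": 0}
]
""")

stuck = find_stuck_dialogs(sample)
for dialog in stuck:
    print(dialog["callid"])
```

Running this against snapshots taken some time apart would show whether the same entries persist, which is what distinguishes a leaked dialog from one that is merely waiting to be deleted.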

Restarting the active node clears the dialogs and syncs active dialogs from the passive node correctly.

EDIT: In our case there is no delete delay set, and adding one makes no difference to the behaviour.

EDIT2: I have an opensips debug log and pcaps for this scenario. I can send them by e-mail instead of attaching them here, since they contain sensitive personal information.

Luis

ar45 commented 1 month ago

Things to check

  1. Is a DB being used for dialogs?
  2. Is writeback mode in use?
  3. How old are the dialogs in memory? Have they all exceeded $DLG_del_delay?
  4. What is $T_fr_inv_timeout set to? Is the dialog waiting for the transaction timer to cancel before deletion?

ar45 commented 1 month ago

Still active

luislza commented 1 week ago

This issue was resolved for us by upgrading from 3.4.8 to 3.4.9 without any config changes.