OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org
Other
1.27k stars 578 forks source link

[BUG] Active dialogs skyrocket #3321

Open Lt-Flash opened 7 months ago

Lt-Flash commented 7 months ago

OpenSIPS version you are running

version: opensips 3.3.9 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-FUTEX-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 39c4e245f
main.c compiled on 20:50:23 Feb 15 2024 with gcc 9

Describe the bug

I'm using a dialog module with just a memory storage, no MySQL or any other storage is connected. Also I'm using a full-sharing cluster of three nodes. Initially dialogs were showing statistics correctly but then suddenly it started to show a huge number of active dialogs. In opensips-cli -x mi dlg_list it only shows real dialog number, but in opensips-cli -x mi get_statistics all it show following (related to dialog module):

    "dialog:active_dialogs": 1.844674e+19,
    "dialog:early_dialogs": 12,
    "dialog:processed_dialogs": 64701,
    "dialog:expired_dialogs": 0,
    "dialog:failed_dialogs": 36074,
    "dialog:create_sent": 9115,
    "dialog:update_sent": 10587,
    "dialog:delete_sent": 9121,
    "dialog:create_recv": 19457,
    "dialog:update_recv": 22791,
    "dialog:delete_recv": 19483,

Here's a graph built by Grafana using Prometheus module for OpenSIPS:

image

To Reproduce

Just start OpenSIPS and wait for some time, usually within several hours it starts to display this number.

Expected behavior

Active dialogs statistic value should be showing correct number of active dialogs.

Relevant System Logs

OS/environment information

Additional context

Lt-Flash commented 7 months ago

Yesterday I've updated OpenSIPS to 3.4.4 and today the same issue has happened again.

OpenSIPS version:

version: opensips 3.4.4 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-FUTEX-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 0a7c2f2ed
main.c compiled on 09:31:12 Feb 29 2024 with gcc 9

Graph: image

opensips-cli -x mi get_statistics dialog::

{
    "dialog:active_dialogs": 1.844674e+19,
    "dialog:early_dialogs": 28,
    "dialog:processed_dialogs": 75001,
    "dialog:expired_dialogs": 0,
    "dialog:failed_dialogs": 40892,
    "dialog:create_sent": 11056,
    "dialog:update_sent": 12668,
    "dialog:delete_sent": 11052,
    "dialog:create_recv": 23045,
    "dialog:update_recv": 27022,
    "dialog:delete_recv": 23037
}
bogdan-iancu commented 7 months ago

It looks like an underflow here, of the dialog counter. What are the clustering settings for the dialog module?

Lt-Flash commented 7 months ago

Hi Bogdan, Here's a full config of that server, that's a full-sharing cluster of three nodes. Interesting that I have another cluster of three nodes in front of this one and that issue doesn't occur there. Also I have a report in our local Telegram group from someone who has the same issue and they're neither using autoscaler nor cluster at all. Thanks!

####### Global Parameters #########
log_level=3
xlog_level=3
stderror_enabled=no
syslog_enabled=yes
syslog_facility=LOG_LOCAL0

auto_scaling_profile = PROFILE_TCP
     scale up to 32 on 60% for 4 cycles within 5
     scale down to 2 on 10% for 10 cycles

auto_scaling_profile = PROFILE_UDP
     scale up to 32 on 60% for 4 cycles within 5
     scale down to 2 on 10% for 10 cycles

udp_workers=2 use_auto_scaling_profile PROFILE_UDP
tcp_workers=2 use_auto_scaling_profile PROFILE_TCP

#timer_workers=4

auto_aliases=no

socket=bin:10.X.X.X:3857
socket=udp:10.X.X.X:5060 tag vip
socket=hep_udp:10.X.X.X

mhomed = no

dns=no
rev_dns=no
dns_use_search_list=no

server_header="Server: Server"
user_agent_header="User-Agent: UA"

####### Modules Section ########

#set module path
mpath="/opt/opensips/lib64/opensips/modules/"

# MAX ForWarD module
loadmodule "maxfwd.so"

# SIP MSG OPerationS module
loadmodule "sipmsgops.so"

# FIFO Management Interface
loadmodule "mi_fifo.so"
modparam("mi_fifo", "fifo_name", "/opt/opensips/var/opensips_fifo")
modparam("mi_fifo", "fifo_mode", 0666)

# SIGNALING module
loadmodule "signaling.so"

# StateLess module
loadmodule "sl.so"

# Transaction Module
loadmodule "tm.so"
modparam("tm", "fr_timeout", 30)
modparam("tm", "fr_inv_timeout", 90)
modparam("tm", "restart_fr_on_each_reply", 1)
modparam("tm", "onreply_avp_mode", 1)

# TextOPS module
loadmodule "textops.so"

# Record Route Module
loadmodule "rr.so"
modparam("rr", "append_fromtag", 1)
modparam("rr", "enable_double_rr", 0)
modparam("rr", "enable_socket_mismatch_warning", 1)

# USRLoc module
loadmodule "usrloc.so"
modparam("usrloc", "nat_bflag", "NAT")
modparam("usrloc", "working_mode_preset", "full-sharing-cluster")
modparam("usrloc", "use_domain", 0)
modparam("usrloc", "location_cluster", 1)
#modparam("usrloc", "contact_refresh_timer", true)

# Clusterer Module
loadmodule "clusterer.so"
modparam("clusterer", "db_mode", 0)
modparam("clusterer", "my_node_info", "cluster_id=1,node_id=1,url=bin:10.X.X.1:3857,flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=2,url=bin:10.X.X.2:3857,flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=3,url=bin:10.X.X.3:3857,flags=seed")
modparam("clusterer", "seed_fallback_interval", 5)
modparam("clusterer", "ping_interval", 1)
modparam("clusterer", "ping_timeout", 500)
modparam("clusterer", "node_timeout", 10)
modparam("clusterer", "sharing_tag", "vip1/1=active")

# Dialog module
loadmodule "dialog.so"
modparam("dialog", "dialog_replication_cluster", 1)
modparam("dialog", "profile_replication_cluster", 1)
modparam("dialog", "cluster_auto_sync", 1)
modparam("dialog", "reinvite_ping_interval", 30)

# AVPOPS module
loadmodule "avpops.so"

# Math
loadmodule "mathops.so"
modparam ("mathops", "decimal_digits", 3)

# UAC
loadmodule "uac.so"

# AUTH module
loadmodule "auth.so"
modparam("auth", "username_spec", "$var(username)")
modparam("auth", "password_spec", "$var(password)")
modparam("auth", "calculate_ha1", 0)

# Local cache
loadmodule "cachedb_local.so"
modparam("cachedb_local", "cache_clean_period", 10)

# Redis CacheDB
loadmodule "cachedb_redis.so"
modparam("cachedb_redis", "cachedb_url","redis:OS://:CONNECT_STRING/")

# Exec module
loadmodule "exec.so"

# Event Route module
loadmodule "event_route.so"

# Event Routing module
loadmodule "event_routing.so"

# Event Stream
loadmodule "event_stream.so"

# Registrar module
loadmodule "registrar.so"
modparam("registrar", "received_avp", "$avp(received)")
modparam("registrar", "min_expires", 60)
modparam("registrar", "default_expires", 120)
modparam("registrar", "max_expires", 3600)
modparam("registrar", "max_contacts", 1)
modparam("registrar", "attr_avp", "$avp(event_attr)")
modparam("registrar", "pn_enable", true)
modparam("registrar", "pn_providers", "apns, fcm, webpush")
modparam("registrar", "pn_enable_purr", true)

# REST API client
loadmodule "rest_client.so"
modparam("rest_client", "connection_timeout", 1)
modparam("rest_client", "ssl_verifypeer", 0)
modparam("rest_client", "ssl_verifyhost", 0)
modparam("rest_client", "curl_http_version", 4)

# JSON module
loadmodule "json.so"

# HTTPd Server
loadmodule "httpd.so"
modparam("httpd", "ip", "10.X.X.X")
modparam("httpd", "port", XXXX)

# HTTP Management Interface
loadmodule "mi_http.so"
modparam("mi_http", "root", "os_mi_http")

# Prometheus module
loadmodule "prometheus.so"
modparam("prometheus", "statistics", "all")

# Proto UDP
loadmodule "proto_udp.so"

# Proto BIN
loadmodule "proto_bin.so"

# Proto HEP
loadmodule "proto_hep.so"
modparam("proto_hep", "hep_capture_id", 10XXX)
modparam("proto_hep", "hep_id", "[hid]10.X.X.X:9060;transport=udp;version=3")

# Tracer
loadmodule "tracer.so"
modparam("tracer", "trace_on", 1)
modparam("tracer", "trace_id", "[tid]uri=hep:hid")
rybushkindmitry commented 7 months ago

Hi!

I have similar problems with active dialogs metric.

My setting dialog module

loadmodule "dialog.so" modparam("dialog", "dlg_match_mode", 1) modparam("dialog", "default_timeout", 3600) # 1 hours timeout modparam("dialog", "db_mode", 1) modparam("dialog", "db_url", "{{ connect_mysql }}") modparam("dialog", "table_name", "{{ dialog_table }}") modparam("dialog", "dialog_replication_cluster", {{ cluster_id }}) modparam("dialog", "cachedb_url", "{{ connect_local_dialog }}") modparam("dialog", "profiles_with_value", "some list of shared profiles;")

version: opensips 3.3.7 (x86_64/linux) flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535 poll method support: poll, epoll, sigio_rt, select. git revision: unknown main.c compiled on with gcc 10

after update to version: opensips 3.4.3 (x86_64/linux) flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535 poll method support: poll, epoll, sigio_rt, select. git revision: 3a891b908 main.c compiled on with gcc 10

and settings dialog module: loadmodule "dialog.so" modparam("dialog", "dlg_match_mode", 1) modparam("dialog", "default_timeout", 3600) # 1 hours timeout modparam("dialog", "db_mode", 1) modparam("dialog", "db_url", "{{ connect_mysql }}") modparam("dialog", "table_name", "{{ dialog_table }}")

modparam("dialog", "dialog_replication_cluster", {{ cluster_id }})

modparam("dialog", "cachedb_url", "{{ connect_local_dialog }}") modparam("dialog", "profiles_with_value", "some list of non shared profiles;")

problem is gone

Lt-Flash commented 7 months ago

Yep, but this doesn't work for me in OpenSIPS 3.4.4 :(

Lt-Flash commented 6 months ago

Any debug we can do for you, @bogdan-iancu ?

github-actions[bot] commented 6 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Lt-Flash commented 6 months ago

Noticed that from time to time dialogs return back to normal and then hit the overflow counter again:

image

github-actions[bot] commented 5 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

bogdan-iancu commented 5 months ago

@Lt-Flash , officially 3.3 is EOL, only 3.2 and 3.4 are still maintained. Why moving to 3.4 is such a no-go for you ?

Lt-Flash commented 5 months ago

Bump. Still an issue, can't monitor OpenSIPS properly. Please help someone!

Lt-Flash commented 5 months ago

@Lt-Flash , officially 3.3 is EOL, only 3.2 and 3.4 are still maintained. Why moving to 3.4 is such a no-go for you ?

Sorry, mate, somehow missed this reply! I have actually moved to 3.4 and the issue still happens, this is from 3.4:

image

github-actions[bot] commented 5 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Lt-Flash commented 5 months ago

Still doesn't work properly in 3.3.9 and 3.4.4.

bogdan-iancu commented 5 months ago

Could you try a hard reset of your cluster, meaning stopping all opensips nodes in the cluster and while down, purge the dialog tables of all nodes? Restart with basically 0 ongoing calls and see if the issue happens again

Lt-Flash commented 5 months ago

Hi Bogdan, I'm not using any DB for dialog tables or registrar or anything at all on these servers, it's a full-sharing cluster with in-memory only storage, no CacheDB except for REDIS, please have a look at my config in the original message.

I tried to do a full restart and that didn't help. Also, I'm already on a 3.4.5 version since yesterday and still getting the same overflow issues:

image

Lt-Flash commented 5 months ago

Update here - I've disabled dialog replication within the cluster and now dialogs do not skyrocket. I can see active dialogs per node, not the total amount, but there are no issues with that. Looks like another bug in cluster sync.

github-actions[bot] commented 4 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Lt-Flash commented 4 months ago

Confirmed that if I disable dialog sync within the cluster then the issue goes away so this is definitely something to do with the clusterer module.

github-actions[bot] commented 4 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Lt-Flash commented 4 months ago

Still an issue with clusterer enabled

github-actions[bot] commented 3 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Lt-Flash commented 3 months ago

Still an issue

github-actions[bot] commented 2 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Lt-Flash commented 1 month ago

Still an issue with the latest OpenSIPS 3.4.7 build from GIT.

github-actions[bot] commented 4 weeks ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Lt-Flash commented 1 week ago

Still an issue.