Open Lt-Flash opened 7 months ago
Yesterday I've updated OpenSIPS to 3.4.4 and today the same issue has happened again.
OpenSIPS version:
version: opensips 3.4.4 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-FUTEX-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 0a7c2f2ed
main.c compiled on 09:31:12 Feb 29 2024 with gcc 9
Graph:
opensips-cli -x mi get_statistics dialog:
:
{
"dialog:active_dialogs": 1.844674e+19,
"dialog:early_dialogs": 28,
"dialog:processed_dialogs": 75001,
"dialog:expired_dialogs": 0,
"dialog:failed_dialogs": 40892,
"dialog:create_sent": 11056,
"dialog:update_sent": 12668,
"dialog:delete_sent": 11052,
"dialog:create_recv": 23045,
"dialog:update_recv": 27022,
"dialog:delete_recv": 23037
}
It looks like an underflow here, of the dialog counter. What are the clustering settings for the dialog module?
Hi Bogdan, Here's a full config of that server, that's a full-sharing cluster of three nodes. Interesting that I have another cluster of three nodes in front of this one and that issue doesn't occur there. Also I have a report in our local Telegram group from someone who has the same issue and they're neither using autoscaler nor cluster at all. Thanks!
####### Global Parameters #########
log_level=3
xlog_level=3
stderror_enabled=no
syslog_enabled=yes
syslog_facility=LOG_LOCAL0
auto_scaling_profile = PROFILE_TCP
scale up to 32 on 60% for 4 cycles within 5
scale down to 2 on 10% for 10 cycles
auto_scaling_profile = PROFILE_UDP
scale up to 32 on 60% for 4 cycles within 5
scale down to 2 on 10% for 10 cycles
udp_workers=2 use_auto_scaling_profile PROFILE_UDP
tcp_workers=2 use_auto_scaling_profile PROFILE_TCP
#timer_workers=4
auto_aliases=no
socket=bin:10.X.X.X:3857
socket=udp:10.X.X.X:5060 tag vip
socket=hep_udp:10.X.X.X
mhomed = no
dns=no
rev_dns=no
dns_use_search_list=no
server_header="Server: Server"
user_agent_header="User-Agent: UA"
####### Modules Section ########
#set module path
mpath="/opt/opensips/lib64/opensips/modules/"
# MAX ForWarD module
loadmodule "maxfwd.so"
# SIP MSG OPerationS module
loadmodule "sipmsgops.so"
# FIFO Management Interface
loadmodule "mi_fifo.so"
modparam("mi_fifo", "fifo_name", "/opt/opensips/var/opensips_fifo")
modparam("mi_fifo", "fifo_mode", 0666)
# SIGNALING module
loadmodule "signaling.so"
# StateLess module
loadmodule "sl.so"
# Transaction Module
loadmodule "tm.so"
modparam("tm", "fr_timeout", 30)
modparam("tm", "fr_inv_timeout", 90)
modparam("tm", "restart_fr_on_each_reply", 1)
modparam("tm", "onreply_avp_mode", 1)
# TextOPS module
loadmodule "textops.so"
# Record Route Module
loadmodule "rr.so"
modparam("rr", "append_fromtag", 1)
modparam("rr", "enable_double_rr", 0)
modparam("rr", "enable_socket_mismatch_warning", 1)
# USRLoc module
loadmodule "usrloc.so"
modparam("usrloc", "nat_bflag", "NAT")
modparam("usrloc", "working_mode_preset", "full-sharing-cluster")
modparam("usrloc", "use_domain", 0)
modparam("usrloc", "location_cluster", 1)
#modparam("usrloc", "contact_refresh_timer", true)
# Clusterer Module
loadmodule "clusterer.so"
modparam("clusterer", "db_mode", 0)
modparam("clusterer", "my_node_info", "cluster_id=1,node_id=1,url=bin:10.X.X.1:3857,flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=2,url=bin:10.X.X.2:3857,flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=3,url=bin:10.X.X.3:3857,flags=seed")
modparam("clusterer", "seed_fallback_interval", 5)
modparam("clusterer", "ping_interval", 1)
modparam("clusterer", "ping_timeout", 500)
modparam("clusterer", "node_timeout", 10)
modparam("clusterer", "sharing_tag", "vip1/1=active")
# Dialog module
loadmodule "dialog.so"
modparam("dialog", "dialog_replication_cluster", 1)
modparam("dialog", "profile_replication_cluster", 1)
modparam("dialog", "cluster_auto_sync", 1)
modparam("dialog", "reinvite_ping_interval", 30)
# AVPOPS module
loadmodule "avpops.so"
# Math
loadmodule "mathops.so"
modparam ("mathops", "decimal_digits", 3)
# UAC
loadmodule "uac.so"
# AUTH module
loadmodule "auth.so"
modparam("auth", "username_spec", "$var(username)")
modparam("auth", "password_spec", "$var(password)")
modparam("auth", "calculate_ha1", 0)
# Local cache
loadmodule "cachedb_local.so"
modparam("cachedb_local", "cache_clean_period", 10)
# Redis CacheDB
loadmodule "cachedb_redis.so"
modparam("cachedb_redis", "cachedb_url","redis:OS://:CONNECT_STRING/")
# Exec module
loadmodule "exec.so"
# Event Route module
loadmodule "event_route.so"
# Event Routing module
loadmodule "event_routing.so"
# Event Stream
loadmodule "event_stream.so"
# Registrar module
loadmodule "registrar.so"
modparam("registrar", "received_avp", "$avp(received)")
modparam("registrar", "min_expires", 60)
modparam("registrar", "default_expires", 120)
modparam("registrar", "max_expires", 3600)
modparam("registrar", "max_contacts", 1)
modparam("registrar", "attr_avp", "$avp(event_attr)")
modparam("registrar", "pn_enable", true)
modparam("registrar", "pn_providers", "apns, fcm, webpush")
modparam("registrar", "pn_enable_purr", true)
# REST API client
loadmodule "rest_client.so"
modparam("rest_client", "connection_timeout", 1)
modparam("rest_client", "ssl_verifypeer", 0)
modparam("rest_client", "ssl_verifyhost", 0)
modparam("rest_client", "curl_http_version", 4)
# JSON module
loadmodule "json.so"
# HTTPd Server
loadmodule "httpd.so"
modparam("httpd", "ip", "10.X.X.X")
modparam("httpd", "port", XXXX)
# HTTP Management Interface
loadmodule "mi_http.so"
modparam("mi_http", "root", "os_mi_http")
# Prometheus module
loadmodule "prometheus.so"
modparam("prometheus", "statistics", "all")
# Proto UDP
loadmodule "proto_udp.so"
# Proto BIN
loadmodule "proto_bin.so"
# Proto HEP
loadmodule "proto_hep.so"
modparam("proto_hep", "hep_capture_id", 10XXX)
modparam("proto_hep", "hep_id", "[hid]10.X.X.X:9060;transport=udp;version=3")
# Tracer
loadmodule "tracer.so"
modparam("tracer", "trace_on", 1)
modparam("tracer", "trace_id", "[tid]uri=hep:hid")
Hi!
I have similar problems with active dialogs metric.
My setting dialog module
loadmodule "dialog.so" modparam("dialog", "dlg_match_mode", 1) modparam("dialog", "default_timeout", 3600) # 1 hours timeout modparam("dialog", "db_mode", 1) modparam("dialog", "db_url", "{{ connect_mysql }}") modparam("dialog", "table_name", "{{ dialog_table }}") modparam("dialog", "dialog_replication_cluster", {{ cluster_id }}) modparam("dialog", "cachedb_url", "{{ connect_local_dialog }}") modparam("dialog", "profiles_with_value", "some list of shared profiles;")
version: opensips 3.3.7 (x86_64/linux) flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535 poll method support: poll, epoll, sigio_rt, select. git revision: unknown main.c compiled on with gcc 10
after update to version: opensips 3.4.3 (x86_64/linux) flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535 poll method support: poll, epoll, sigio_rt, select. git revision: 3a891b908 main.c compiled on with gcc 10
and settings dialog module: loadmodule "dialog.so" modparam("dialog", "dlg_match_mode", 1) modparam("dialog", "default_timeout", 3600) # 1 hours timeout modparam("dialog", "db_mode", 1) modparam("dialog", "db_url", "{{ connect_mysql }}") modparam("dialog", "table_name", "{{ dialog_table }}")
modparam("dialog", "cachedb_url", "{{ connect_local_dialog }}") modparam("dialog", "profiles_with_value", "some list of non shared profiles;")
problem is gone
Yep, but this doesn't work for me in OpenSIPS 3.4.4 :(
Any debug we can do for you, @bogdan-iancu ?
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Noticed that from time to time dialogs return back to normal and then hit the overflow counter again:
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
@Lt-Flash , officially 3.3 is EOL, only 3.2 and 3.4 are still maintained. Why moving to 3.4 is such a no-go for you ?
Bump. Still an issue, can't monitor OpenSIPS properly. Please help someone!
@Lt-Flash , officially 3.3 is EOL, only 3.2 and 3.4 are still maintained. Why moving to 3.4 is such a no-go for you ?
Sorry, mate, somehow missed this reply! I have actually moved to 3.4 and the issue still happens, this is from 3.4:
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Still doesn't work properly in 3.3.9 and 3.4.4.
Could you try a hard reset of your cluster, meaning stopping all opensips nodes in the cluster and while down, purge the dialog tables of all nodes? Restart with basically 0 ongoing calls and see if the issue happens again
Hi Bogdan, I'm not using any DB for dialog tables or registrar or anything at all on these servers, it's a full-sharing cluster with in-memory only storage, no CacheDB except for REDIS, please have a look at my config in the original message.
I tried to do a full restart and that didn't help. Also, I'm already on a 3.4.5 version since yesterday and still getting the same overflow issues:
Update here - I've disabled dialog replication within the cluster and now dialogs do not skyrocket. I can see active dialogs per node, not the total amount, but there are no issues with that. Looks like another bug in cluster sync.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Confirmed that if I disable dialog sync within the cluster then the issue goes away so this is definitely something to do with the clusterer module.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Still an issue with clusterer enabled
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Still an issue
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Still an issue with the latest OpenSIPS 3.4.7 build from GIT.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Still an issue.
OpenSIPS version you are running
Describe the bug
I'm using a dialog module with just a memory storage, no MySQL or any other storage is connected. Also I'm using a full-sharing cluster of three nodes. Initially dialogs were showing statistics correctly but then suddenly it started to show a huge number of active dialogs. In
opensips-cli -x mi dlg_list
it only shows real dialog number, but inopensips-cli -x mi get_statistics all
it show following (related to dialog module):Here's a graph built by Grafana using Prometheus module for OpenSIPS:
To Reproduce
Just start OpenSIPS and wait for some time, usually within several hours it starts to display this number.
Expected behavior
Active dialogs statistic value should be showing correct number of active dialogs.
Relevant System Logs
OS/environment information
Additional context