Closed Lt-Flash closed 1 month ago
Perhaps this is related: https://github.com/OpenSIPS/opensips/issues/3354
I don't have 'restart persistency' and 'matching mode' set at all and 'use_domain' is set to 0, so a bit different.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Still an issue, this happens many times a day and sometimes causes usrloc to lose registrations on one of the nodes, as it auto-removes them as expired!!!
@bogdan-iancu Maybe you can help? @liviuchircu Liviu?
This is critical for us, we're losing registrations several times a day! Suddenly we're getting lots of messages like this:
[May 7 12:26:42] CRITICAL:usrloc:receive_ucontact_insert:
>>> differring rlabels (3114 vs. 3152, ci: '381f740599e94dc2ae2b2c559787de91')
It seems you have hit a programming bug.
Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues
And registrations start to drop, CPU time goes up, calls are dropping and OpenSIPS becomes unresponsive. Nothing helps; I've tried versions 3.3.9, 3.4.2, 3.4.4 and 3.4.5!
Looks like I've found what causes this. I have event routes enabled for the usrloc events:
E_UL_CONTACT_INSERT
E_UL_CONTACT_UPDATE
E_UL_CONTACT_DELETE
E_UL_AOR_INSERT
E_UL_AOR_DELETE
If any of these event routes are enabled, then during a restart OpenSIPS loads only part of the usrloc entries and then hangs at 100% CPU with the following messages in the logs:
This happens every time there is a significant number of usrloc entries in a full-sharing cluster.
As soon as I disable these routes, I can load 100k usrloc entries in 2-3 seconds without any issues:
Hope this helps to resolve this issue!
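For reference, the kind of event route I mean is sketched below. This is a minimal illustration, not my production config: the route body and the `$param(aor)` parameter name are illustrative (check the usrloc module docs for the exact parameters each event exports).

```
loadmodule "event_route.so"

# Fires on every contact insert, including inserts replicated
# from other cluster nodes during a restart/sync.
event_route[E_UL_CONTACT_INSERT] {
    xlog("L_INFO", "usrloc contact inserted, aor=$param(aor)\n");
}
```

Even a trivial handler like this is executed once per replicated contact, which may explain why startup sync slows down so dramatically with these routes enabled.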
@NormB do you have any event routes enabled for the USRLOC module?
Another thing to add: when I went crazy and set tcp_workers=256, this OpenSIPS server was able to slowly load the usrloc data with event routes enabled, but that took about 30 seconds or so to complete.
After disabling the event routes I can restart nodes and they successfully load the usrloc database. But these events are still happening, causing some differences between the nodes:
CRITICAL:usrloc:receive_ucontact_insert:
>>> differring rlabels (3376 vs. 3388, ci: '0_3120809578@10.0.44.210')
It seems you have hit a programming bug.
Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues
Yes. Have recently disabled some of them.
Hi everyone. After a very long process of tuning I have come up with the following configuration, which seems to meet both of my goals.
I had to enable SQL mode and configure OpenSIPS as follows:
# USRLoc module
loadmodule "usrloc.so"
modparam("usrloc", "use_domain", 0)
modparam("usrloc", "location_cluster", 1)
modparam("usrloc", "cluster_mode", "full-sharing")
modparam("usrloc", "restart_persistency", "load-from-sql")
modparam("usrloc", "sql_write_mode", "write-back")
modparam("usrloc", "skip_replicated_db_ops", 1)
modparam("usrloc", "db_url","mysql://XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/DB")
modparam("usrloc", "timer_interval", 10)
# Clusterer Module
loadmodule "clusterer.so"
modparam("clusterer", "db_mode", 0)
modparam("clusterer", "my_node_id", 1)
modparam("clusterer", "my_node_info", "cluster_id=1,node_id=1,url=bin:10.22.20.23:3857,flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=2,url=bin:10.22.20.24:3857")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=3,url=bin:10.22.20.25:3857")
modparam("clusterer", "seed_fallback_interval", 5)
modparam("clusterer", "ping_interval", 3)
modparam("clusterer", "ping_timeout", 2500)
modparam("clusterer", "node_timeout", 10)
modparam("clusterer", "dispatch_jobs", 0)
# Proto BIN
loadmodule "proto_bin.so"
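(The snippet above loads proto_bin.so but omits the listener I use; presumably each node also needs a BIN socket matching its `url` in the clusterer topology. A sketch, with the address being the one from my node 1 definition:)

```
# BIN listener for cluster replication traffic (node 1's address)
socket=bin:10.22.20.23:3857
```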
So the idea is that OpenSIPS keeps a shared DB with all usrloc contacts, but also keeps all usrloc entries in memory. Upon restart, according to restart_persistency, it loads everything from the database. When I was trying to use 'sync-from-cluster' it did load up the usrloc DB, but EXTREMELY slowly.
In my opinion, setting 'timer_interval' to 10 and especially 'dispatch_jobs' to 0 helped A LOT. I'm still getting errors about a programming bug, but at least I can now restart my cluster more or less safely. And I only have between 15k and 20k usrloc entries, by the way.
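As a side note, when a node does come up with an incomplete usrloc view, a re-sync can also be requested manually via the usrloc module's MI command (assuming opensips-cli is set up to reach the MI interface):

```
# ask this node to re-sync its location data from the cluster
opensips-cli -x mi ul_cluster_sync
```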
When I was using just a 'working_mode_preset' without any database, I was running into all sorts of weird issues. For example, if you stop one of the three nodes, one of the remaining two nodes would lock up with messages about 'ul_cluster_sync already scheduled before'. You restart that node, and another one hangs the same way. I was also getting these sorts of errors:
[May 22 00:22:07] CRITICAL:core:fork_dynamic_tcp_process:
>>> trying to fork one more TCP worker but no free slots in the TCP table (size=32)
It seems you have hit a programming bug.
Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues
Often I could see this:
[May 22 00:20:09] CRITICAL:core:__ipc_send_job: blocking detected while sending job type 0[RPC] on 121 to proc id 27/1466603 [SIP receiver udp:10.22.20.23:5060]
[May 22 00:20:09] ERROR:core:signal_pkg_status: failed to trigger pkg stats for process 27
[May 22 00:20:09] CRITICAL:core:__ipc_send_job: blocking detected while sending job type 0[RPC] on 121 to proc id 27/1466603 [SIP receiver udp:10.22.20.23:5060]
[May 22 00:20:09] ERROR:core:signal_pkg_status: failed to trigger pkg stats for process 27
Maybe this is still relevant; I will check tomorrow.
Another warning, from the tm module, that was locking up the whole of OpenSIPS:
[May 22 00:29:19] WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled 35720 ms ago (now 191640 ms), delaying execution
[May 22 00:29:19] WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled 35820 ms ago (now 191740 ms), delaying execution
So it's just crazy; by the looks of it, clusterer without a DB behaves VERY unstably.
Will keep you posted with any updates.
Still getting these errors too:
[May 23 01:48:59] ERROR:usrloc:receive_binary_packets: failed to process binary packet!
[May 23 01:50:00] ERROR:usrloc:receive_ucontact_delete: contact 'sip:632201-708053615f42b0038500a01c18fa4840@10.10.60.204:5060;transport=TCP' not found: (ci: 'mU5xRXyG7-4BVoDD8HGCEQ..@216.93.246.122')
[May 23 01:50:00] ERROR:usrloc:receive_ucontact_delete: failed to process replication event. dom: 'location', aor: '632201'
[May 23 01:50:00] ERROR:usrloc:receive_binary_packets: failed to process binary packet!
[May 23 01:55:37] ERROR:usrloc:receive_ucontact_delete: contact 'sip:829921-73270291a2bc131e04d70a68c8375774@10.13.109.243:5060;transport=TCP' not found: (ci: '1i8vA-7zM4WRVuXTW1RCfQ..@10.152.0.4')
[May 23 01:55:37] ERROR:usrloc:receive_ucontact_delete: failed to process replication event. dom: 'location', aor: '829921'
[May 23 01:55:37] ERROR:usrloc:receive_binary_packets: failed to process binary packet!
[May 23 01:55:37] ERROR:usrloc:receive_ucontact_delete: contact 'sip:829921-7c1f8b4e386c7ae8eeeb89b456f2adf8@10.13.108.235:5060;transport=TCP' not found: (ci: 'nfAkhPUdRi0pBH8rNfQ5QA..@10.152.0.4')
[May 23 01:55:37] ERROR:usrloc:receive_ucontact_delete: failed to process replication event. dom: 'location', aor: '829921'
[May 23 01:55:37] ERROR:usrloc:receive_binary_packets: failed to process binary packet!
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Still an issue, please fix
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Still an issue
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Still an issue
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.
OpenSIPS version you are running
Describe the bug
When I restart an OpenSIPS that's running in a full-sharing registrar cluster of three nodes, I'm getting the following errors:
And I can see that not the whole database of usrloc entries gets applied to the freshly restarted OpenSIPS. Right now we're only using about 2600 records, but this is going to grow to up to 50000 records.
To Reproduce
1. Restart one of the OpenSIPS instances in the cluster
Expected behavior
I expect all servers to exchange usrloc information without any issues
Relevant System Logs
OpenSIPS configuration:
OS/environment information
Additional context
I've found a similar bug report for an old OpenSIPS 2.4.6 and tried to do the same: increased ping_interval to 3 and ping_timeout to 2500, but that didn't help at all.
I'm referring to this bug report: https://github.com/OpenSIPS/opensips/issues/1976