OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org

[BUG] Clusterer sync upon restart of OpenSIPS, usrloc issues #3354

Closed Lt-Flash closed 1 month ago

Lt-Flash commented 6 months ago

OpenSIPS version you are running

version: opensips 3.4.4 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-FUTEX-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 0a7c2f2ed
main.c compiled on 09:31:12 Feb 29 2024 with gcc 9

Describe the bug

When I restart an OpenSIPS instance that is running in a cluster of three nodes in the full-sharing registrar scenario, I get the following errors:

CRITICAL:usrloc:receive_ucontact_insert:
>>> differring rlabels (49 vs. 48, ci: '449EAD1BA20F76C226C095C6BBD2D43DF6F25E22')
It seems you have hit a programming bug.
Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues

I can also see that not the whole database of usrloc entries gets applied to the restarted OpenSIPS. Right now we're only using about 2,600 records, but this is going to grow to up to 50,000 records.

To Reproduce

1. Restart one of the OpenSIPS instances in the cluster

Expected behavior

I expect all servers to exchange usrloc information without any issues

Relevant System Logs

OpenSIPS configuration:

# Clusterer Module
loadmodule "clusterer.so"
modparam("clusterer", "db_mode", 0)
modparam("clusterer", "my_node_info", "cluster_id=1,node_id=3,url=bin:10.x.x.3:3857,flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=1,url=bin:10.x.x.1:3857,flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=2,url=bin:10.x.x.2:3857,flags=seed")
modparam("clusterer", "seed_fallback_interval", 5)
modparam("clusterer", "ping_interval", 3)
modparam("clusterer", "ping_timeout", 2500)
modparam("clusterer", "node_timeout", 10)
modparam("clusterer", "sharing_tag", "vip1/1=active")
# USRLoc module
loadmodule "usrloc.so"
modparam("usrloc", "nat_bflag", "NAT")
modparam("usrloc", "working_mode_preset", "full-sharing-cluster")
modparam("usrloc", "use_domain", 0)
modparam("usrloc", "location_cluster", 1)

OS/environment information

Additional context

I've found a similar bug report for the old OpenSIPS 2.4.6 and tried the same workaround - increasing ping_interval to 3 and ping_timeout to 2500 - but that didn't help at all.

I'm referring to this bug report: https://github.com/OpenSIPS/opensips/issues/1976

NormB commented 6 months ago

Perhaps this is related: https://github.com/OpenSIPS/opensips/issues/3354

Lt-Flash commented 6 months ago

I don't have 'restart persistency' or 'matching mode' set at all, and 'use_domain' is set to 0, so my case is a bit different.

github-actions[bot] commented 6 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Lt-Flash commented 6 months ago

Still an issue. This happens many times a day and sometimes causes usrloc to lose registrations on one of the nodes, as it auto-removes them as Expired!

Lt-Flash commented 5 months ago

@bogdan-iancu Maybe you can help? @liviuchircu Liviu?

This is critical for us, we're losing registrations several times a day! Suddenly we're getting lots of messages like this:

[May  7 12:26:42] CRITICAL:usrloc:receive_ucontact_insert:
>>> differring rlabels (3114 vs. 3152, ci: '381f740599e94dc2ae2b2c559787de91')
It seems you have hit a programming bug.
Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues

And registrations start to drop, CPU usage goes up, calls are dropped, and OpenSIPS becomes unresponsive. Nothing helps; I've tried versions 3.3.9, 3.4.2, 3.4.4 and 3.4.5!

[screenshot]

Lt-Flash commented 5 months ago

Looks like I've found what causes this. I have event routes enabled for the usrloc events:

E_UL_CONTACT_INSERT
E_UL_CONTACT_UPDATE
E_UL_CONTACT_DELETE
E_UL_AOR_INSERT
E_UL_AOR_DELETE

If any of these event routes are enabled, then during a restart OpenSIPS loads only part of the usrloc entries and then hangs at 100% CPU with the following messages in the logs:

[screenshot of the log messages]

This happens every time there is a significant number of usrloc entries in a full-sharing cluster.
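
For reference, the event routes I had enabled look roughly like this in opensips.cfg (a simplified sketch; the route bodies here only log, my real routes do more work, and the $param() names should be checked against the usrloc docs for your version):

# script event routes for usrloc events (simplified sketch)
loadmodule "event_route.so"

event_route[E_UL_CONTACT_INSERT] {
    xlog("L_INFO", "usrloc contact insert: aor=$param(aor) uri=$param(uri)\n");
}

event_route[E_UL_CONTACT_DELETE] {
    xlog("L_INFO", "usrloc contact delete: aor=$param(aor) uri=$param(uri)\n");
}

event_route[E_UL_AOR_INSERT] {
    xlog("L_INFO", "usrloc AOR insert: aor=$param(aor)\n");
}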

As soon as I disable these routes, I can load 100k usrloc entries in 2-3 seconds without any issues:

[screenshot]

Hope this helps to resolve this issue!

Lt-Flash commented 5 months ago

@NormB do you have any event routes enabled for the USRLOC module?

Lt-Flash commented 5 months ago

Another thing to add: when I went crazy and set tcp_workers=256, this OpenSIPS server was able to slowly load the usrloc data with the event routes enabled, but that took about 30 seconds to complete.
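
For reference, that is just the global parameter in opensips.cfg (the value was an experiment, not a recommendation):

# global section of opensips.cfg
tcp_workers=256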

Lt-Flash commented 5 months ago

After disabling the event routes I can restart nodes and they successfully load the usrloc database. But these errors still occur, causing some differences between nodes:

CRITICAL:usrloc:receive_ucontact_insert:
>>> differring rlabels (3376 vs. 3388, ci: '0_3120809578@10.0.44.210')
It seems you have hit a programming bug.
Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues

NormB commented 5 months ago

Yes. Have recently disabled some of them.

Lt-Flash commented 5 months ago

Hi everyone. After a very long process of tuning, I have come up with the following configuration that seems to meet both of my goals:

  1. Sync is fast, not 10 records per second
  2. Restart completes without any issues

I had to enable SQL mode and configure OpenSIPS as follows:

# USRLoc module
loadmodule "usrloc.so"
modparam("usrloc", "use_domain", 0)
modparam("usrloc", "location_cluster", 1)
modparam("usrloc", "cluster_mode", "full-sharing")
modparam("usrloc", "restart_persistency", "load-from-sql")
modparam("usrloc", "sql_write_mode", "write-back")
modparam("usrloc", "skip_replicated_db_ops", 1)
modparam("usrloc", "db_url","mysql://XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/DB")
modparam("usrloc", "timer_interval", 10)

# Clusterer Module
loadmodule "clusterer.so"
modparam("clusterer", "db_mode", 0)
modparam("clusterer", "my_node_id", 1)
modparam("clusterer", "my_node_info", "cluster_id=1,node_id=1,url=bin:10.22.20.23:3857,flags=seed")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=2,url=bin:10.22.20.24:3857")
modparam("clusterer", "neighbor_node_info", "cluster_id=1,node_id=3,url=bin:10.22.20.25:3857")
modparam("clusterer", "seed_fallback_interval", 5)
modparam("clusterer", "ping_interval", 3)
modparam("clusterer", "ping_timeout", 2500)
modparam("clusterer", "node_timeout", 10)
modparam("clusterer", "dispatch_jobs", 0)

# Proto BIN
loadmodule "proto_bin.so"

So the idea is that OpenSIPS keeps a shared DB with all usrloc contacts while also keeping all usrloc entries in memory. Upon restart, according to 'restart_persistency', it loads everything from the database. When I tried 'sync-from-cluster' it did load the usrloc data, but EXTREMELY slowly. In my opinion, setting 'timer_interval' to 10 and especially 'dispatch_jobs' to 0 helped a lot. I'm still getting the 'programming bug' errors, but at least I can now restart my cluster more or less safely. And I only have between 15k and 20k usrloc entries, by the way.
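
For anyone debugging the same thing: with this setup I check cluster and sync state over MI, for example like this (assuming opensips-cli is configured to reach the MI interface; the commands are from the clusterer and usrloc module docs):

opensips-cli -x mi clusterer_list            # node states as seen by this instance
opensips-cli -x mi clusterer_list_topology   # cluster topology as seen by this instance
opensips-cli -x mi ul_cluster_sync           # request a usrloc resync from the cluster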

When I was using just 'working_mode_preset' without any database, I was running into all sorts of weird issues. For example, if you stop one of the three nodes, one of the remaining two nodes locks up with messages about 'ul_cluster_sync already scheduled before'. You restart that node, and another one hangs in the same way. I was also getting these sorts of errors:

[May 22 00:22:07] CRITICAL:core:fork_dynamic_tcp_process:
>>> trying to fork one more TCP worker but no free slots in the TCP table (size=32)

It seems you have hit a programming bug.
Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues

Often I could see this:

[May 22 00:20:09] CRITICAL:core:__ipc_send_job: blocking detected while sending job type 0[RPC] on 121  to proc id 27/1466603 [SIP receiver udp:10.22.20.23:5060]
[May 22 00:20:09] ERROR:core:signal_pkg_status: failed to trigger pkg stats for process 27
[May 22 00:20:09] CRITICAL:core:__ipc_send_job: blocking detected while sending job type 0[RPC] on 121  to proc id 27/1466603 [SIP receiver udp:10.22.20.23:5060]
[May 22 00:20:09] ERROR:core:signal_pkg_status: failed to trigger pkg stats for process 27

and maybe this is still relevant; I will check tomorrow.

Another warning, from the tm module, that appeared while the whole OpenSIPS was locking up:

[May 22 00:29:19] WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled 35720 ms ago (now 191640 ms), delaying execution
[May 22 00:29:19] WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled 35820 ms ago (now 191740 ms), delaying execution

So it's just crazy; clusterer without a DB behaves very unstably, by the looks of it.

Will keep you posted with any updates.

Lt-Flash commented 5 months ago

Still getting these errors too:

[May 23 01:48:59] ERROR:usrloc:receive_binary_packets: failed to process binary packet!
[May 23 01:50:00] ERROR:usrloc:receive_ucontact_delete: contact 'sip:632201-708053615f42b0038500a01c18fa4840@10.10.60.204:5060;transport=TCP' not found: (ci: 'mU5xRXyG7-4BVoDD8HGCEQ..@216.93.246.122')
[May 23 01:50:00] ERROR:usrloc:receive_ucontact_delete: failed to process replication event. dom: 'location', aor: '632201'
[May 23 01:50:00] ERROR:usrloc:receive_binary_packets: failed to process binary packet!
[May 23 01:55:37] ERROR:usrloc:receive_ucontact_delete: contact 'sip:829921-73270291a2bc131e04d70a68c8375774@10.13.109.243:5060;transport=TCP' not found: (ci: '1i8vA-7zM4WRVuXTW1RCfQ..@10.152.0.4')
[May 23 01:55:37] ERROR:usrloc:receive_ucontact_delete: failed to process replication event. dom: 'location', aor: '829921'
[May 23 01:55:37] ERROR:usrloc:receive_binary_packets: failed to process binary packet!
[May 23 01:55:37] ERROR:usrloc:receive_ucontact_delete: contact 'sip:829921-7c1f8b4e386c7ae8eeeb89b456f2adf8@10.13.108.235:5060;transport=TCP' not found: (ci: 'nfAkhPUdRi0pBH8rNfQ5QA..@10.152.0.4')
[May 23 01:55:37] ERROR:usrloc:receive_ucontact_delete: failed to process replication event. dom: 'location', aor: '829921'
[May 23 01:55:37] ERROR:usrloc:receive_binary_packets: failed to process binary packet!

github-actions[bot] commented 4 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Lt-Flash commented 4 months ago

Still an issue, please fix

github-actions[bot] commented 4 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Lt-Flash commented 4 months ago

Still an issue

github-actions[bot] commented 3 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

Lt-Flash commented 3 months ago

Still an issue

github-actions[bot] commented 2 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

github-actions[bot] commented 1 month ago

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.