OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org
Other
1.28k stars 581 forks source link

[BUG] Dialog profiling issues when CacheDB latency present #2945

Closed Matt-Hapner closed 1 year ago

Matt-Hapner commented 2 years ago

OpenSIPS version you are running

version: opensips 2.4.11 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 6d151d121
main.c compiled on 21:14:51 Aug 25 2022 with gcc 7

Describe the bug

We configured our dialog module with a Redis CacheDB backend in order to use the distributed dialog profiling feature. The Redis cluster configured was located on a box external to the instance on which OpenSIPS was running. In order to test the performance of the system, we introduced some chaos on the instance which added latency to the outgoing network port of the Redis cluster (6379) at the kernel level. The test in question added about 750ms of latency on all traffic to that port for a few minutes and was executed during a period with a substantial amount of load already on OpenSIPS. Every dialog processed during the test had one or more profiles being recorded via set_dlg_profile. The result of the test was that OpenSIPs ran out of shared memory within a matter of a few seconds, the load shot up, and an unprocessable backlog of transactions quickly developed. We have run similar tests in the past where we completely sever the connection to an external CacheDB backend and OpenSIPs has handled them much better. Hopefully, this bug report can assist in making the feature more resilient to latency issues.

To Reproduce

  1. Configure OpenSIPS dialog module
  2. Point the cachedb_url param to an external Redis store
  3. Add set_dlg_profile calls / make use of the profiling feature programmatically
  4. Start OpenSIPS
  5. Start a reasonable amount of traffic
  6. Add network latency on outgoing traffic to the Redis port (at the kernel level)
  7. Inspect OpenSIPS stability issues

Expected behavior

OpenSIPS would temporarily disable a feature dependent on an external CacheDB store if that store is not responsive.

Relevant System Logs

ERROR:dialog:set_dlg_profile: failed to link dialog profile 'XXXXX', ci: c91c188f-7b28-4bd7-8575-8acb96e097c1
ERROR:dialog:link_dlg_profile: cannot insert profile into CacheDB
ERROR:dialog:get_profile_size: cannot fetch profile from CacheDB
ERROR:dialog:get_profile_size: error while fetching cachedb key
ERROR:dialog:w_set_dlg_profile: failed to set profile
ERROR:dialog:destroy_linker: cannot remove profile from CacheDB
ERROR:dialog:ensure_leg_array: oom
ERROR:dialog:new_dlg_val: no more shm mem
ERROR:dialog:store_dlg_value_unsafe: failed to create new dialog value
ERROR:dialog:pv_set_dlg_val: failed to store dialog values

OS/environment information Operating System: amazon-linux OpenSIPS installation: other relevant information:

Additional context

Matt-Hapner commented 2 years ago

tagging @bogdan-iancu for visibility

liviuchircu commented 2 years ago

Hi, @Matt-Hapner and thank you for the report! Let's first clarify some missing pieces regarding your test scenario, so we better emphasize the core issue:

Matt-Hapner commented 2 years ago

Hi @liviuchircu, sure thing - thanks for the thorough analysis so far. I'll display the information for our entire system as well as per the individual instances...

Some of the queries to Redis would've exceeded the 1000ms timeout as there was some "jitter" in the tests (i.e. the latency per request ranged from 750ms - 1250ms). Would it be worthwhile to develop some sort of circuit breaker feature that would disable pulling new data from the CacheDB backend when the timeout is exceeded a certain number of times?

liviuchircu commented 2 years ago

I think this "circuit breaker" feature can potentially create more problems than it solves. For example, imagine if the feature would activate at some point for, say 30 seconds, after which the Redis is "re-plugged" in an attempt to see if it works better (at some point, you have to re-plug it anyway). Meanwhile, several hundred dialogs have closed, but Redis was "temporarily disabled", so the "subtract" operations were skipped for the Redis call profiles. Now you have hanging profile counters in Redis... good luck debugging those!

My advice would be to use the native clustering support in order to share call profiles across the OpenSIPS nodes of your platform. There are a multitude of advantages:

github-actions[bot] commented 2 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

github-actions[bot] commented 1 year ago

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.