[BUG] Dialog profiling issues when CacheDB latency present

Matt-Hapner commented 2 years ago

OpenSIPS version you are running

version: opensips 2.4.11 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 6d151d121
main.c compiled on 21:14:51 Aug 25 2022 with gcc 7

Describe the bug

We configured our dialog module with a Redis CacheDB backend in order to use the distributed dialog profiling feature. The Redis cluster configured was located on a box external to the instance on which OpenSIPS was running. In order to test the performance of the system, we introduced some chaos on the instance which added latency to the outgoing network port of the Redis cluster (6379) at the kernel level. The test in question added about 750ms of latency on all traffic to that port for a few minutes and was executed during a period with a substantial amount of load already on OpenSIPS. Every dialog processed during the test had one or more profiles being recorded via set_dlg_profile. The result of the test was that OpenSIPs ran out of shared memory within a matter of a few seconds, the load shot up, and an unprocessable backlog of transactions quickly developed. We have run similar tests in the past where we completely sever the connection to an external CacheDB backend and OpenSIPs has handled them much better. Hopefully, this bug report can assist in making the feature more resilient to latency issues.

To Reproduce

Configure OpenSIPS dialog module
Point the cachedb_url param to an external Redis store
Add set_dlg_profile calls / make use of the profiling feature programmatically
Start OpenSIPS
Start a reasonable amount of traffic
Add network latency on outgoing traffic to the Redis port (at the kernel level)
Inspect OpenSIPS stability issues

Expected behavior

OpenSIPS would temporarily disable a feature dependent on an external CacheDB store if that store is not responsive.

Relevant System Logs

ERROR:dialog:set_dlg_profile: failed to link dialog profile 'XXXXX', ci: c91c188f-7b28-4bd7-8575-8acb96e097c1
ERROR:dialog:link_dlg_profile: cannot insert profile into CacheDB
ERROR:dialog:get_profile_size: cannot fetch profile from CacheDB
ERROR:dialog:get_profile_size: error while fetching cachedb key
ERROR:dialog:w_set_dlg_profile: failed to set profile
ERROR:dialog:destroy_linker: cannot remove profile from CacheDB
ERROR:dialog:ensure_leg_array: oom
ERROR:dialog:new_dlg_val: no more shm mem
ERROR:dialog:store_dlg_value_unsafe: failed to create new dialog value
ERROR:dialog:pv_set_dlg_val: failed to store dialog values

OS/environment information Operating System: amazon-linux OpenSIPS installation: other relevant information:

Additional context

Matt-Hapner commented 2 years ago

tagging @bogdan-iancu for visibility

liviuchircu commented 2 years ago

Hi, @Matt-Hapner and thank you for the report! Let's first clarify some missing pieces regarding your test scenario, so we better emphasize the core issue:

what amount of SHM did you configure your OpenSIPS with? How many UDP workers were you using? And can you suggest a ballpark CPS (calls-per-second) amount that you ran through it during your test which caused the OpenSIPS to end up in the "oom" condition?
did you happen to monitor the values of the inuse_transactions and active_dialogs statistics during your test? If yes, what peak values did they reach? Knowing these values might hint at what occupied the SHM pool: at the end of the day, it could just be transactions piling up, due to poor OpenSIPS configuration in preparation for the stress test
you mention that for your test, you expected OpenSIPS to "... disable a store... if that store is not responsive". However, this contradicts with your testing setup where you say "... the test in question added about 750ms of latency". So the Redis was always responsive, as long as the connection/query timeout was within, say 1 second (which it should be, IIRC). Finally, I still expect the CacheDB layer to be blazing fast if the 6379 port were to be dead (severed), just as in the previous releases, but that's not what you were testing here.

Matt-Hapner commented 2 years ago

Hi @liviuchircu, sure thing - thanks for the thorough analysis so far. I'll display the information for our entire system as well as per the individual instances...

Statistics:
- SHM: 1024mb/instance
- UDP workers: 8/instance
- ballpark CPS: ~300 total (~50/instance - one instance peaked at 296 CPS when most of the other instances had pegged CPUs)
Yes, we can look back on both of those values.
- peak inuse_transactions: Our load balancing layer had ~200k (~50k/instance). The load balancing instances acts as a proxy to our stateful OpenSIPs instances (where the dialogs are processed). So each stateful server would have a smaller portion of that load. However, it still is pretty clear the transactions were piling up as you state. For reference our load balancing layer reported ~10k transactions prior to the test.
- peak active_dialogs: ~75k total (~7k/instance)
That's correct Redis was always responsive. I should've been more specific with my wording. Yes, it appears that we have both the connect_timeout and query_timeout params of the cachedb_redis module configured at 1000ms. This test certainly reveals that we should lower those values. However, we still thought raising these findings to you could be helpful.

Some of the queries to Redis would've exceeded the 1000ms timeout as there was some "jitter" in the tests (i.e. the latency per request ranged from 750ms - 1250ms). Would it be worthwhile to develop some sort of circuit breaker feature that would disable pulling new data from the CacheDB backend when the timeout is exceeded a certain number of times?

liviuchircu commented 2 years ago

I think this "circuit breaker" feature can potentially create more problems than it solves. For example, imagine if the feature would activate at some point for, say 30 seconds, after which the Redis is "re-plugged" in an attempt to see if it works better (at some point, you have to re-plug it anyway). Meanwhile, several hundred dialogs have closed, but Redis was "temporarily disabled", so the "subtract" operations were skipped for the Redis call profiles. Now you have hanging profile counters in Redis... good luck debugging those!

My advice would be to use the native clustering support in order to share call profiles across the OpenSIPS nodes of your platform. There are a multitude of advantages:

asynchronous sharing (replication) of call counters, keeping your PDD as low as possible
no risk of leaked counters due to temporary outages of the DB
[optional] no additional services to maintain (Redis)

github-actions[bot] commented 2 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

github-actions[bot] commented 1 year ago

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.

OpenSIPS / opensips

[BUG] Dialog profiling issues when CacheDB latency present #2945