Closed: Matt-Hapner closed this issue 1 year ago
tagging @bogdan-iancu for visibility
Hi, @Matt-Hapner, and thank you for the report! Let's first clarify some missing pieces regarding your test scenario, so we better emphasize the core issue: did you monitor the `inuse_transactions` and `active_dialogs` statistics during your test? If yes, what peak values did they reach? Knowing these values might hint at what occupied the SHM pool: at the end of the day, it could just be transactions piling up, due to poor OpenSIPS configuration in preparation for the stress test.
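For reference, both statistics can be read at runtime over the MI interface; a minimal sketch using `opensips-cli`, assuming a 3.x deployment where it is installed:

```
# fetch the transaction and dialog counters (names per the tm and dialog modules)
opensips-cli -x mi get_statistics inuse_transactions active_dialogs
```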
Also, were the `connect_timeout` and `query_timeout` of the `cachedb_redis` module set below 1 second (which they should be, IIRC)? Finally, I still expect the CacheDB layer to be blazing fast if the 6379 port were to be dead (severed), just as in the previous releases, but that's not what you were testing here.

Hi @liviuchircu, sure thing - thanks for the thorough analysis so far. I'll display the information for our entire system as well as per the individual instances:
- `inuse_transactions`: our load balancing layer had ~200k (~50k/instance). The load balancing instances act as proxies to our stateful OpenSIPS instances (where the dialogs are processed), so each stateful server carried a smaller portion of that load. However, it is still pretty clear the transactions were piling up, as you state. For reference, our load balancing layer reported ~10k transactions prior to the test.
- `active_dialogs`: ~75k total (~7k/instance)
- We had the `connect_timeout` and `query_timeout` params of the `cachedb_redis` module configured at 1000 ms. This test certainly reveals that we should lower those values; however, we still thought raising these findings to you could be helpful.

Some of the queries to Redis would have exceeded the 1000 ms timeout, as there was some "jitter" in the tests (i.e. the latency per request ranged from 750 ms to 1250 ms). Would it be worthwhile to develop some sort of circuit breaker feature that would disable pulling new data from the CacheDB backend when the timeout is exceeded a certain number of times?
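For illustration, a minimal sketch of tightening those values; the parameter names are the ones quoted above, while the URL and timeouts are placeholders to be tuned per deployment:

```
loadmodule "cachedb_redis.so"

# placeholder Redis address; both timeouts are in milliseconds
modparam("cachedb_redis", "cachedb_url", "redis://10.0.0.5:6379/")
modparam("cachedb_redis", "connect_timeout", 200)
modparam("cachedb_redis", "query_timeout", 200)
```

Keeping both values well below the SIP retransmission timers (e.g. 200-500 ms) bounds how long a worker process can block on a slow Redis.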
I think this "circuit breaker" feature can potentially create more problems than it solves. For example, imagine if the feature would activate at some point for, say 30
seconds, after which the Redis is "re-plugged" in an attempt to see if it works better (at some point, you have to re-plug it anyway). Meanwhile, several hundred dialogs have closed, but Redis was "temporarily disabled", so the "subtract" operations were skipped for the Redis call profiles. Now you have hanging profile counters in Redis... good luck debugging those!
My advice would be to use the native clustering support in order to share call profiles across the OpenSIPS nodes of your platform, which brings a multitude of advantages; a minimal sketch of such a setup follows below.
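Sketched under the assumption of an OpenSIPS 3.x deployment (node ID, addresses and the DB URL are placeholders; check the `clusterer` and `dialog` module docs for your exact release):

```
# placeholder BIN listener; the BIN transport carries replication traffic
socket = bin:10.0.0.1:5555

loadmodule "proto_bin.so"
loadmodule "clusterer.so"
loadmodule "dialog.so"

modparam("clusterer", "current_id", 1)    # unique ID per node
modparam("clusterer", "db_url", "mysql://opensips:opensipsrw@localhost/opensips")

# replicate dialog profiles within cluster 1, instead of an external CacheDB
modparam("dialog", "profile_replication_cluster", 1)
```

With such a setup, the profile counters live in shared memory on each node and are synchronized over BIN, so a slow external store is taken out of the call path entirely.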
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.
OpenSIPS version you are running
Describe the bug
We configured our dialog module with a Redis CacheDB backend in order to use the distributed dialog profiling feature. The Redis cluster was located on a box external to the instance on which OpenSIPS was running. To test the performance of the system, we introduced some chaos on the instance, adding latency to the outgoing network port of the Redis cluster (6379) at the kernel level. The test added about 750 ms of latency on all traffic to that port for a few minutes and was executed during a period with a substantial amount of load already on OpenSIPS. Every dialog processed during the test had one or more profiles recorded via `set_dlg_profile`. The result was that OpenSIPS ran out of shared memory within a few seconds, the load shot up, and an unprocessable backlog of transactions quickly developed. We have run similar tests in the past where we completely severed the connection to an external CacheDB backend, and OpenSIPS handled them much better. Hopefully, this bug report can assist in making the feature more resilient to latency issues.
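For context, a minimal sketch of this kind of configuration; the address and the profile name are placeholders, and the "/s" shared-profile marker is taken from the 3.x dialog module docs, so verify it against your version:

```
loadmodule "cachedb_redis.so"
loadmodule "dialog.so"

# external Redis used as the shared profile backend (placeholder address)
modparam("dialog", "cachedb_url", "redis://10.0.0.5:6379/")

# "/s" marks the profile as shared through the CacheDB backend
modparam("dialog", "profiles_no_value", "external_calls/s")
```

Each dialog is then tagged with `set_dlg_profile("external_calls");` from the routing script, matching the scenario above.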
To Reproduce
- Configure the dialog module with an external Redis CacheDB backend, with substantial load already on the system
- Add ~750 ms of latency toward the Redis port (6379) at the kernel level (a sketch follows below)
- Add `set_dlg_profile` calls / make use of the profiling feature programmatically
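A sketch of the kind of kernel-level latency injection described above, assuming Linux `tc`/netem with `eth0` as the egress interface (interface, delay and port should be adapted):

```
# steer only traffic toward TCP port 6379 through a netem-delayed band
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 750ms
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
    match ip dport 6379 0xffff flowid 1:3
```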
Expected behavior
OpenSIPS would temporarily disable a feature that depends on an external CacheDB store if that store becomes unresponsive.
Relevant System Logs
OS/environment information
Operating System: amazon-linux
OpenSIPS installation:
other relevant information:
Additional context