OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org

[FEATURE] proper rate limit replication #2900

Open achalkov opened 2 years ago

achalkov commented 2 years ago

Hello, team! I have an OpenSIPS installation with two frontend nodes that handle inbound connections (like an SBC), and I want to add a global rate limit on inbound CPS that applies to the installation as a whole, without any shared backend such as a standalone cachedb (Redis etc.) to store the values. This is a fairly simple task if you only want rate limits on each OpenSIPS node separately, but things become much more complicated if you want to share the limits across the cluster. Right now we have the ability to replicate pipes. Let's say we have a pipe RL_GLOBAL with a time window of 1 second (timer_interval = 1), which should count the calls received per second and reject calls after a limit of 10 calls within that second. Like this:

    if (is_method("INVITE")) {
        if (!rl_check("RL_GLOBAL", 10, "TAILDROP")) {
            sl_send_reply(503, "Service Unavailable");
            exit;
        }
    }

On a single node this works perfectly: if we receive 12 CPS, the last 2 calls are dropped. Now let's replicate pipes across a cluster of 2 nodes and use the same code (above) on both of them. When we send 10 calls per second to the 1st node, it constantly replicates the pipe RL_GLOBAL to the 2nd node with counter = 10. If at the same time we send another 10 CPS to the 2nd node, it rejects all of them (its pipe already has counter = 10 inside), then increments the counter in its own local RL_GLOBAL by 10 and replicates it back to the 1st node. The 1st node receives the pipe with this non-zero counter and applies it, rejecting all received calls, then replicates it to node 2, which also rejects all calls, and so on. This all leads to a situation where, after the 1st second of the same load on the 2 nodes using the same pipe, both nodes reject all incoming calls.

I'm not 100% sure I'm totally right about how the mechanics described above work, but we tested it: sending 10 CPS to the 1st node and 1 CPS to the 2nd, the result was that all calls were rejected on the 2nd node and every 10th call was rejected on the 1st (with 10 CPS to the 1st and 2 CPS to the 2nd, the 1st rejects 2 of 10 calls every second and the 2nd rejects all of them, etc.), so I think I'm not too far from the truth. By this logic, pipe replication makes no sense here, because mutual replication inside the cluster breaks the counters inside the pipes.
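The feedback loop described above can be sketched with a toy simulation (plain Python, not OpenSIPS code; the node names and the exact replication rule are illustrative assumptions based on the behaviour observed):

```python
LIMIT = 10

class Node:
    """Toy model of a ratelimit pipe whose ABSOLUTE counter value is
    replicated between nodes (the behaviour described above)."""
    def __init__(self, name):
        self.name = name
        self.counter = 0

    def rl_check(self):
        # TAILDROP: every check bumps the counter; calls over the
        # limit within the window are rejected.
        self.counter += 1
        return self.counter <= LIMIT

    def receive_replicated(self, counter):
        # Replication installs the peer's absolute counter value
        # into the local pipe.
        self.counter = max(self.counter, counter)

node1, node2 = Node("node1"), Node("node2")

# Within one timer_interval: 10 CPS hit node1, then 10 CPS hit node2.
accepted1 = sum(node1.rl_check() for _ in range(10))
node2.receive_replicated(node1.counter)   # node2's pipe jumps to 10
accepted2 = sum(node2.rl_check() for _ in range(10))
node1.receive_replicated(node2.counter)   # node1's pipe jumps to 20

print(accepted1, accepted2)               # 10 0
print(node1.counter, node2.counter)       # 20 20
```

Node1 accepts all 10 of its calls, node2 rejects all 10 of its own, and both pipes end the second at 20, so from the next replication onward both nodes reject everything: the absolute-counter replication double-counts the same traffic.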

Describe the solution you'd like

It would be great to have the ability to define "shared" pipes. How it should work (at least in my head): you can define a pipe that is initialized cluster-wide. When a call is checked against this pipe, the module increments the counter locally and then sends a message via clusterer that increments the counter for the same pipe on every other node in the cluster. That way, every node's pipe holds its own local number of calls checked against the pipe plus the sum of the calls checked against it on all other nodes in the cluster, within the given time frame (the counter is reset after timer_interval, exactly as current pipes work now). This would give us the ability to set rate limits globally for the cluster, tracked more or less accurately and without counter conflicts.
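The difference to the current behaviour is that only a +1 delta is broadcast per check, never the absolute counter. A minimal sketch of that idea (plain Python, not OpenSIPS code; the class and its peer-linking are hypothetical illustrations):

```python
LIMIT = 10

class SharedPipe:
    """Sketch of the proposed 'shared' pipe: each check increments the
    local counter and broadcasts a +1 delta to the peers, instead of
    replicating the absolute counter value."""
    def __init__(self):
        self.counter = 0
        self.peers = []

    def link(self, other):
        self.peers.append(other)
        other.peers.append(self)

    def rl_check(self):
        self.counter += 1
        for peer in self.peers:       # clusterer broadcast of a +1 delta
            peer.counter += 1
        return self.counter <= LIMIT

node1, node2 = SharedPipe(), SharedPipe()
node1.link(node2)

# Interleave 6 calls per node within the same timer_interval.
results = []
for _ in range(6):
    results.append(node1.rl_check())
    results.append(node2.rl_check())

print(sum(results))                   # 10 of the 12 calls accepted
```

With deltas, each call is counted exactly once cluster-wide, so 12 calls against a limit of 10 yields 10 accepted and 2 rejected regardless of which node they arrived on.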

Implementation

- Component: ratelimit module, clusterer module
- Type: new type of pipes, maybe initialized in module params
- Name: shared pipe

**Additional context**

Actually, I tried to rewrite the rate-limit logic manually in the routing script:

On an incoming call, store or increment a counter in cachedb_local (using cache_add()) with a 1-second lifetime, then send a request to the cluster (via cluster_broadcast_req()) on which the other nodes increment the same counter in their caches; finally, check this counter against the limit and drop the call if it is greater than or equal to it. This is currently blocked by #2899. It should work once #2899 is solved, but it's not an optimal approach anyway.

github-actions[bot] commented 2 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

achalkov commented 2 years ago

not stale