jeremyjpj0916 opened this issue 2 years ago (status: Open)
Summary

Have not read through the code, but was curious: if the Redis cluster goes unavailable, can the plugin be smart enough to just fall back on local in-memory-cache rate limiting, keeping the gateway continually operational until the Redis cluster is brought back up?
Hi @jeremyjpj0916, the plugin is currently configured to just ignore failures (which includes an unavailable Redis cluster) and allow requests to pass without any rate limiting.
Your concern is valid, but I am not sure falling back on instance-level rate limiting would be helpful in all cases. Could you describe whether this would help you, and in what way?
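As a minimal sketch of what this fail-open behavior typically looks like in a Lua-based Kong plugin (illustrative only; the function names, config fields, and key format below are assumptions, not this plugin's actual code):

```lua
-- Minimal fail-open sketch; not this plugin's real code.
local redis = require "resty.redis"

local _M = {}

-- Simplified counter key; a real plugin's key construction will differ
local function build_key()
  return "ratelimit:" .. kong.request.get_path()
end

local function increment_counter(conf)
  local red = redis:new()
  red:set_timeout(conf.redis_timeout or 1000)   -- assumed config field

  local ok, cerr = red:connect(conf.redis_host, conf.redis_port)
  if not ok then
    return nil, cerr
  end

  local count, ierr = red:incr(build_key())
  if not count then
    return nil, ierr
  end

  red:set_keepalive(10000, 100)                 -- return connection to pool
  return count
end

function _M.access(conf)
  local count, err = increment_counter(conf)
  if err then
    -- Fail open: on any Redis failure, log and let the request through
    kong.log.err("rate limiting unavailable, allowing request: ", err)
    return
  end

  if count > conf.limit then                    -- assumed limit field
    return kong.response.exit(429, { message = "API rate limit exceeded" })
  end
end

return _M
```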
@chirag-manwani It would be helpful to have a boolean toggle that, if Redis is unavailable, allows falling back to local rate-limiting mode (that way you and existing users can keep the current behavior, so this feature request isn't a breaking change). Rather than opening the flood gates entirely, you at least get some level of per-node protection on each Kong node.
Ex: I have a rate limit set to 50 TPS and I run 3 pods. In the event of a Redis failure, the worst case against my backend is per-node limit × pod count = 50 × 3 = 150 TPS, versus, say, a noisy client sending 1000 TPS against whatever the backend's SLA can support, which could take down a service. Currently on our own Kong instances we just do node-level rate limiting, which gets very inaccurate when we spawn 3-4 pods (the effective limit is always the configured TPS × pod count), but it's still better than nothing IMO.
I'm a fan of this plugin, though; I stumbled across it when deciding what we were going to do for our new architecture and was weighing OSS Kong rate limiting against the Enterprise version. Y'all's OSS plugin basically builds out a lot of the benefits their Enterprise plugin offers. More power to OSS! Thanks again for y'all's efforts thus far on it!
@jeremyjpj0916 Thank you for your kind words, we like to do our best to support the OSS community.
I think your suggestion is a good feature request. Let's try to refine it and figure out the implementation details.
Please add to this thread any other questions or implementation challenges you foresee.
Typing on an iPhone, so sorry for mistakes:
Based on the above ideas, it sounds like we need:

- A new fallback_to_local_policy boolean to enable the local policy in the event of a Redis failure.
- An optional fallback_local_override_limit numeric to apply a different local rate limit if you don't want to reuse the global rate number; if not set, it goes unused.
- Then some kind of variable, or just a smart baked-in approach, for retrying against Redis to confirm it's working again and switch the policy back to redis.

A hypothetical schema fragment for the two new fields is sketched just below.
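To make the proposal concrete, a fragment of the plugin's config schema for these fields could look like the following; the field names and types come from this thread, while the structure and defaults are assumptions, not the plugin's actual schema:

```lua
-- Hypothetical schema additions; field names are from this thread,
-- structure and defaults are assumed.
{ fallback_to_local_policy = {
    type = "boolean",
    default = false,  -- off by default so existing behavior is unchanged
  },
},
{ fallback_local_override_limit = {
    type = "number",
    required = false, -- when unset, the global rate number is reused locally
  },
},
```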
Regarding 1 (when to fall back): a failure to update the counter, or any failure for that matter, is fine as the trigger, but exactly when, or after how many failures within what time window, should the policy fall back to local rate limiting?
Regarding 2 (how to recover): after the policy has switched to the fallback, how often should the system check whether Redis is up again? Every n requests, or every n units of time?
A configurable count of failures; I would say default to 1 failure when this feature is enabled, but allow any integer.
I would say it should check every X seconds if possible, but an every-n-requests implementation would probably be easier and is probably not a bad approach. Setting that n to just below the bucket size basically means making a check before each bucket would get sent off to Redis, re-enabling that mode. (A rough sketch of this fallback/probe logic follows below.)
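A minimal sketch of the failure-count-plus-probe logic, assuming an nginx shared dict for per-node state; the config names (fallback_failure_threshold, redis_probe_every), the dict name, and the defaults are illustrative assumptions, not the plugin's actual implementation:

```lua
-- Rough per-node fallback state; all names here are assumptions.
-- Requires `lua_shared_dict rate_limit_fallback 1m;` in the nginx config.
local shm = ngx.shared.rate_limit_fallback

local FAILURES_KEY = "rl:redis_failures"
local REQUESTS_KEY = "rl:requests_in_fallback"

local function record_redis_failure()
  -- incr with an init value of 0 creates the key on the first failure
  shm:incr(FAILURES_KEY, 1, 0)
end

local function in_fallback_mode(conf)
  local failures = shm:get(FAILURES_KEY) or 0
  -- default threshold of 1 failure, as suggested above
  return failures >= (conf.fallback_failure_threshold or 1)
end

local function should_probe_redis(conf)
  -- While in fallback mode, try Redis again every n requests
  local seen = shm:incr(REQUESTS_KEY, 1, 0)
  return seen % (conf.redis_probe_every or 100) == 0
end

local function record_redis_recovery()
  -- Redis answered a probe: reset state so the redis policy resumes
  shm:set(FAILURES_KEY, 0)
  shm:set(REQUESTS_KEY, 0)
end
```

In the access handler, the plugin would then route to the local policy whenever in_fallback_mode() is true, attempt a real Redis call only when should_probe_redis() fires, call record_redis_recovery() once a probe succeeds, and call record_redis_failure() whenever a Redis operation errors out.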
Looks good to me. @jeremyjpj0916, would you be willing to contribute this feature?