jeremyjpj0916 opened this issue 2 years ago (status: Open)
Summary

Have not read through the code, but was curious: if the Redis cluster goes unavailable, can the plugin be smart enough to just fall back on local in-memory-cache rate limiting, keeping the gateway continually operational until the Redis cluster is brought back up?
Hi @jeremyjpj0916, the plugin is currently configured to just ignore failures (which includes an unavailable Redis cluster) and allow requests to pass without any rate limiting.
Your concern is valid, but I am not sure falling back on instance-level rate limiting would be helpful in all cases. Could you describe whether this would help you, and in what way?
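As a minimal sketch of what this fail-open behavior typically looks like in a Lua-based Kong plugin (illustrative only; the function names, config fields, and key format below are assumptions, not this plugin's actual code):

```lua
-- Minimal fail-open sketch; not this plugin's real code.
local redis = require "resty.redis"

local _M = {}

-- Simplified counter key; a real plugin's key construction will differ
local function build_key()
  return "ratelimit:" .. kong.request.get_path()
end

local function increment_counter(conf)
  local red = redis:new()
  red:set_timeout(conf.redis_timeout or 1000)   -- assumed config field

  local ok, cerr = red:connect(conf.redis_host, conf.redis_port)
  if not ok then
    return nil, cerr
  end

  local count, ierr = red:incr(build_key())
  if not count then
    return nil, ierr
  end

  red:set_keepalive(10000, 100)                 -- return connection to pool
  return count
end

function _M.access(conf)
  local count, err = increment_counter(conf)
  if err then
    -- Fail open: on any Redis failure, log and let the request through
    kong.log.err("rate limiting unavailable, allowing request: ", err)
    return
  end

  if count > conf.limit then                    -- assumed limit field
    return kong.response.exit(429, { message = "API rate limit exceeded" })
  end
end

return _M
```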
@chirag-manwani It would be helpful to have a boolean toggle that, if Redis is unavailable, allows falling back to local rate-limiting mode (that way you and existing users can keep the current behavior, so this feature request isn't a breaking change). Rather than opening the flood gates entirely, you at least get some level of per-node protection on each Kong node.
Ex: I have a rate limit set to 50 TPS and I run 3 pods. In the event of a Redis failure, the worst case against my backend is per-node limit × pod count = 50 × 3 = 150 TPS, versus, say, a noisy client sending 1000 TPS against whatever the backend's SLA can support, which could take down a service. Currently on our own Kong instances we just do node-level rate limiting, which gets very inaccurate when we spawn 3-4 pods (the effective limit is always the configured TPS × pod count), but it's still better than nothing IMO.
I'm a fan of this plugin, though; I stumbled across it when deciding what we were going to do for our new architecture and was weighing OSS Kong rate limiting against the Enterprise version. Y'all's OSS plugin basically builds out a lot of the benefits their Enterprise plugin offers. More power to OSS! Thanks again for y'all's efforts thus far on it!
@jeremyjpj0916 Thank you for your kind words, we like to do our best to support the OSS community.
I think your suggestion is a good feature request. Let's try to refine it and figure out the implementation details.
Please add to this thread any other questions or implementation challenges you foresee.
Typing on an iPhone, so sorry for mistakes:
Based on the above ideas, it sounds like we need:

- A new fallback_to_local_policy boolean to enable the local policy in the event of a Redis failure.
- An optional fallback_local_override_limit numeric to apply a different local rate limit if you don't want to reuse the global rate number; if not set, it goes unused.
- Then some kind of variable, or just a smart baked-in approach, for retrying against Redis to confirm it's working again and switch the policy back to redis.

A hypothetical schema fragment for the two new fields is sketched just below.
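To make the proposal concrete, a fragment of the plugin's config schema for these fields could look like the following; the field names and types come from this thread, while the structure and defaults are assumptions, not the plugin's actual schema:

```lua
-- Hypothetical schema additions; field names are from this thread,
-- structure and defaults are assumed.
{ fallback_to_local_policy = {
    type = "boolean",
    default = false,  -- off by default so existing behavior is unchanged
  },
},
{ fallback_local_override_limit = {
    type = "number",
    required = false, -- when unset, the global rate number is reused locally
  },
},
```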
Regarding 1 (when to fall back): a failure to update the counter, or any failure for that matter, is fine as the trigger, but exactly when, or after how many failures within what time window, should the policy fall back to local rate limiting?
Regarding 2 (how to recover): after the policy has switched to the fallback, how often should the system check whether Redis is up again? Every n requests, or every n units of time?
A configurable count of failures; I would say default to 1 failure when this feature is enabled, but allow any integer.
I would say it should check every X seconds if possible, but an every-n-requests implementation would probably be easier and is probably not a bad approach. Setting that n to just below the bucket size basically means making a check before each bucket would get sent off to Redis, re-enabling that mode. (A rough sketch of this fallback/probe logic follows below.)
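A minimal sketch of the failure-count-plus-probe logic, assuming an nginx shared dict for per-node state; the config names (fallback_failure_threshold, redis_probe_every), the dict name, and the defaults are illustrative assumptions, not the plugin's actual implementation:

```lua
-- Rough per-node fallback state; all names here are assumptions.
-- Requires `lua_shared_dict rate_limit_fallback 1m;` in the nginx config.
local shm = ngx.shared.rate_limit_fallback

local FAILURES_KEY = "rl:redis_failures"
local REQUESTS_KEY = "rl:requests_in_fallback"

local function record_redis_failure()
  -- incr with an init value of 0 creates the key on the first failure
  shm:incr(FAILURES_KEY, 1, 0)
end

local function in_fallback_mode(conf)
  local failures = shm:get(FAILURES_KEY) or 0
  -- default threshold of 1 failure, as suggested above
  return failures >= (conf.fallback_failure_threshold or 1)
end

local function should_probe_redis(conf)
  -- While in fallback mode, try Redis again every n requests
  local seen = shm:incr(REQUESTS_KEY, 1, 0)
  return seen % (conf.redis_probe_every or 100) == 0
end

local function record_redis_recovery()
  -- Redis answered a probe: reset state so the redis policy resumes
  shm:set(FAILURES_KEY, 0)
  shm:set(REQUESTS_KEY, 0)
end
```

In the access handler, the plugin would then route to the local policy whenever in_fallback_mode() is true, attempt a real Redis call only when should_probe_redis() fires, call record_redis_recovery() once a probe succeeds, and call record_redis_failure() whenever a Redis operation errors out.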
Looks good to me. @jeremyjpj0916, would you be willing to contribute this feature?