3scale / APIcast

3scale API Gateway

Rate limit policy does not take into account changes in the config when using Redis #667

Closed: davidor closed this issue 6 years ago

davidor commented 6 years ago

The rate limit policy has some limitations when configured to use Redis.

For example, changes to the configured limits or time windows are not reflected in the counters and TTLs that already exist in Redis.

In order to solve these problems, we need to store more information in Redis and perform some checks every time APIcast boots with the rate limit policy config. For example, the policy needs to check the existing limits and see whether it needs to increase the TTL of the limiting keys, decrease it, remove them, etc.

y-tabata commented 6 years ago

If it's acceptable to reset the rate limits when the gateway restarts, we can add a timestamp or a random string to the key during initialization.

davidor commented 6 years ago

@y-tabata I think it will depend a lot on the use case.

For limits with small periods (seconds, minutes) it might be acceptable. However, 3scale supports big periods for limits (like a month, or a year). In that case, resetting the counters and the limiters is not acceptable.

y-tabata commented 6 years ago

Regarding connections and leaky bucket, it is not a problem to reset the rate limits, because when the gateway restarts there are no connections.

So we need to focus on the fixed window only. There are 5 cases for each parameter, as illustrated in the following examples.

window:

| old window | current | new window | expected window | reset? |
|------------|---------|------------|-----------------|--------|
| 10         | 5       | 3          | 3               | yes    |
| 10         | 5       | 5          | 5               | yes    |
| 10         | 5       | 7          | 7               | yes    |
| 10         | 5       | 10         | 5               | no     |
| 10         | 5       | 15         | 15              | yes    |

count:

| old count | current | new count | expected count | reset? |
|-----------|---------|-----------|----------------|--------|
| 10        | 5       | 3         | 3              | yes    |
| 10        | 5       | 5         | 5              | yes    |
| 10        | 5       | 7         | 7              | yes    |
| 10        | 5       | 10        | 5              | no     |
| 10        | 5       | 15        | 15             | yes    |

Regarding the window: if the new window is shorter than the current window, we have to reset the rate limit. If the new window is longer than the current window and the window value was changed, we can assume the user restarted the gateway in order to reset the rate limit; it would be strange for the running environment to keep following the current window even though the user changed the window value, so we have to reset the rate limit. But if the new window equals the old window, we cannot assume the user restarted the gateway in order to reset the rate limit, so we don't reset it. The same is true for the count. So we store the count and the window in Redis, and if the new window equals the old window and the new count equals the old count, we don't reset the rate limits; otherwise, we do. A sketch of this check is shown below.
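A minimal sketch of that check, assuming the policy stores the configured count and window in Redis next to each limiting key (the `:count`/`:window` key suffixes and the lua-resty-redis wiring are illustrative, not the actual policy code):

```lua
-- Sketch only: `red` is assumed to be an already-connected lua-resty-redis instance.
local function maybe_reset(red, limit_key, new_count, new_window)
  local old_count = tonumber(red:get(limit_key .. ':count'))
  local old_window = tonumber(red:get(limit_key .. ':window'))

  if old_count == new_count and old_window == new_window then
    return false -- same config as before: keep the current window and counter
  end

  -- config changed (or first boot): reset the counter and store the new parameters
  red:del(limit_key)
  red:set(limit_key .. ':count', new_count)
  red:set(limit_key .. ':window', new_window)
  return true
end
```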

mikz commented 6 years ago

@y-tabata you are thinking that the key is going to be known without the request.

That is not going to be the case all the time. We plan to add templating to the key, so you can for example rate limit by {{ ngx.var.remote_addr }} to use client IP address. That value would be interpolated in the rewrite phase - the key would not be known until the evaluation.

I think it is first important to define what we want - business logic wise - and then proceed to the technical solution.

"connection"

https://github.com/openresty/lua-resty-limit-traffic/blob/ef080731b80fe00a385a4aed8a7f7f575d01c152/lib/resty/limit/count.lua#L47 uses decrements to maintain the connection limit. That means on change of the limit the current window would first have to expire.

I understand this might be confusing and probably counterproductive. Example scenario: the limit is set too high for too long a window and, due to an attack, it needs to be lowered; having to wait for the window to expire would be a pretty limiting factor, preventing a quick reaction to the attack.

Proposal: use increments instead of decrements to maintain the current connection count, and apply the limit afterwards.
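A rough sketch of what the increment model could look like against a plain Redis counter (illustrative only, not the lua-resty-limit-traffic implementation); the same idea applies to the fixed window limiter below:

```lua
-- Increment model sketch: store only the usage, keep the limit in the config,
-- and compare them on every request. Changing the limit then takes effect
-- immediately, without waiting for the current window to expire.
local function incoming(red, key, limit, window)
  local used, err = red:incr(key)  -- current usage; the key is created at 1 on the first hit
  if not used then
    return nil, err
  end

  if used == 1 then
    red:expire(key, window)        -- start the window on the first request
  end

  if used > limit then
    return nil, "rejected"         -- over the (possibly just-lowered) limit
  end

  return limit - used              -- remaining quota
end
```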

"leaky bucket"

This limiter is configured with a per-second rate limit and a burst. I think it is OK to ignore this one, as its very small window means any change would take effect almost instantly.

"fixed window"

https://github.com/openresty/lua-resty-limit-traffic/blob/ef080731b80fe00a385a4aed8a7f7f575d01c152/lib/resty/limit/count.lua#L47 - it is implemented the same way as the connection rate limiter: it decrements from the configured limit value.

Switching to maintaining the current value, and applying the limit outside of the stored counter, would solve the problem.

Summary

I think we should investigate the cost of changing to the increment model and see if we can push those changes upstream. If not, we could simply fork the library.

Also, there is already some discussion about changing limits on the fly: https://github.com/openresty/lua-resty-limit-traffic/issues/23

y-tabata commented 6 years ago

@mikz Thank you for your advice.

so you can for example rate limit by {{ ngx.var.remote_addr }} to use client IP address.

If we use the client IP address, the scope will be the client. That templating will be useful when we implement a client scope in this policy in the future, but for the service or the global scope, we'd like to use other templating.

"connection"

https://github.com/openresty/lua-resty-limit-traffic/blob/ef080731b80fe00a385a4aed8a7f7f575d01c152/lib/resty/limit/count.lua#L47 uses decrements to maintain the connection limit. That means on change of the limit the current window would first have to expire.

The link is the "fixed window" link. I think "connection" already uses the increment model.

I think we should investigate the cost of changing to the increment model and see if we can push those changes upstream. If not, we could simply fork the library.

Also, there is already some discussion about changing limits on the fly: openresty/lua-resty-limit-traffic#23

In my understanding, if a user changes the parameters of a policy (from the UI), APIcast restarts and the init and init_worker phases are executed. Is my understanding correct? If so, we don't need to consider changing limits on the fly.

One thing we have to consider is whether we should reset the remaining value when a user changes the parameters of the "fixed window" limiter. #703 resets it, but you think it's not acceptable to reset it, don't you?

The issue (openresty/lua-resty-limit-traffic#23) doesn't look active. How about using the library as it is, keeping the decrement model? For example, when the count is changed from 10 to 20, we increment the (remaining) value by 10. When the window is changed from 3000 to 2000, we decrease the expiration by 1000. A rough sketch of that adjustment is shown below.
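A sketch of that adjustment, assuming the old parameters are known at the time of the change (key names and the call site are hypothetical; this keeps the library's decrement model and only patches the stored value and TTL):

```lua
-- Apply the delta between the old and new config to the decrement-model key:
-- the stored value is the remaining count, and the key's TTL is the remaining window.
local function adjust(red, key, old_count, new_count, old_window, new_window)
  if new_count ~= old_count then
    -- e.g. count changed 10 -> 20: add 10 to the remaining value
    red:incrby(key, new_count - old_count)
  end

  if new_window ~= old_window then
    local ttl = red:ttl(key)
    if ttl and ttl > 0 then
      -- e.g. window changed 3000 -> 2000: shorten the remaining expiration by 1000
      red:expire(key, math.max(ttl + (new_window - old_window), 1))
    end
  end
end
```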

mikz commented 6 years ago

@y-tabata

If we use the client IP address, the scope will be the client. That templating will be useful when we implement a client scope in this policy in the future, but for the service or the global scope, we'd like to use other templating.

The client IP is just an example. With templating, the value can be any parameter from the request; that might include, for example, a parsed JWT token. We don't want to extend the policy to add a scope for every kind of thing people might want to rate limit by.

Service and global scope still make sense. For example, in our SaaS we won't allow people to use the global scope. So even if they have a templated rate limiting key {{ ngx.var.remote_addr }}, because of the service scope it would get prefixed with the service id into something like service_id:XX/{{ ngx.var.remote_addr }}. And now imagine a scenario where you want to combine those, for example rate limiting a path per client IP: {{ ngx.var.remote_addr }}/{{ ngx.var.uri }}. The nginx examples work on the same principle. The rate limiting key is known only when the request arrives, because it uses variables like $remote_addr.

We want to allow users to rate limit by anything available in the context (any nginx variable, parsed JWT, etc.) without us doing the work.
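For illustration, a key built that way might be composed roughly like this at request time (the template syntax matches the examples above, but the helper and its scope handling are hypothetical, not APIcast's actual policy code):

```lua
-- Illustration only: interpolate nginx variables into the templated key in the
-- rewrite phase and prefix it according to the policy scope.
local function build_key(template, scope, service_id)
  -- replace {{ ngx.var.NAME }} placeholders with the request's values
  local key = template:gsub("{{%s*ngx%.var%.([%w_]+)%s*}}", function(name)
    return ngx.var[name] or ""
  end)

  if scope == "service" then
    -- service scope: namespace the key per service, e.g. "service_id:42/1.2.3.4"
    return "service_id:" .. service_id .. "/" .. key
  end

  return key -- global scope: use the interpolated key as-is
end

-- build_key("{{ ngx.var.remote_addr }}/{{ ngx.var.uri }}", "service", 42)
--   --> "service_id:42/1.2.3.4/some/path"
```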

The link is the "fixed window" link. I think "connection" already uses the increment model.

You are right. Sorry for the confusion. Indeed the "connection" is incrementing: https://github.com/openresty/lua-resty-limit-traffic/blob/ef080731b80fe00a385a4aed8a7f7f575d01c152/lib/resty/limit/conn.lua#L53

In my understanding, if a user changes the parameters of a policy (from the UI), APIcast restarts and the init and init_worker phases are executed. Is my understanding correct? If so, we don't need to consider changing limits on the fly.

Unfortunately no. init and init_worker are executed only when APIcast starts/restarts. When it reloads configuration they are not executed (because it does not restart).

We also have to consider cases where there are several gateways running with different configurations (either because of rolling updates, or just because of restarts or because they pull the config at different times).

One thing we have to consider is whether we should reset the remaining value when a user changes the parameters of the "fixed window" limiter. #703 resets it, but you think it's not acceptable to reset it, don't you?

So if the rate limit is changed from 50 to 100, it does not make sense that the app could do more than 100 requests when that happens. If we kept only the "remaining" value and reset it, this could happen, right? (For example, an app that had already used 40 of the original 50 would get a fresh 100, for 140 in total.)

I think the only sane way of making changes/upgrades work is to track the current usage and just apply the limit independently instead of tracking the remaining. One of the nice benefits would be that you can have two gateways with different limits using the same key. Maybe for internal/external network.

The issue (openresty/lua-resty-limit-traffic#23) doesn't look active.

It does not, but it still has some useful information. We can try to revive it by saying we would like to change the model to increments so it can accept changes to the limit, and see if the maintainers are OK with it.

How about using the library as it is, keeping the decrement model? For example, when the count is changed from 10 to 20, we increment the (remaining) value by 10. When the window is changed from 3000 to 2000, we decrease the expiration by 1000.

I think that is a hard problem. For example, if there are 10 gateways and they start pulling configuration at different times, how would you know whether the change was applied or not? What about plain gateway restarts?

Keeping state in sync is a hard problem. And from my experience, it is easier to flip the problem on its head and try to eliminate it rather than fight it.

I'll try to revive openresty/lua-resty-limit-traffic#23 and see if they'd be ok with switching to increment model. IMHO that is the best path forward and would unify how the rate limiters are implemented.

y-tabata commented 6 years ago

@mikz

And now imagine a scenario where you want to combine those, for example rate limiting a path per client IP: {{ ngx.var.remote_addr }}/{{ ngx.var.uri }}. The nginx examples work on the same principle. The rate limiting key is known only when the request arrives, because it uses variables like $remote_addr.

I understand. So users can rate limit by "client" or "method" and so on. In order to do that, we have to add a rewrite method that interpolates the templates, and either add the list of templates to the JSON schema (key { name }) or define some rule for using templates, for example putting the template string in double curly brackets: {{ ngx.var.remote_addr }}

Unfortunately no. init and init_worker are executed only when APIcast starts/restarts. When it reloads configuration they are not executed (because it does not restart).

I understand. So in the access method, we have to check whether the configuration was changed or not, and call an on-the-fly method like set_conn. Regarding connection, set_delay is insufficient. Regarding the fixed window, there are no on-the-fly methods. We should add these methods to lua-resty-limit-traffic.

One of the nice benefits would be that you can have two gateways with different limits using the same key. Maybe for internal/external network.

This is a very nice benefit, I think.

I'll try to revive openresty/lua-resty-limit-traffic#23 and see if they'd be ok with switching to increment model. IMHO that is the best path forward and would unify how the rate limiters are implemented.

Thanks for your help.

mikz commented 6 years ago

@y-tabata

I understand. So users can rate limit by "client" or "method" and so on. In order to do that, we have to add a rewrite method that interpolates the templates, and either add the list of templates to the JSON schema (key { name }) or define some rule for using templates, for example putting the template string in double curly brackets: {{ ngx.var.remote_addr }}

Yes, the template expansion would be performed in the rewrite phase. As for how to template, the plan is to use liquid - the same format 3scale uses on the developer portal and for the nginx configuration of APIcast. I plan to work on this and have a prototype by the end of this week.

I understand. So in the access method, we have to check whether the configuration was changed or not, and call an on-the-fly method like set_conn. Regarding connection, set_delay is insufficient. Regarding the fixed window, there are no on-the-fly methods. We should add these methods to lua-resty-limit-traffic.

I think we might not need that. If both the connection and fixed window rate limiters used increments, then the only other variables would be the limit and the time window.

I think we should encode the time window into the key itself. So, for example, for UNIX timestamp 1526378712 and time window 3600, we would round it down to 1526378400 and use it as part of the key: 1526378400_{{ngx.var.remote_addr}}. Then changing the time window would also change the key, and we would not have to worry about changing the expiration. This could be done by another interpolation, by some flag in the configuration, or by scope.
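A small sketch of that rounding, using the numbers from the example above (the helper name is made up):

```lua
-- Encode the window start into the key: changing the window length automatically
-- produces a different key, so no expiration needs to be adjusted.
local function window_key(base_key, window)
  local now = ngx.time()                     -- e.g. 1526378712
  local window_start = now - (now % window)  -- e.g. 1526378400 for window = 3600
  return window_start .. "_" .. base_key
end

-- window_key("1.2.3.4", 3600) --> "1526378400_1.2.3.4"
-- (the {{ngx.var.remote_addr}} part would already be interpolated at this point)
```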

I'd really try to stick to just using increments, so we don't have to solve the hard problem of synchronizing changes across the fleet of APIcast instances.

mikz commented 6 years ago

I made a simple patch https://github.com/openresty/lua-resty-limit-traffic/pull/34 to change it to increments. I'd appreciate some review scrutiny especially regarding atomicity and race conditions.

mikz commented 6 years ago

https://github.com/3scale/apicast/pull/758 updates APIcast to use that forked version.

y-tabata commented 6 years ago

Can we close this by #703?

mikz commented 6 years ago

Closed by https://github.com/3scale/apicast/pull/703 and https://github.com/3scale/apicast/pull/758