Closed: KevFan closed this issue 2 years ago
There are two things I've started investigating:
tl;dr I don't see how this came to be quite yet, but we should make the model better suited to our needs.
Currently, a `Limit` implements `PartialEq` & `Hash` taking the struct's `max_value` field into account. Nothing seems to actually require this… but furthermore, I'd argue that what makes a `Limit` "unique" within a `Namespace` doesn't depend on its `max_value`, but rather on the `Namespace` it's defined in (more on that later tho), its time window (the `seconds` field), and the `conditions` and `variables` it's being evaluated against. There is also a `name` field that's optional. I'm unclear whether the "same `Limit`" (i.e. same `seconds`, `conditions` and `variables` fields) would be able to exist within a `Namespace` under different `name`s… In any case, when all discriminators have been applied to narrow how to limit access to a route, if multiple `Limit`s are found, the "smaller one" would prevail… as it'd be the first to trip the condition of not being `counter_is_within_limits`.
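To make that concrete, here's a minimal sketch of what such an identity could look like, with `PartialEq` and `Hash` computed from everything but `max_value` (and `name`). The struct is hypothetical and simplified, not Limitador's actual `Limit`:

```rust
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

// Hypothetical, simplified Limit: identity is the namespace, the time window, and the
// conditions/variables it is evaluated against; max_value and name are deliberately
// left out of PartialEq/Hash. A sketch, not Limitador's actual type.
#[derive(Debug, Clone, Eq)]
pub struct Limit {
    pub namespace: String,
    pub seconds: u64,
    pub conditions: BTreeSet<String>, // ordered, so hashing is deterministic
    pub variables: BTreeSet<String>,
    // non-identity fields
    pub max_value: i64,
    pub name: Option<String>,
}

impl PartialEq for Limit {
    fn eq(&self, other: &Self) -> bool {
        self.namespace == other.namespace
            && self.seconds == other.seconds
            && self.conditions == other.conditions
            && self.variables == other.variables
    }
}

impl Hash for Limit {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.namespace.hash(state);
        self.seconds.hash(state);
        self.conditions.hash(state);
        self.variables.hash(state);
    }
}
```

With that in place, two definitions differing only in `max_value` hash and compare as the "same" `Limit`, so a `HashSet<Limit>` could never hold both at once.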
Probably clearly defining what uniquely identifies a Limit would be the first step. And then clearly codify this within the domain model of Limitador (i.e. a `Namespace` contains `Limit`s which have `Counter`s - today, that relationship is "only" explicit within the `InMemoryStorage` as `HashMap<Namespace, HashMap<Limit, HashSet<Counter>>>`). So that a `Namespace`'s `Set<Limit>` would only ever contain valid and applicable `Limit`s.
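For illustration, that ownership could live in the domain model itself, roughly like this (hypothetical types, not the actual code):

```rust
use std::collections::HashSet;

// Stand-in for the identity-based Limit sketched above (PartialEq/Hash ignoring
// max_value); fields elided to keep this snippet self-contained.
#[derive(PartialEq, Eq, Hash)]
struct Limit;

// Hypothetical domain-level ownership: the Namespace holds its Limits directly,
// instead of that relationship living only inside InMemoryStorage's
// HashMap<Namespace, HashMap<Limit, HashSet<Counter>>> nesting.
struct Namespace {
    name: String,
    limits: HashSet<Limit>,
}

impl Namespace {
    // HashSet::replace swaps in the new definition when a Limit with the same
    // identity already exists, so the set can never hold two limits that differ
    // only in max_value.
    fn add_or_replace(&mut self, limit: Limit) -> Option<Limit> {
        self.limits.replace(limit)
    }
}
```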
Since the domain model isn't self-sufficient, but relies on some `Storage` implementation to "obey" these rules, the other implementations do require a little more effort. In the case of Redis we'd ideally change the `key_for_limits_of_namespace` Set to a Map from these `serialized_limit_definition`s to their non-identity fields (`max_value` and nullable `name`?), so as to provide uniqueness of `Limit`s within a namespace at the "right level".
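As a rough sketch of that Redis-side change (the key naming and the identity/mutable split here are made up for illustration; this is not Limitador's current schema), the per-namespace Set of fully serialized limits would become a Hash keyed by the serialized identity:

```rust
use redis::Commands;

// Sketch only: key names and the "identity vs. mutable" JSON split are assumptions,
// not Limitador's actual Redis layout.
fn store_limit(
    con: &mut redis::Connection,
    namespace: &str,
    identity_json: &str, // serialized identity (seconds, conditions, variables)
    mutable_json: &str,  // serialized non-identity fields (max_value, name)
) -> redis::RedisResult<()> {
    let key = format!("limits_of_namespace_{}", namespace);

    // Today (roughly): SADD of the fully serialized limit, so two limits differing
    // only in max_value become two distinct set members.
    // let _: () = con.sadd(&key, full_serialized_limit)?;

    // Proposed shape: HSET keyed by the identity, so re-adding the "same" limit
    // with a new max_value overwrites the previous entry rather than duplicating it.
    let _: () = con.hset(&key, identity_json, mutable_json)?;
    Ok(())
}
```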
Given Limitador is being configured using the "config file", there is code that "only keeps" the intersection of existing (i.e. already in Redis) `Limit`s and the ones in the config file (without resetting their counters), while the ones not yet in Redis are created based off the config file. So that should not be the source of the state observed in this issue.
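That logic roughly amounts to the following (a sketch of the described behavior with a placeholder `Limit`, not the actual implementation):

```rust
use std::collections::HashSet;

// Placeholder; note that in today's code the Limit's equality includes max_value.
#[derive(PartialEq, Eq, Hash)]
struct Limit {
    max_value: i64,
    seconds: u64,
    // conditions, variables, ... elided
}

// Sketch of the "only keep the intersection" behavior described above.
fn reconcile(stored: &mut HashSet<Limit>, from_config_file: HashSet<Limit>) {
    // Limits no longer present in the config file are dropped...
    stored.retain(|l| from_config_file.contains(l));
    // ...limits present in both are kept as-is (counters untouched), and limits
    // only present in the file are created. HashSet::insert keeps the existing
    // element when an equal one is already stored.
    stored.extend(from_config_file);
}
```

Since `max_value` is part of equality today, changing only `max_value` makes the old definition fall out of the intersection (removed, counters and all) while the new one gets created from scratch, which seems to match the restart behaviour in the reproduction below.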
Could it be some Limitador instance was configured in a way so as to have the two `Limit`s defined? The new entry was added and Limitador got restarted before the previous one was removed?
The behavior should be changed for (at least) these tests to pass:
@KevFan not related to the issue, but I think the limit objects are not correctly defined:
{
"namespace": "apicast-ratelimit",
"max_value": 6945,
"seconds": 60,
"name": null,
"conditions": [
"generic_key == slowpath"
],
"variables": [
"generic_key"
]
}
The `generic_key` descriptor entry key should not be in `variables`. `variables` is meant to define one counter per value of the descriptor key listed there. However, at the same time, in `conditions` the limit is constrained to `"generic_key == slowpath"`; thus, the `generic_key` in `variables` is useless. It should be something like this:
{
"namespace": "apicast-ratelimit",
"max_value": 6945,
"seconds": 60,
"name": null,
"conditions": [
"generic_key == slowpath"
],
"variables": []
}
I have been trying to reproduce the issue with no luck. There definitely is an issue, but we need more insight: with the given info, I cannot reproduce it.
Deploy local kind cluster
kind create cluster --name limitador-system --config kind-cluster.yaml
Deploy redis in the `default` namespace
kubectl apply -f redis-deployment.yaml
Content of redis-deployment.yaml
Deploy limitador in the `default` namespace with a limits file
kubectl apply -f limitador-deployment.yaml
Content of limitador-deployment.yaml
Limits file initially
- namespace: test_namespace
max_value: 10
seconds: 60
conditions:
- "req.method == GET"
variables:
- user_id
Expose limitador API locally
k port-forward service/limitador -n default 8080:8080
Get limits
curl http://127.0.0.1:8080/limits/test_namespace 2>/dev/null | yq e -P
- namespace: test_namespace
max_value: 10
seconds: 60
name: null
conditions:
- req.method == GET
variables:
- user_id
Update the config map: only `max_value` changes, from 10 to 20
cat <<EOF >new_limits.yaml
> - namespace: test_namespace
> max_value: 20
> seconds: 60
> conditions:
> - "req.method == GET"
> variables:
> - user_id
> EOF
k create configmap limits-file --from-file=limits-file.yaml=new_limits.yaml -o yaml --dry-run=client | k replace -f -
If you try to inspect the limits, `max_value` is still 10
curl http://127.0.0.1:8080/limits/test_namespace 2>/dev/null | yq e -P
- namespace: test_namespace
max_value: 10
seconds: 60
name: null
conditions:
- req.method == GET
variables:
- user_id
Roll out limitador to read the new configmap
k rollout restart deployment/limitador
Limitador gets restarted
k get pods
NAME READY STATUS RESTARTS AGE
limitador-9dc4879c4-45nr6 1/1 Running 0 7s <-- new born!
redist-8464b6fbc9-74gtx 1/1 Running 0 14m
You may need to restart the port-forward command
k port-forward service/limitador -n default 8080:8080
Get the new limits. There is only one limit, and `max_value` has been replaced with the new value 20
curl http://127.0.0.1:8080/limits/test_namespace 2>/dev/null | yq e -P
- namespace: test_namespace
max_value: 20
seconds: 60
name: null
conditions:
- req.method == GET
variables:
- user_id
That tells me that Limitador is working as expected. And it also tells me that the current implementation of what uniquely identifies a Limit may be right. If we removed `max_value` from the `PartialEq` or `Hash`, the very same process described above would end up not changing anything. I think that the customer would expect that if you change the `max_value`, the limit should be replaced.
Internally, the existing limit is deleted and the new one created. Effectively, the counter has been reset, and I also think that this is somewhat expected. Maybe not ideal, but expected. Changing the `max_value` while keeping the counter as it is may be nice to have, especially for long time periods like `yearly` or `monthly`.
I do not know if keeping the counter is feasible. I need to learn more about Limitador's codebase. I see basic methods to `get_limits`, `add_limit` and delete a limit, but I do not see an `update_limit`.
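For the sake of discussion, an `update_limit` might look like this (purely hypothetical; neither this trait nor the method exist in Limitador, and the real storage API may be shaped differently):

```rust
// Purely hypothetical sketch; this trait is invented for illustration only,
// it is not Limitador's actual limiter/storage API.
pub struct Limit {
    pub namespace: String,
    pub seconds: u64,
    pub conditions: Vec<String>,
    pub variables: Vec<String>,
    pub max_value: i64,
    pub name: Option<String>,
}

pub trait Limits {
    fn get_limits(&self, namespace: &str) -> Vec<Limit>;
    fn add_limit(&mut self, limit: Limit) -> bool;
    fn delete_limit(&mut self, limit: &Limit);

    /// Overwrite the non-identity fields (max_value, name) of the stored limit
    /// matching `limit`'s identity (namespace, seconds, conditions, variables),
    /// leaving its counters untouched. Returns false if no such limit exists.
    fn update_limit(&mut self, limit: &Limit) -> bool;
}
```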
@alexsnaps I would love your feedback
@eguzki Oh okay, thanks, we'll update that :+1:
Yeah, I think this is likely an edge case that is difficult to replicate. In particular, RHOAM customer installs use external redis instances. In the OHSS ticket, interestingly, it notes that the ElastiCache redis instance had a pending alert firing about its availability for a period. This may have been a factor in this edge case. Perhaps it was unable to delete the current limits before reloading the limits from the file. But this is just me guessing :thinking:
> That tells me that Limitador is working as expected. And it also tells me that the current implementation of what uniquely identifies a Limit may be right. If we removed the max_value from the PartialEq or Hash, the very same process described above would end up not changing anything. I think that the customer would expect that if you change the max_value, the limit should be replaced.
Right, more would need to be done to update the limit. In that scenario, we could "just" update the max value on all `Limit`s to be kept when `configure_with` is invoked. Also, as mentioned on Slack, we'd ideally need to implement the same contract in Redis (which currently enforces the Set semantics on identity based off the serialized form), but that requires a change in the format we persist things into Redis. Or we could keep that discrepancy between the two storage models and enforce uniqueness only "in-memory" (i.e. not in the `InMemoryStorage`, but on the hydrated domain instances, in this case at the `Namespace` level?). That'd save us from requiring people to delete their entire Redis state when upgrading to this newer version.
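Concretely, that could look something like the sketch below, where limits from the file are keyed by identity and kept limits only get their `max_value` refreshed (names and shapes are made up; this is not the actual `configure_with`):

```rust
use std::collections::HashMap;

// Made-up identity key and stored value, for illustration only.
#[derive(Clone, PartialEq, Eq, Hash)]
struct LimitId {
    namespace: String,
    seconds: u64,
    conditions: Vec<String>,
    variables: Vec<String>,
}

struct StoredLimit {
    max_value: i64,
    // counters, name, ... elided from the sketch
}

// Sketch of "update the max value on all Limits to be kept" during configuration.
fn configure_with(stored: &mut HashMap<LimitId, StoredLimit>, from_file: HashMap<LimitId, i64>) {
    // Limits that are no longer in the config file are dropped.
    stored.retain(|id, _| from_file.contains_key(id));

    for (id, max_value) in from_file {
        stored
            .entry(id)
            // Kept limit: only max_value is refreshed, so its counters stay as they are.
            .and_modify(|existing| existing.max_value = max_value)
            // Limit only present in the file: created fresh.
            .or_insert(StoredLimit { max_value });
    }
}
```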
> I do not know if keeping the counter is feasible
We could totally keep the counters around in that case (doing the above would actually yield that very result). That being said, I think revamping the domain entirely might be desirable. I opened #70 for us to discuss these issues. There are multiple improvements we could achieve by slowly introducing these changes (e.g. higher concurrency, less blocking, less memory consumption, ...) and I think coming up with a self-contained domain model might be worthwhile (i.e. one that doesn't even let you represent illegal state). How to best model that tho is open for debate. Also, decoupling the persistence (as in to Redis) and possibly the REST API JSON could also be beneficial depending on the domain model we end up defining.
tl;dr we might just want to close this and start the discussion on the data modeling as part of #70 ?
Looking forward to that
See #88
Following up from:
It was noticed that there was a duplicate entry in Limitador that targeted the same domain in RHOAM.
RHOAM limits are loaded into limitador by mounting a config map of the limits yaml and setting the `LIMITS_FILE` env var to this file path [1]. As a customer changes their rate limit quota, this config map is updated and causes a re-scale of limitador pods to pick up the new values. Typically the old entry is updated/removed in order to use the new value; however, in this case another entry was added. This has caused the rate limiting to use the old value rather than the newest value.
[1] https://github.com/integr8ly/integreatly-operator/blob/master/pkg/products/marin3r/rateLimitService.go#L173-L202