Kuadrant / limitador


Duplicate limits with same namespace, conditions and variables are allowed #64

Closed KevFan closed 2 years ago

KevFan commented 2 years ago

Following up from:

It was noticed that there was a duplicate entry in Limitador that targeted the same domain in RHOAM:

$ oc -n redhat-rhoam-observability-operator rsh observability-operator-controller-manager-856bc6f7bc-pqxvp 
sh-4.4$ curl 10.11.40.16:8080/limits/apicast-ratelimit
[{"namespace":"apicast-ratelimit","max_value":6945,"seconds":60,"name":null,"conditions":["generic_key == slowpath"],"variables":["generic_key"]},{"namespace":"apicast-ratelimit","max_value":70,"seconds":60,"name":null,"conditions":["generic_key == slowpath"],"variables":["generic_key"]}]
```json [ { "namespace": "apicast-ratelimit", "max_value": 6945, "seconds": 60, "name": null, "conditions": [ "generic_key == slowpath" ], "variables": [ "generic_key" ] }, { "namespace": "apicast-ratelimit", "max_value": 70, "seconds": 60, "name": null, "conditions": [ "generic_key == slowpath" ], "variables": [ "generic_key" ] } ] ```

RHOAM limits are loaded into Limitador by mounting a ConfigMap containing the limits YAML and setting the LIMITS_FILE env var to that file path [1]. When a customer changes their rate limit quota, this ConfigMap is updated, which triggers a re-scale of the Limitador pods so they pick up the new values.
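
Purely for illustration, here is a rough sketch of that loading model (this is not the actual integreatly-operator or Limitador code; the struct and error handling below are assumptions): Limitador reads the path from the LIMITS_FILE env var and deserializes a list of limit definitions from the mounted YAML, so a re-scale or restart is what picks up new values.

```rust
use serde::Deserialize;

// Simplified stand-in for a limit definition; field names follow the JSON/YAML shown
// in this issue, but the struct itself is an assumption for illustration.
#[derive(Debug, Deserialize)]
struct LimitDefinition {
    namespace: String,
    max_value: u64,
    seconds: u64,
    #[serde(default)]
    name: Option<String>,
    #[serde(default)]
    conditions: Vec<String>,
    #[serde(default)]
    variables: Vec<String>,
}

fn load_limits() -> Result<Vec<LimitDefinition>, Box<dyn std::error::Error>> {
    // LIMITS_FILE points at the file mounted from the ConfigMap
    let path = std::env::var("LIMITS_FILE")?;
    let yaml = std::fs::read_to_string(path)?;
    // a re-scale/restart of the pods is what makes Limitador re-read this file
    Ok(serde_yaml::from_str(&yaml)?)
}
```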

Typically the old entry is updated or removed so that the new value takes effect; in this case, however, another entry was added instead. This caused rate limiting to use the old value rather than the newest one.

[1] https://github.com/integr8ly/integreatly-operator/blob/master/pkg/products/marin3r/rateLimitService.go#L173-L202

alexsnaps commented 2 years ago

There are two things I've started investigating:

tl;dr I don't quite see yet how this came to be, but we should make the model better suited to our needs.

Why is this even possible?

Currently, a Limit implements PartialEq & Hash taking the struct's max_value field into account. Nothing seems to actually require this… and furthermore, I'd argue that what makes a Limit "unique" within a Namespace is independent of its max_value: it's rather the Namespace it's defined in (more on that later though), its time window (the seconds field), and the conditions and variables it's being evaluated against. There is also an optional name field. I'm unclear whether the "same" Limit (i.e. same seconds, conditions and variables fields) should be able to exist within a Namespace under different names… In any case, when all discriminators have been applied to narrow down how to limit access to a route, if multiple Limits are found, the "smaller" one would prevail, as it'd be the first to trip the counter_is_within_limits check.
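
As a minimal sketch of what that identity contract could look like, assuming a simplified stand-in for the Limit struct (not Limitador's actual definition): PartialEq and Hash would consider only the namespace, seconds, conditions and variables, so two limits differing only in max_value would be "the same" limit.

```rust
use std::hash::{Hash, Hasher};

// Simplified stand-in for Limit, for illustration only.
#[derive(Debug, Eq)]
struct Limit {
    namespace: String,
    max_value: u64, // deliberately NOT part of identity
    seconds: u64,
    conditions: Vec<String>,
    variables: Vec<String>,
}

impl PartialEq for Limit {
    fn eq(&self, other: &Self) -> bool {
        self.namespace == other.namespace
            && self.seconds == other.seconds
            && self.conditions == other.conditions
            && self.variables == other.variables
    }
}

impl Hash for Limit {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.namespace.hash(state);
        self.seconds.hash(state);
        self.conditions.hash(state);
        self.variables.hash(state);
    }
}
```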

The first step would probably be to clearly define what uniquely identifies a Limit, and then codify this within the domain model of Limitador (i.e. a Namespace contains Limits which have Counters - today, that relationship is only explicit within the InMemoryStorage as HashMap<Namespace, HashMap<Limit, HashSet<Counter>>>), so that a Namespace's set of Limits would only ever contain valid and applicable Limits.
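
One possible shape for such a model, purely as a sketch with illustrative names (not a proposal for Limitador's actual types): the namespace owns a map keyed by the limit's identity, so it simply cannot hold two entries that differ only in their non-identity fields.

```rust
use std::collections::HashMap;

// Identity of a limit within a namespace: everything except max_value (and name?).
#[derive(PartialEq, Eq, Hash)]
struct LimitIdentity {
    seconds: u64,
    conditions: Vec<String>,
    variables: Vec<String>,
}

struct Counter; // remaining hits, expiry, qualified variable values, ...

// Non-identity data hangs off the identity key.
struct LimitEntry {
    max_value: u64,
    name: Option<String>,
    counters: Vec<Counter>,
}

// A Namespace contains Limits which have Counters; because the map is keyed by
// identity, it can never hold two "versions" of the same limit at once.
struct Namespace {
    limits: HashMap<LimitIdentity, LimitEntry>,
}
```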

Since the domain model isn't self-sufficient, but relies on the Storage implementations to abide by these rules, the other implementations do require a little more effort. In the case of Redis, we'd ideally change the key_for_limits_of_namespace Set into a Map from the serialized limit definition to its non-identity fields (max_value and the nullable name?), so as to provide uniqueness of Limits within a namespace at the "right" level.
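
Purely as a sketch of that idea (the key naming, serialization and use of the redis crate here are assumptions, not Limitador's current Redis layout): storing the limits of a namespace in a Redis hash keyed by the limit's identity makes duplicates impossible at the storage level, since re-adding the "same" limit just overwrites its non-identity fields.

```rust
use redis::Commands;

// Hypothetical layout: one hash per namespace, field = serialized identity
// (seconds + conditions + variables), value = the non-identity data (max_value).
fn store_limit(
    con: &mut redis::Connection,
    namespace: &str,
    identity_json: &str,
    max_value: u64,
) -> redis::RedisResult<()> {
    let key = format!("limits_of_namespace:{}", namespace); // assumed key name
    // HSET overwrites the field if it already exists, so two "versions" of the
    // same limit can never coexist within a namespace.
    con.hset(key, identity_json, max_value)
}
```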

How did these two "versions" of the "same" limit even make it in?

Given Limitador is being configured using the "config file", there is code that only keeps the intersection of the existing Limits (i.e. already in Redis) and the ones in the config file (without resetting their counters), while Limits not yet in Redis are created based off the config file. So that should not be the source of the state observed in this issue.
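
In other words, a simplified sketch of that reconciliation (not the actual configure-with code; limits are represented here by an opaque serialized identity string): limits present both in Redis and in the file are left untouched, limits only in the file get created, and limits only in Redis get deleted.

```rust
use std::collections::HashSet;

// Returns (to_create, to_delete); whatever is in both sets is kept with its counters.
fn reconcile(
    existing: &HashSet<String>,
    from_file: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    let to_create: Vec<String> = from_file.difference(existing).cloned().collect();
    let to_delete: Vec<String> = existing.difference(from_file).cloned().collect();
    (to_create, to_delete)
}
```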

Could it be that some Limitador instance was configured in a way that had both Limits defined, i.e. the new entry was added and Limitador got restarted before the previous one was removed?

Changes

The behavior should be changed so that (at least) these tests pass:

```rust
#[test]
fn limit_equality_hash_contract_excludes_max_value() {
    // what about `Option name`?
    let namespace = "test_namespace";
    let period = 60;
    let conditions = vec!["req.method == GET"];
    let variables = vec!["app_id"];

    let limit_one = Limit::new(namespace, 10, period, conditions.clone(), variables.clone());
    let limit_two = Limit::new(namespace, 20, period, conditions.clone(), variables.clone());

    assert_eq!(limit_one, limit_two);

    let mut set = HashSet::new();
    set.insert(limit_one);
    set.insert(limit_two);
    assert_eq!(set.len(), 1);
}

async fn add_only_supports_one_logical_limit(rate_limiter: &mut TestsLimiter) {
    let namespace = "test_namespace";
    let period = 60;
    let conditions = vec!["req.method == GET"];
    let variables = vec!["app_id"];

    let limit_one = Limit::new(namespace, 10, period, conditions.clone(), variables.clone());
    let limit_two = Limit::new(namespace, 20, period, conditions, variables);

    for limit in [&limit_one, &limit_two].iter() {
        rate_limiter.add_limit(limit).await.unwrap(); // second `add` would _not_ add!
    }

    let limits = rate_limiter.get_limits(namespace).await.unwrap();

    assert_eq!(limits.len(), 1);
    assert_eq!(limits.iter().next().unwrap().max_value(), 10);
    assert!(limits.contains(&limit_one));
    assert!(limits.contains(&limit_two));
}

async fn configure_with_deletes_all_except_the_limits_given(rate_limiter: &mut TestsLimiter) {
    let namespace = "test_namespace";

    let limit_to_be_kept =
        Limit::new(namespace, 10, 60, vec!["req.method == GET"], vec!["app_id"]);
    let limit_to_be_deleted =
        Limit::new(namespace, 200, 3600, vec!["req.method == GET"], vec!["app_id"]);
    // limit_to_be_kept and limit_to_be_deleted now have a different definition,
    // beyond their max_value: they apply to different time windows

    for limit in [&limit_to_be_kept, &limit_to_be_deleted].iter() {
        rate_limiter.add_limit(limit).await.unwrap();
    }

    rate_limiter
        .configure_with(vec![limit_to_be_kept.clone()])
        .await
        .unwrap();

    let limits = rate_limiter.get_limits(namespace).await.unwrap();

    assert!(limits.contains(&limit_to_be_kept));
    assert!(!limits.contains(&limit_to_be_deleted));
    assert_eq!(limits.len(), 1);
}
```

eguzki commented 2 years ago

@KevFan not related to this issue, but I think the limit objects are not correctly defined.

 {
    "namespace": "apicast-ratelimit",
    "max_value": 6945,
    "seconds": 60,
    "name": null,
    "conditions": [
      "generic_key == slowpath"
    ],
    "variables": [
      "generic_key"
    ]
  }

The generic_key descriptor entry key should not be in variables. variables is meant to define one counter per value of each descriptor key listed there (for example, variables: ["user_id"] would keep a separate counter per user_id value). However, conditions already constrains the limit to "generic_key == slowpath", so generic_key in variables is useless. It should be something like this:

 {
    "namespace": "apicast-ratelimit",
    "max_value": 6945,
    "seconds": 60,
    "name": null,
    "conditions": [
      "generic_key == slowpath"
    ],
    "variables": []
  }
eguzki commented 2 years ago

I have been trying to reproduce the issue with no luck. There is definitely an issue, but we need more insight, because with the given info I cannot reproduce it.

steps to reproduce

Deploy local kind cluster

kind create cluster --name limitador-system --config kind-cluster.yaml

Deploy redis in the default namespace

kubectl apply -f redis-deployment.yaml

Content of redis-deployment.yaml

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redist
  labels:
    app: redis
spec:
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis
---
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    app: redis
  ports:
    - name: redis
      port: 6379
      protocol: TCP
      targetPort: 6379
```

Deploy limitador in the default namespace with a limits file

kubectl apply -f limitador-deployment.yaml

Content of limitador-deployment.yaml

```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: limits-file
data:
  limits-file.yaml: |
    - namespace: test_namespace
      max_value: 10
      seconds: 60
      conditions:
        - "req.method == GET"
      variables:
        - user_id
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: limitador
  labels:
    app: limitador
spec:
  selector:
    matchLabels:
      app: limitador
  template:
    metadata:
      labels:
        app: limitador
    spec:
      containers:
        - image: quay.io/3scale/limitador:v0.5.1
          name: limitador
          env:
            - name: RUST_LOG
              value: debug
            - name: REDIS_URL
              value: "redis://redis"
            - name: LIMITS_FILE
              value: "/tmp/limitador/limits-file.yaml"
          volumeMounts:
            - name: limits-file
              mountPath: /tmp/limitador
      volumes:
        - name: limits-file
          configMap:
            items:
              - key: limits-file.yaml
                path: limits-file.yaml
            name: limits-file
---
apiVersion: v1
kind: Service
metadata:
  name: limitador
spec:
  selector:
    app: limitador
  ports:
    - name: admin
      port: 8080
      protocol: TCP
      targetPort: 8080
```

Limits file initially

    - namespace: test_namespace
      max_value: 10
      seconds: 60
      conditions:
        - "req.method == GET"
      variables:
        - user_id

Expose limitador API locally

k port-forward service/limitador -n default 8080:8080

Get limits

curl http://127.0.0.1:8080/limits/test_namespace 2>/dev/null  | yq e -P
- namespace: test_namespace
  max_value: 10
  seconds: 60
  name: null
  conditions:
    - req.method == GET
  variables:
    - user_id

Update the config map, changing only max_value from 10 to 20

cat <<EOF >new_limits.yaml
> - namespace: test_namespace
>   max_value: 20
>   seconds: 60
>   conditions:
>   - "req.method == GET"
>   variables:
>   - user_id
> EOF

k create configmap limits-file --from-file=limits-file.yaml=new_limits.yaml -o yaml --dry-run=client | k replace -f -

If you try to inspect limits, max_value is still 10

curl http://127.0.0.1:8080/limits/test_namespace 2>/dev/null  | yq e -P
- namespace: test_namespace
  max_value: 10
  seconds: 60
  name: null
  conditions:
    - req.method == GET
  variables:
    - user_id

Rollout limitador to read the new configmap

k rollout restart deployment/limitador

Limitador gets restarted

k get pods
NAME                        READY   STATUS    RESTARTS   AGE
limitador-9dc4879c4-45nr6   1/1     Running   0          7s    <-- new born! 
redist-8464b6fbc9-74gtx     1/1     Running   0          14m

You may need to restart the port-forward command

k port-forward service/limitador -n default 8080:8080

Get the new limits. There is only one limit, and max_value has been replaced with the new value, 20

curl http://127.0.0.1:8080/limits/test_namespace 2>/dev/null  | yq e -P
- namespace: test_namespace
  max_value: 20
  seconds: 60
  name: null
  conditions:
    - req.method == GET
  variables:
    - user_id

That tells me that Limitador is working as expected, and also that the current implementation's definition of what uniquely identifies a Limit may be right. If we removed max_value from PartialEq or Hash, the very same process described above would end up not changing anything. I think the customer would expect that if you change the max_value, the limit is replaced.

Internally, the existing limit is deleted and the new one created. Effectively, the counter has been reset, and I also think that this is somewhat expected. Maybe not ideal, but expected. Changing the max_value while keeping the counter as it is may be nice to have, especially for long time periods like monthly or yearly.

I do not know if keeping the counter is feasible. I need to learn more about Limitador's codebase. I see basic methods to get_limits, add_limit and delete a limit, but I do not see an update_limit.

@alexsnaps I would love your feedback

KevFan commented 2 years ago

@eguzki Oh okay, thanks, we'll update that :+1:

Yeah, I think this is likely an edge case that is difficult to replicate. In particular, RHOAM customer installs use external Redis instances. Interestingly, the OHSS ticket notes that the ElastiCache Redis instance had a pending alert firing about its availability for a period. This may have been a factor in this edge case: perhaps Limitador was unable to delete the current limits before reloading the limits from the file. But this is just me guessing :thinking:

alexsnaps commented 2 years ago

That tells me that Limitador is working as expected, and also that the current implementation's definition of what uniquely identifies a Limit may be right. If we removed max_value from PartialEq or Hash, the very same process described above would end up not changing anything. I think the customer would expect that if you change the max_value, the limit is replaced.

Right, more would need to be done to update the limit. In that scenario, we could "just" update the max value on all Limits to be kept when configure_with is invoked. Also, as mentioned on Slack, we'd ideally need to implement the same contract in Redis (which currently enforces the Set semantics on an identity based off the serialized form), but that requires a change in the format we persist things into Redis. Or we could keep that discrepancy between the two storage models and enforce uniqueness only "in memory" (i.e. not in the InMemoryStorage, but on the hydrated domain instances, in this case at the Namespace level?). That'd save us from requiring people to delete their entire Redis state when upgrading to this newer version.
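
As a rough sketch of that first option, with illustrative names and a tuple standing in for the limit's identity (this is not Limitador's code): when the new configuration is applied, limits whose identity matches an entry from the file just take the new max_value in place, so their counters survive, while everything else is created or deleted as today.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// namespace, seconds, conditions, variables -- i.e. everything but max_value (and name?)
type LimitId = (String, u64, Vec<String>, Vec<String>);

#[derive(Clone)]
struct Limit {
    id: LimitId,
    max_value: u64,
}

fn apply_new_config(stored: &mut HashMap<LimitId, Limit>, from_file: Vec<Limit>) {
    let wanted: HashMap<LimitId, Limit> =
        from_file.into_iter().map(|l| (l.id.clone(), l)).collect();
    // limits no longer in the file are dropped (their counters go with them)
    stored.retain(|id, _| wanted.contains_key(id));
    for (id, new) in wanted {
        match stored.entry(id) {
            // same identity already stored: only take the new max_value, counters survive
            Entry::Occupied(mut e) => e.get_mut().max_value = new.max_value,
            // genuinely new limit: create it, counters start fresh
            Entry::Vacant(e) => {
                e.insert(new);
            }
        }
    }
}
```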

I do not know if keeping the counter is feasible

We could totally keep the counters around in that case (doing the above would actually yield that very result). That being said, I think revamping the domain entirely might be desirable. I opened #70 for us to discuss these issues. There are multiple improvements we could achieve by slowly introducing these changes (e.g. higher concurrency, less blocking, less memory consumption, ...), and I think coming up with a self-contained domain model might be desirable (i.e. one that doesn't even let you represent illegal state). How to best model that, though, is open for debate. Also, decoupling the persistence (as in, to Redis) and possibly the REST API JSON could be beneficial depending on the domain model we end up defining.

tl;dr we might just want to close this and start the discussion on the data modeling as part of #70 ?

eguzki commented 2 years ago

Looking forward to that

alexsnaps commented 2 years ago

See #88