Defaults & Overrides (RFC 0009)

Kuadrant / architecture

Architecture Documents

0 stars 10 forks source link

Defaults & Overrides (RFC 0009) #58

Closed guicassolato closed 6 months ago

guicassolato commented 8 months ago

RFC: Defaults & Overrides (Epic issue: Kuadrant/kuadrant-operator#431)

Proposes the extension of the Kuadrant Policy APIs so we support use cases of Defaults & Overrides (D/O) for Inherited Policies, including the following base use cases:

Defaults: policies set at a lower level in the hierarchy supersede ones set (as "defaults") at a higher level, or "more specific beats less specific"
Overrides: policies set at a higher level in the hierarchy supersede ones set at the lower levels, or "less specific beats more specific"

As well as the following derivative cases:

Merged defaults: default policy rules that are merged into the more specific policies (as opposed to an atomic less specific set of rules that is activated only when another more specific one is absent)
Merged overrides: override policy rules that are merged into the more specific policies (as opposed to an atomic less specific set of rules that is activated fully replacing another more specific one that is present)
Constraints: specialization of an override that rather than declaring concrete values, specify constraints for values – typically numeric values and regular patterns (e.g. limited sets) – declared at the lower levels, that is used to "clip" the requested specific values within the boundaries dictated by the constraint, in an override fashion – e.g.: min value, max value, in operator.
Deactivation: specialization that completes a merge default use case by allowing lower level policies to disable ("deactivate") individual defaults set a higher level (as opposed to superseding those defaults with actual more specific policy rules with meaning)

Out of scope:

Requirements: high level policies that declare requirements to be fulfilled by more specific (lower level) policies without specifying concrete default or override values or constraints. E.g.: "an authentication policy must be enforced, but none is provided by default."

Affected APIs:

AuthPolicy
RateLimitPolicy

Non-affected APIs, while these are considered Direct Policies, i.e. with no hierarchical effect:

DNSPolicy
TLSPolicy

thomasmaas commented 8 months ago

Couple of questions:

GEP Examples use singular default and override. The examples in this pr use plural. On purpose?
GEP Examples seem to always use override and default keywords while in the examples in this PR all policies targeting a http route don't use any keyword (and are thus direct policies?). On purpose?

guicassolato commented 8 months ago

Couple of questions:

1. [GEP Examples](https://gateway-api.sigs.k8s.io/geps/gep-713/#inherited-policy-attachment-its-all-about-the-defaults-and-overrides) use singular `default` and `override`. The examples in this pr use plural. On purpose?

2. GEP Examples seem to always use `override` and `default` keywords while in the examples in this PR all policies targeting a http route don't use any keyword (and are thus direct policies?). On purpose?

No, it was not intentional, but because I've actually seen the GEP using both ways. Maybe not in the YAML examples, but throughout the text. For example, where it says:

Settings in overrides stanzas will win over the same setting in a defaults stanza.

Either way, thanks for pointing out. Happy yo change to singular form.

guicassolato commented 8 months ago

Even in the examples actually. Try searching for "overrides:" @thomasmaas. You'll find plural form, for example, in this section of the GEP: https://gateway-api.sigs.k8s.io/geps/gep-713/#inherited-policy-attachment

youngnick commented 8 months ago

Reading over this, and one relevant change that I'm working on in splitting out Direct Attached Policy and Inherited Policy into their own GEPs is this:

I'm changing the definition of Direct and Inherited Policies a little bit.

Direct Policies now target only one object and only affect that object. Before there was some leeway where you could have specified a policy without defaults or overrides at, for example, the Gateway level, and had that affect Routes, but it still basically counted as a Direct Policy.

Now, Inherited Policies are any policy that has effects outside of the resource that it's attached to, regardless of whether there are defaults or overrides stanzas present.

I don't think that anything I'm seeing in this doc contradicts that, but just so you know.

I also really like the use of the word "constraints" for the class of settings that don't directly default or override, but place bounds around possible values. If you don't mind, I'm going to steal that and put it in the Inherited Policy GEP once it's split out.

alexsnaps commented 8 months ago

With the proposal of adding support for CEL expressions in when clauses, and if we really open policies to CEL in general, I'm wondering if we couldn't address much of this proposal (strategy, constraints, ...) with supporting any node in the policy tree to be either T or fn(T): T, where fn would be a CEL expression(s).

In order to achieve that we'd need a way to "escape" into such a CEL expression... I'm unsure if this has been done before or if there are well established patterns to do so, but I'll just use the cel: prefix to a string to identify such "escaping" in the quick examples below. An atomic merge strategy would be the default... in order to merge, e.g. add a key/value to a map, one would escape to cel and "just" add the entry to the map: map_typed_key: cel:value["a"] = my_added_value_for_a... We'd define the binding's name for the node upfront (in this example value), but could be anything. And this example my_added_value_for_a would need to be properly constructed to be a valid entry value for this map. We could force a return to the function, or "just" reuse the binding passed in (in this case, value) to reassign to the policy's tree node.

Now that entry could be conditionally added, e.g. only if it is missing. A CEL expression could also validate that all entries in a map e.g. for rate limits are within certain bounds and clip/clamp it accordingly... effectively acting as constraints on the policy.

While I think this simplifies, I also fear it might open the door to too many complex cases, given the power CEL would add at any given node of the tree that expresses a Policy. Arguably it would only be evaluated once, when a policy is created/modified. Also, we could absolutely restrict what a CEL expression can and cannot do initially and only slowly expand the usage further, with new releases as the need/usecases emerge.

tl;dr I'm torn... this feels like a simple enough idea to cover all of the usecases one can think of, but also feels like this is too much power. But I couldn't talk myself out of it for the last week thinking about it. Anything I miss that makes this a real bad idea?

addendum: I think this would also "solve" (push on the user) a way to merge lists...

guicassolato commented 8 months ago

I'm wondering if we couldn't address much of this proposal (strategy, constraints, ...) with supporting any node in the policy tree to be either T or fn(T): T, where fn would be a CEL expression(s).

Thanks, @alexsnaps. Now that I understand the idea better, I'll try to capture it among the options by updating DESIGN OPTION 3.

The pros you've highlighted very well. I have nothing to add.

I agree with you and also believe we can workaround the con of possibly causing "unmergeable objects" to merge – not only for lists but for maps too! – by pushing on the users. They would be the ones defining the final output of the functions.

Imagining the maps of policy rules re-implemented as lists, an "atomic" set of default authentication rules could look like this:

spec:
  defaults:
    rules:
      authentication:
      - name: x
        value: 1
      - name: y
        value: 2

This would be a case of a value declared as T, not as fn(T): T yet.

I still have one problem with it though. What is the key and what is the value? What if we the policy declared an authorization block as well? Would authentication and authorization be 2 keys or would rules be the one and only key (and authentication and authorization part of the value?)

Maybe assuming the values are always the scalars and the lists, and that the strategy is always a merge...

An equivalent example to the above, but with dynamic value, i.e. using fn(T): T instead of T:

spec:
  defaults:
    rules:
      authentication: |
        cel:[Authentication{name: "x", value: 1},Authentication{name: "y", value: 2}]

Of course, this second example shows the need for having a lib that defines a protobuf message Authentication.

A default that adds an authentication rule if it's not defined in the target would probably have to be declared as an override, I reckon:

spec:
  overrides:
    rules:
      authentication: |
        cel:self.exists_one(a, a.name == 'x') ? self : self + [Authentication{name: "x", value: 3}]

I still have my doubts about the potential "implementation nightmare." We would have to redefine tons of API types, cannot simply reuse current ones (from the APIs without D/O), plus protobuf messages for all of those, etc.

To reduce the burden, though at the expense of loosing things up a bit, we could define CEL function extensions that deal with the dynamic types. Found a few of those in https://tekton.dev/docs/triggers/cel_expressions/ (e.g. parseJSON(), parseYAML.)

My last thought here regards the cognitive complexity for one who reads out a policy to mentally work out all the function outputs. I think this is a point you've raised offline as well, @alexsnaps, but please correct me if I'm wrong.

youngnick commented 7 months ago

I am strongly -1 on any Kubernetes object using an inline function as a value. Sure, this is super flexible, but, as mentioned, both the cognitive overhead for human creators and readers of the policy, and implementation overhead for whoever's implementing the policy, seem terrifyingly complex to me.

To put this another way, I don't think that adding the ability to have what may end up as Turing-complete fields (if they aren't already, I haven't looked at the CEL spec for a while) is going to go well.

alexsnaps commented 7 months ago

My last thought here regards the cognitive complexity for one who reads out a policy to mentally work out all the function outputs. I think this is a point you've raised offline as well, @alexsnaps, but please correct me if I'm wrong.

I am strongly -1 on any Kubernetes object using an inline function as a value. Sure, this is super flexible, but, as mentioned, both the cognitive overhead for human creators and readers of the policy, and implementation overhead for whoever's implementing the policy, seem terrifyingly complex to me.

Yes, that was my point @guicassolato . I think the cognitive overhead might be "a little much" indeed, to say the least. If we have a tool (e.g. gwctl) that could give a user the effective policy on any object would almost be a necessity, and yet probably not even be enough to then have some reason about that how that actually came together by "just reading" the policies themselves with CEL interleaved in them.

From an implementer point of view, i.e. literally mine, I actually think this is a very elegant and actually pretty simple (not easy necessarily tho) to go about it. But don't get me wrong, I'm NOT using this as an argument to actually make it happen. In my experience, this thinking has led to more damage than good in my perspective. The one interesting aspect tho, is how much would it take away if we do not do it? And further more is it reversible (in Kuadrant obviously, not talking about any GEP here)?

To put this another way, I don't think that adding the ability to have what may end up as Turing-complete fields (if they aren't already, I haven't looked at the CEL spec for a while) is going to go well.

That, we could control. Just as validation limits the complexity of CEL expressions. Not that I believe (again), that this justifies the feature... but leaves it up for consideration.

tl;dr while I think there is much that scares me (as pointed out by all) and makes me feel very uncomfortable still, it seems to me like this is actually the "simplest" proposal (vs. complex, i.e. has the less things complected). And I still can't make up my mind for or against it... I'll be rereading the latest GEPs proposal in the meanwhile... 😞

guicassolato commented 7 months ago

I was about to replace DESIGN OPTION 3 (when conditions based on CEL expressions) with the alternative described by @alexsnaps, based on full power of CEL expressions not only to establish conditions but to determine the value to set as a default or as an override as well (mentioned at yesterday’s community call.) But, instead, I will add that as yet another alternative design, thus adding to 6 different options.

Because I’ve changed this a couple times already, including reordering the options and changing their assigned numbers, I will now abandon the numbers altogether and refer by the options unique name.

KevFan commented 7 months ago

Going forward with the recommended API DESIGN 1 - strategy field sounds good to me 👍

guicassolato commented 7 months ago

I still have a few TODOs on this:

[x] Add note about targeting a portion of an object (with and without named route rules, with and without targetRef.sectionName)
- With named route rules and with targetRef.sectionName: remove routeSelectors, enable multiple policies targeting same object as long as at different sections
- With named route rules and without targetRef.sectionName: refactor routeSelectors to use named route rules
- Without named route rule: do nothing
- [x] Add a note about changing maps of policy rules back to lists (and the importance of this for performing merges)
[x] Add a note about matching the policy spec closer to the route spec
[x] Add a note about allowing N:1 policy-target relationship (for App Devs setting D/O between them, for multiple Platform Engineers setting different policy rules on the same Gateway)
[ ] Rename strategy, atomic and merge keywords
[x] Rename deactivations' remove keyword (as unset?)
[x] Add a note about changing a default versus unsetting a default (i.e. users may want to set defaults that cannot be removed)
[ ] Provide examples as text (if just as nice as now)
[x] Add a flow diagram of the implementation logic
[x] Postpone deactivations to Tier 3
[x] Add implementation details on status and effective policy

maleck13 commented 7 months ago

@guicassolato I think proceeding with the strategy based approach initially makes sense. Do you need anything further here? Not sure I should approve at this point or wait for an update?

guicassolato commented 7 months ago

@guicassolato I think proceeding with the strategy based approach initially makes sense. Do you need anything further here? Not sure I should approve at this point or wait for an update?

@maleck13, I'm still pending on a few changes here. Just pushed one 🙂

Early next week, I'll finish those and mark the PR for review. No need to approve in the meantime, though your comments are still welcome ofc.

alexsnaps commented 6 months ago

Looks like this maybe a policy attachment recurring thing… This is huge. Key part to me tho is the tiered approach with which I think I almost completely agree with. Certainly the first one as it effectively "merely has us play by the current state of the policy attachment GEP", adding the defaults & overrides stanza and the labels & status…

Then tho… I'm not sure I want to commit to the plan from there on. I certainly have more questions (adding them to the related parts), but also considering the spec could still change, I'd propose to split this in 3 RFCs, one for each tier of the implementation. Where on the first I'd imagine everyone being ok with…