clarkmcc / cel-rust

Common Expression Language interpreter written in Rust
https://crates.io/crates/cel-interpreter
MIT License

Add `Value::Dynamic` variant #92

Open Caellian opened 2 weeks ago

Caellian commented 2 weeks ago

It would be a nice improvement if lazy values were supported (like std::cell::LazyCell).

These variables would evaluate their value only when used, at which point they'd turn into some other Value variant.

An example where this would be useful: variables bound to an external API, where eagerly evaluating all of them is expensive if most end up unused.

There are ways of achieving this functionality already (with functions), but it would be nice if users could refer to these values with member.syntax.
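
A minimal sketch of the ergonomics I have in mind, assuming a hypothetical `add_lazy_variable` method (the method name, the closure-to-`Value` conversion, and `expensive_license_lookup` are all made up for illustration, not part of the crate today):

```rust
use cel_interpreter::{Context, Program};

// Stand-in for a slow external call (e.g. an HTTP request).
fn expensive_license_lookup() -> String {
    "MIT".to_string()
}

fn main() {
    let mut context = Context::default();
    // Hypothetical API: the closure runs only if the expression actually
    // reads `license_is_mit`, and its result then replaces the lazy slot.
    context.add_lazy_variable("license_is_mit", || {
        expensive_license_lookup() == "MIT"
    });

    let program = Program::compile("license_is_mit == true").unwrap();
    let result = program.execute(&context).unwrap();
    println!("{result:?}");
}
```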

clarkmcc commented 2 weeks ago

This sounds like a similar solution to the problem proposed here https://github.com/clarkmcc/cel-rust/issues/43 and would possibly be solved by the proposed trait-based value design here https://github.com/clarkmcc/cel-rust/issues/76. Does that sound right, or are there additional pros/cons to this approach over the approaches discussed in those issues?

Caellian commented 2 weeks ago

The #43 proposal only works at the root; having a Lazy variant would allow every level of github.orgs['clarkmcc'].projects['cel-rust'].license to be lazy. Another difference is that cel-rust wouldn't allow values to change during expression evaluation. The resolver approach allows someone to use randomness, which means that hello == hello could evaluate to false.

@chirino, what's your opinion on this?

The same linear-time evaluation remark you made in #47 applies here. But it's not something I'd adhere to at all costs, because it's not a good metric. In this context, constant time means evaluating an expression always takes 1000 ms because of a non-negotiable upfront cost, while variable complexity would be 20-1000 ms depending on which fields are actually read; if an HTTP request has to happen, it still has to happen, whether it's handled by a registered handler or by the code that uses cel-rust.

alexsnaps commented 2 weeks ago

I've been wondering whether the laziness should really be applied at interpreter time. Obviously that's the only way to avoid populating unnecessary state, but in CEL that's relatively rare. So, wouldn't it be more sensible to use the parsed expression to resolve all Idents being referenced and their members being navigated, and populate a Map instead? member.syntax lets you navigate a Map just as it does for a String, no differentiation there, is there? So it'd be transparent to the expression author... If that assumption holds (and that's how I will implement it for now in any case), that lets one populate the evaluation's context with the minimal set of state required, e.g. github.orgs['clarkmcc'].projects['cel-rust'].license populates a Map bound to github, containing an entry keyed orgs with a Map containing ... Not sure I'd resolve the key lookups in that example, but request.method wouldn't force me to populate the whole request (with headers, ...) but only the method, made accessible thru the request: Map<String, String> in this case.

clarkmcc commented 2 weeks ago

@alexsnaps are you proposing that there just be an additional step where we clone & convert types into CEL values prior to running the interpreter? This is a crude, non-working example, but I just want to make sure I understand what you're suggesting.

let mut context = Context::default();
let values = HashMap::from([("github", ...)]);
let program = Program::compile("github.orgs['clarkmcc'].projects['cel-rust'].license == true");
program.resolve_or_something(&mut context, values).unwrap(); // <- 
let result = program.execute(&context);
alexsnaps commented 2 weeks ago

Yeah, something from the crate could help, but I've (literally just now) started doing it outside of it... Something analogous to Expression::_references, where it'd find all "member paths", so that something like ['GET', 'POST'].contains(context.request.method) would resolve to an entry in the Vec of paths: ['context', 'request', 'method']. From there I create a Context with a root binding, resolving the Ident context (ctx.add_variable_from_value), of type Map that contains a String key request resolving the Attribute, which itself has a value of type Value::Map containing method. I resolve what this all means to my domain's model and inject the Value::String into the Map. Now the members are operating on a Map instead of a lazily populated Struct, but I think this wouldn't make a difference to the author of the expression.
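
To illustrate the shape of the data, here's a rough sketch of building the minimal nested map from the gathered paths. It uses a stand-in `Node` type rather than the crate's `Value`, and leaves out the final conversion into `Value::Map` and the `add_variable_from_value` call:

```rust
use std::collections::HashMap;

/// Stand-in for a CEL map value; in practice this would be converted into the
/// crate's `Value::Map` and bound with `add_variable_from_value`.
#[derive(Debug)]
enum Node {
    Map(HashMap<String, Node>),
    Leaf(String),
}

/// Insert one resolved value at a path like ["request", "method"] (the root
/// ident, here `context`, becomes the binding name), creating intermediate
/// maps as needed.
fn insert_path(root: &mut HashMap<String, Node>, path: &[&str], value: String) {
    match path {
        [] => {}
        [last] => {
            root.insert((*last).to_string(), Node::Leaf(value));
        }
        [head, rest @ ..] => {
            let entry = root
                .entry((*head).to_string())
                .or_insert_with(|| Node::Map(HashMap::new()));
            if let Node::Map(children) = entry {
                insert_path(children, rest, value);
            }
        }
    }
}

fn main() {
    // Path gathered from `['GET', 'POST'].contains(context.request.method)`.
    let mut context_binding = HashMap::new();
    insert_path(&mut context_binding, &["request", "method"], "GET".to_string());
    // `context_binding` now holds only the state the expression needs.
    println!("{context_binding:#?}");
}
```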

alexsnaps commented 2 weeks ago

Still WIP, but here is a passing test for what this would look like... or at least the first part of it, i.e. finding all the paths. I'm working on populating that Value::Map now with only the members required by the expression. Hope this clarifies. Dunno if that'd suffice for other use-cases tho... would love to hear what this does not cover.

Caellian commented 2 weeks ago

member.syntax lets you navigate a Map just as it does for a String, no differentiation there, is there?

Depends on how you look at it. Strictly speaking, no, because a string has no members. But after parsing, if you're not using a literal, you'll have an identifier which needs to be resolved.

Obviously that's the only way to avoid populating unnecessary state, but in CEL that's relatively rare.

I think the cost of 1 extra enum variant (maybe a larger jump/cache miss?) isn't really that severe for how much it simplifies the code I'm working on.

So, wouldn't it be more sensible to use the parsed expression to resolve all Idents being referenced and their members being navigated, and populate a Map instead?

lets one populate the evaluation's context with the minimal set of state required, e.g. github.orgs['clarkmcc'].projects['cel-rust'].license populates a Map bound to github, containing an entry keyed orgs with a Map containing ...

What benefits does it offer over having lazy values? You still do the same thing, only now with 5 extra steps, 1 extra place where you need to transcribe data structs into code, 1 additional dependency, and more redundant compute/work.

Each place you'd try running a CEL Program would need to:

I personally don't want to do that; it's just a lazy value. Why would I need to handle parsing and expression traversal just to run author.role() and store the result in a Map? I chose CEL because I don't want to write an expression parser and interpreter, not because I want to manually handle lexing of tokenized strings.

Ideally, you'd construct Context once before you evaluate any CEL expressions, and re-use (mostly) the same Context for everything, optionally adding root level values where it makes sense.

It's also unsafe to assume identifiers have any meaning without a Context; assuming they do before constructing the Context will lead to issues because you now have two places that have to take care of identifier semantics. It offers no additional benefit.

alexsnaps commented 2 weeks ago

member.syntax lets you navigate a Map just as it does for a String, no differentiation there, is there?

Depends on how you look at it. Strictly speaking, no, because a string has no members. But after parsing, if you're not using a literal, you'll have an identifier which needs to be resolved.

Haha! Obviously, my bad, I meant Struct not String 🤦

Obviously that's the only way to avoid populating unnecessary state, but in CEL that's relatively rare.

I think the cost of 1 extra enum variant (maybe a larger jump/cache miss?) isn't really that severe for how much it simplifies the code I'm working on.

Are you talking about the Value enum here? Having this Value::Lazy? What would that variant's ValueType be?

Each place you'd try running a CEL Program would need to:

Can you elaborate on what the use case is here? I.e. what changes the most: the CEL expressions themselves, or the Context on which they operate? I'm assuming the latter, but now I'm hesitating based on what you're saying. Also, is your "application" (whatever evaluates the expression) aware of the semantics of the Context, or are these two decoupled?

I chose CEL because I don't want to write an expression parser and interpreter, not because I want to manually handle lexing of tokenized strings.

Did you look at the "reference" implementation? How do you think that'd work there? I'm wondering how, for instance, this could be made to work when injecting protobuf data into the context? That relates to a few of the things that have been discussed here wrt Serde et al...

Anyways, a bit like how Serde deals with Arc fields on structs, I'm thinking dealing with OnceCell might be a thing, i.e. linking a Rust struct to the context, where the OnceCell would get initialized... unsure how tho, given the API 🤔

Caellian commented 1 week ago

Are you talking about the Value enum here? Having this Value::Lazy? What would that variant's ValueType be?

It would be returned from an internally stored field that's inferred from the result type of the provided function:

pub struct LazyValue {
  eval: Box<dyn ErasedProducer>,
  value_type: ValueType,
  current: Option<Value>,
}

impl LazyValue {
  pub fn new<R>(ctor: impl Into<Producer<R>>);
  pub fn value_type(&self) -> ValueType;
  pub fn reset(&mut self);
}

// Incomplete type projection magic:
type Producer<R> = Box<dyn Fn(&Value) -> R>;
trait ErasedProducer;
impl ErasedProducer for Fn(&Value) -> impl ResultType;
trait ResultType; // projects `R` into `ValueType` stored in `ErasedProducer`
impl ResultType for Into<Value>;

Calling ctx.add_lazy_variable("name", impl Into<Producer<R>>) directly constructs Value::Lazy(LazyValue).
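
To make the erased-producer part above a bit more concrete, here's a self-contained sketch with stand-in `Value`/`ValueType` types (these are not the crate's definitions, and I've dropped the `&Value` argument for brevity). The point is that the closure's return type fixes the `ValueType` at construction time while the closure itself stays type-erased:

```rust
/// Stand-ins for the crate's `Value` and `ValueType`; illustrative only.
#[derive(Debug, Clone)]
enum Value {
    String(String),
    Bool(bool),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ValueType {
    String,
    Bool,
}

impl From<String> for Value {
    fn from(s: String) -> Self { Value::String(s) }
}
impl From<bool> for Value {
    fn from(b: bool) -> Self { Value::Bool(b) }
}

/// Maps a Rust return type to its CEL value type at compile time.
trait ResultType: Into<Value> {
    const TYPE: ValueType;
}
impl ResultType for String { const TYPE: ValueType = ValueType::String; }
impl ResultType for bool { const TYPE: ValueType = ValueType::Bool; }

/// Object-safe producer, so differently typed closures can sit behind one Box.
trait ErasedProducer {
    fn produce(&self) -> Value;
    fn value_type(&self) -> ValueType;
}

impl<R: ResultType, F: Fn() -> R> ErasedProducer for F {
    fn produce(&self) -> Value { self().into() }
    fn value_type(&self) -> ValueType { R::TYPE }
}

fn main() {
    let lazy: Box<dyn ErasedProducer> = Box::new(|| "owner".to_string());
    assert_eq!(lazy.value_type(), ValueType::String); // known without evaluating
    println!("{:?}", lazy.produce()); // the closure only runs here
}
```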

Did you look at the "reference" implementation? How do you think that'd work there?

I assume it would use something similar to the above, and then use the reflect module to determine the ValueType from the function. In some ways a Go implementation would be simpler than the Rust one, because reflect exists, so compile-time type projection isn't necessary.

I'm wondering how, for instance, this could be made to work when injecting protobuf data into the context?

Value::Lazy is an internal detail. Only the library and Rust code are aware of it. Protobuf data would be deserialized into other (existing) variants, not Lazy, because Lazy is intended only for manually assigned variables.

I'm not familiar with protobuf though, so I'm not sure if it has a similar concept, but if it does, the only case where a Value::Lazy couldn't be constructed is when the return type of the lazy function can't be inferred.

chirino commented 1 week ago

If I have a map with millions of keys, and I want to avoid converting all those keys to CEL values, would this lazy idea help? Or do you still end up inserting millions of lazy entries into the CEL context?

alexsnaps commented 1 week ago

Did you look at the "reference" implementation? How do you think that'd work there?

I assume it would use something similar to the above, and then use the reflect module to determine the ValueType from the function. In some ways a Go implementation would be simpler than the Rust one, because reflect exists, so compile-time type projection isn't necessary.

My question was unclear here, I was wondering how (if at all) one would address that with the golang implementation without changing it. The golang version is much more complete than the Rust one here...

Caellian commented 1 week ago

chirino: If I have a map with millions of keys, and I want to avoid converting all those keys to CEL values, would this lazy idea help? Or do you still end up inserting millions of lazy entries into the CEL context?

For the most part, no. It's only useful for cases like the one you show in #43 where you'd use a jump table for lookup in the resolver. If the data is fully dynamic (e.g. the value of each field is field_name + "_field" and any field identifier is valid), then this can't represent it. If the entry count is finite and most entries are larger than the size of the LazyValue, then it would help, but it's still not the best solution.

This mostly addresses the high evaluation cost members might carry, not so much Map length/memory limitations.

I'd personally approach dynamic members via a trait (the trait from #76, example impl in #96). It's not dissimilar to #58; the difference is that it uses a trait that can be inserted at any level of the tree structure, and the parent of the identifier is type-checked.
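
Roughly the shape I have in mind (the trait name and methods here are mine for illustration, not what #76/#96 literally define, and the stand-in types aren't the crate's):

```rust
use std::fmt::Debug;

/// Hypothetical trait: a node that resolves its members on demand, insertable
/// at any level of the value tree.
trait DynamicMembers: Debug {
    /// Resolve `name` to a child node, or `None` if the member doesn't exist
    /// (letting the interpreter report an unknown-member error).
    fn member(&self, name: &str) -> Option<Box<dyn DynamicMembers>>;
    /// Stand-in for converting a leaf node into a concrete CEL value.
    fn as_string(&self) -> Option<String>;
}

/// Illustrative node: fetches its children lazily (hard-coded here, but this
/// is where an external API call would go).
#[derive(Debug)]
struct Project;

impl DynamicMembers for Project {
    fn member(&self, name: &str) -> Option<Box<dyn DynamicMembers>> {
        (name == "license")
            .then(|| Box::new(Leaf("MIT".to_string())) as Box<dyn DynamicMembers>)
    }
    fn as_string(&self) -> Option<String> { None }
}

#[derive(Debug)]
struct Leaf(String);

impl DynamicMembers for Leaf {
    fn member(&self, _name: &str) -> Option<Box<dyn DynamicMembers>> { None }
    fn as_string(&self) -> Option<String> { Some(self.0.clone()) }
}

fn main() {
    // What the interpreter would conceptually do for `project.license`.
    let project = Project;
    let license = project.member("license").expect("unknown member");
    assert_eq!(license.as_string().as_deref(), Some("MIT"));
}
```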

If both concepts were implemented, I see myself using both.

alexsnaps: My question was unclear here, I was wondering how (if at all) one would address that with the golang implementation without changing it. The golang version is much more complete than the Rust one here...

That's not a question then, but a statement: "I don't think this is a good suggestion because it deviates from reference implementation."

I agree, it does. It also doesn't interfere with existing functionality/spec. The spec expects all data present in CEL to be precomputed (in order to serialize it with protobuf), which isn't ideal in some use cases. The example you provided includes an additional lexing pass just to collect that data in advance. If fetching expensive data is common enough, it makes more sense to provide an ergonomic way to do it instead of expecting users to construct dynamic data structures based on the identifiers used in the expression (because that will be slower 90% of the time).

My requirements differ enough that I'm using a fork of this library, but I still see no reason why this is a bad idea. And "reference implementation doesn't have it" isn't a good reason IMO (see Weston). It can be controlled with a feature in case you need no-alloc or have a kink for unsafely parsing all your CEL expressions to check for requirements.

alexsnaps commented 1 week ago

alexsnaps: My question was unclear here, I was wondering how (if at all) one would address that with the golang implementation without changing it. The golang version is much more complete than the Rust one here...

That's not a question then, but a statement: "I don't think this is a good suggestion because it deviates from reference implementation."

It is a question. I really don't know if that is feasible one way or another. I was not arguing that it shouldn't be done because the reference implementation doesn't do it. I'd argue that, out of the box, it'd be desirable for this crate to behave as the golang implementation does, if only to provide portability of CEL expressions (and their behavior, including in terms of data accesses) across the two implementations. But as you rightfully mentioned, things can be extended in non-"compliant" ways (through a feature flag, which has already been discussed even for the golang implementation) or made possible through the introduction of certain hooks and/or SPIs.

So from your answer tho, I guess you do know this isn't feasible with the golang implementation either, is that right?

My requirements differ enough

This is where it's hard to understand where you are coming from, given how little you share about the problem while focusing solely on your solution. Let me showcase: in my case, ~another lexing pass~ traversing the AST is required to gather the "property accesses", but only once. We have CEL expressions that "barely change" (at least compared to how many times they get evaluated, i.e. millions). So that additional traversal of the AST is a relatively small price to pay, as we store the result and are ready to populate the data when evaluation time comes. Furthermore, planning for that retrieval would actually be faster than lazily loading it, as we can now implement a reasonable fetch plan for the data and get it in bulk. All that is not to disqualify your use case, very much the opposite: understanding the choices and tradeoffs might help, and I do realize we aren't all solving the same problems, but it highlights how the actual use case matters.

Caellian commented 1 week ago

It is a question. I really don't know if that is feasible one way or another. I was not arguing that it shouldn't be done because the reference implementation doesn't do it.

Sorry, it seemed like a loaded question.

So from your answer tho, I guess you do know this isn't feasible with the golang implementation either, is that right?

I'm not the best person to ask (as I've said before) as I've never touched protobufs and have very limited experience with golang. It's not currently supported, but it is feasible.

Looking at the reference implementation, [`google.protobuf.Any`](https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/any.proto) can store ~`std::any::Any` (i.e. any type with bounds `Send + Sync + 'static`). Per spec, this value type is used for types like `Duration` and `Timestamp`, and a `LazyValue` would likely have to do the same. Incidentally, what I did in #96 basically adds `google.protobuf.Any`, only I called it `ExternalValue`. The struct suggested in this issue also has erased types to allow storing it in the `Value` enum, so it could also be stored under `ExternalValue` if need be (as a non-abstract object). The issue with representation is that a `LazyValue` effectively stores a function that needs to be run by the interpreter in order for the interpreter to know what its "replacement" is. Theoretically, it is possible to send machine code via protobuf, albeit not the best idea. Alternatively, some self-contained binary format could be used (e.g. wasm), or something interpreted. The specification and reference implementation only deal with raw data that can be communicated with protobuf, however, so even if a `LazyValue` were sent over, it would show up only as a byte array. That's why I said there's no way to actually send a `LazyValue` and it has to be set programmatically - that way it's handled by cel-rust automatically, and doesn't show up as a `{ code: Vec, value: Option }`.

This is where it's hard to understand where you are coming from, given how little you share about the problem and rather solely focus on your solution.

Oh, pfff, me? Well, uhm... 👉👈

I'm, of course, working on a ✨ground breaking✨ labeling action for GitHub. Randomly stumbled across this crate to parse expressions like:

- if: commit.body.has_word("Wayland") && !commit.body.has_word("X11")
  then:
    - add_label("wayland")
    - add_to_milestone("Wayland")

- if: comment.matches("$assign: @(\w+)") && user.role == "owner"
  then:
    - assign(matches[1])

And now I'm pushing for changes on this project - don't you love FOSS?! /j

In my case I have a single file that's used by multiple contexts - if a Program fails because an identifier isn't defined, the rule is skipped. I could do something similar to what you're doing and sort different rule entries based on the context they require, but catching ExecutionError::UndeclaredReference is much simpler for now. Yes, it could cause unnecessary requests as-is; the forked version will check root-level identifiers first and maybe even allow grouping that way, but that's out of scope for now. I just needed #96 and created issues for what I felt was missing.

So I don't use protobuf; I simply need something that will evaluate expressions, and I don't feel like writing everything from scratch.

I could likely do batching as well if I switched to the GitHub GraphQL API, but that feels like more trouble than it's worth atm. Anyway, user.role is not provided in the initial workflow JSON, and could cause 2 additional requests to be sent. Maybe in some cases it won't even be used, so I don't want to gather data upfront and send 20 requests to build the Context. I have a single Context construction function, and then add root-level variables to it based on the event that triggered the action (comment, PR, issue create, edit, etc.). user is effectively an ID wrapper.
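
For reference, the construction side currently looks roughly like this. The event enum and variable names are simplified stand-ins for the real action; `add_variable_from_value` is the method mentioned earlier in this thread, and I'm assuming plain `String`s convert into the crate's name/value types:

```rust
use cel_interpreter::Context;

/// Simplified stand-in for the webhook payloads the action reacts to.
enum Event {
    Comment { body: String, author_id: String },
    IssueOpened { title: String },
}

fn main() {
    let event = Event::Comment {
        body: "$assign: @caellian".to_string(),
        author_id: "caellian".to_string(),
    };

    // One shared base Context, plus root-level variables per triggering event.
    let mut ctx = Context::default();
    match &event {
        Event::Comment { body, author_id } => {
            // Assumption: &str names and String values convert via Into.
            ctx.add_variable_from_value("comment", body.clone());
            // `user` is effectively an ID wrapper; `user.role` is the part
            // that would be fetched lazily under this proposal.
            ctx.add_variable_from_value("user_id", author_id.clone());
        }
        Event::IssueOpened { title } => {
            ctx.add_variable_from_value("title", title.clone());
        }
    }
    // ...compile and execute the rules against `ctx` as usual.
}
```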

alexsnaps commented 1 week ago

Thanks for sharing the context of your problem!

In the meantime, I've looked at the golang implementation a little further, and I wonder if we're not missing building blocks that would actually let us decouple the process of hydrating the whole context upon creation from the crate. I'm still investigating this and might even start some exploratory work, but I think these are the blocks we miss:

But for now I'm wondering about adding a Value::Struct that would sit as a "proxy" between the interpreter and the actual data, so that accessing a field of that Value::Struct wouldn't require the whole struct to be hydrated, but would only resolve the accessed member, just as for Value::Map. Tho, per spec those structs are to be converted to Maps, and with today's Value::Map, which holds a ... HashMap<Key, Value>, well, it would need to be fully populated.

A bit of rambling, but seeing how the reference implementation does it, i.e. having an open set of Values... this would be a much bigger undertaking. Maybe have a DynValue?

Caellian commented 1 week ago

  • Some abstract value representation. [...], adding a Value::Struct that would sit as a "proxy" between the interpreter and the actual data, so that accessing a field of that Value::Struct wouldn't require the whole struct to be hydrated, but would only resolve the accessed member, just as for Value::Map. Tho, per spec those structs are to be converted to Maps, and with today's Value::Map, which holds a ... HashMap<Key, Value>, well, it would need to be fully populated.

#96, +/- minor tweaks, does precisely that. And it allows for things that aren't sanctioned by the spec, such as DynamicValue == "String" && DynamicValue > 5.

DynType, for data with an unknown layout before evaluation.

This sounds more or less like what I want:

pub struct DynValue {
  eval: Box<dyn Fn(&Context) -> Value>,
  current: RefCell<Option<Value>>,
}

impl DynValue {
  pub fn new<F>(ctor: F) -> Self where F: Fn(&Context) -> Value + 'static;
  // simplified - without explicit RefCell borrowing
  pub fn value_type(&self, ctx: &Context) -> ValueType {
    if let Some(current) = &self.current {
      current.value_type()
    } else {
      let value = (self.eval)(ctx);
      let value_type = value.value_type();
      self.current = Some(value);
      value_type
    }
  }
  #[cfg(feature = "non_spec_compliant")]
  pub fn reset(&mut self);
}

"unknown layout before evaluation" is basically what I'm pushing for (I think). Layout is determined by type, so something without a layout can conceptually be:

Other than that, unsized types ([T]) can't be stored, so I think it's more practical to consider the options that are dynamic when observed at runtime. Having the Value be determined by a function seems like a better design to me, as it's then uniquely owned by the evaluation engine and can't be Dropped while in use or cause other weirdness.
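
Usage would then be along these lines (hypothetical: neither `add_dynamic_variable` nor `DynValue` exist in the crate today, and `fetch_role_from_api` is a placeholder):

```rust
let mut ctx = Context::default();
// The closure runs at most once per evaluation (cached in `current`), and
// only if the expression actually reads `user_role`.
ctx.add_dynamic_variable("user_role", DynValue::new(|_eval_ctx| {
    fetch_role_from_api().into()
}));
```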

I'm repurposing this issue for DynValue then.

clarkmcc commented 1 week ago

Howdy everyone, sorry I was on vacation last week and am just catching up on this. Let me summarize and see if I'm following.

There are two general proposals:

  1. Improve the user experience for manually hydrating the context based on what's referenced in the expression.
  2. Some sort of general value trait or DynValue that could allow for injecting the Value into the context at runtime, but may or may not be spec compliant, and may or may not prevent us from implementing protobuf support https://github.com/clarkmcc/cel-rust/issues/80.

Am I understanding correctly where the discussion has landed?

I need to go brush up on DynValue in the Go implementation and get a better handle on how that works, and what it means for #80.

alexsnaps commented 1 week ago

I think option 2 can absolutely be made compliant with the spec, as this is how the golang impl gets its input ref.Val:

```golang
// Val interface defines the functions supported by all expression values.
// Val implementations may specialize the behavior of the value through the addition of traits.
type Val interface {
    // ConvertToNative converts the Value to a native Go struct according to the
    // reflected type description, or error if the conversion is not feasible.
    //
    // The ConvertToNative method is intended to be used to support conversion between CEL types
    // and native types during object creation expressions or by clients who need to adapt the,
    // returned CEL value into an equivalent Go value instance.
    //
    // When implementing or using ConvertToNative, the following guidelines apply:
    // - Use ConvertToNative when marshalling CEL evaluation results to native types.
    // - Do not use ConvertToNative within CEL extension functions.
    // - Document whether your implementation supports non-CEL field types, such as Go or Protobuf.
    ConvertToNative(typeDesc reflect.Type) (any, error)

    // ConvertToType supports type conversions between CEL value types supported by the expression language.
    ConvertToType(typeValue Type) Val

    // Equal returns true if the `other` value has the same type and content as the implementing struct.
    Equal(other Val) Val

    // Type returns the TypeValue of the value.
    Type() Type

    // Value returns the raw value of the instance which may not be directly compatible with the expression
    // language types.
    Value() any
}
```

Activation is another mechanism, and mapActivation literally talks about lazily supplying the values at eval time:

```golang
// NewActivation returns an activation based on a map-based binding where the map keys are
// expected to be qualified names used with ResolveName calls.
//
// The input `bindings` may either be of type `Activation` or `map[string]any`.
//
// Lazy bindings may be supplied within the map-based input in either of the following forms:
// - func() any
// - func() ref.Val
//
// The output of the lazy binding will overwrite the variable reference in the internal map.
//
// Values which are not represented as ref.Val types on input may be adapted to a ref.Val using
// the types.Adapter configured in the environment.
func NewActivation(bindings any) (Activation, error) {
    if bindings == nil {
        return nil, errors.New("bindings must be non-nil")
    }
    a, isActivation := bindings.(Activation)
    if isActivation {
        return a, nil
    }
    m, isMap := bindings.(map[string]any)
    if !isMap {
        return nil, fmt.Errorf(
            "activation input must be an activation or map[string]interface: got %T",
            bindings)
    }
    return &mapActivation{bindings: m}, nil
}
```