cue-lang / cue

The home of the CUE language! Validate and define text-based and dynamic configuration
https://cuelang.org
Apache License 2.0

Cancellable evaluation in the Go API #1104

Open slushie opened 3 years ago

slushie commented 3 years ago

Is your feature request related to a problem? Please describe. We're using the CUE Go API in an embedded environment where performance is critical. For our use case, we'd like to set a maximum execution time for cue.CompileBytes and friends. In this environment, we may also need to provide cancellation due to external factors, e.g. the CUE source was updated out-of-band, so compilation should halt ASAP.

Describe the solution you'd like My first instinct was to have cue.Context embed context.Context so that it can support cancellation.
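
To make that concrete, here is a purely hypothetical sketch of how a call site might then look. Nothing below exists in cuelang.org/go today; the WithContext option is invented solely to illustrate the shape I have in mind.

// Hypothetical API shape only; cuecontext.WithContext does not exist.
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
defer cancel()

cctx := cuecontext.New(cuecontext.WithContext(ctx)) // invented option
v := cctx.CompileBytes(src)
if errors.Is(v.Err(), context.DeadlineExceeded) {
    // Evaluation was halted mid-flight by the deadline.
}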

Describe alternatives you've considered In this environment, milliseconds count for a lot, so keeping this cancellation in-process is much preferred over, e.g., sending SIGALRM to the process after a timeout.

The current "best" workaround would be to allow execution to continue asynchronously (ie, in a separate goroutine) and then discard the result if the timeout has expired. This requires a couple of extra hoops to set up a second-level goroutine and signalling channel, and it burns CPU, but it generally works because the go process itself is relatively short lived. However, this also means we need to artificially limit the size of the cue source to prevent resource starvation, rather than actively canceling when resource limits are hit.

Additional context Our use case is a very "hot path" where milliseconds equal dollars. Given this context, I'm happy to help with the implementation, because it would help me better understand the source 😎

myitcv commented 3 years ago

Thanks for raising this here from Slack, @slushie.

Generally speaking, CUE should always be fast, even on configurations at Google scale. There has been a great deal of improvement in the evaluation engine to that end, but there are many performance optimisations still on the table. So whilst I understand your request, I think we need to spend time, at least initially, understanding why these configurations are taking time. Can you share details of configurations that are slow to evaluate?

A somewhat related question: if these are limited environments, is there perhaps a model whereby you run CUE outside of the environment and push the result to the embedded environment? I ask that question with very little context, so please excuse me if I'm off the mark.

slushie commented 3 years ago

Thanks for following up, @myitcv. I apologize if I framed this poorly; we're definitely not complaining about the speed of execution -- CUE's great performance is one of the reasons we chose this technology for the current project. Due to the business demands of this use case, and because the CUE code will be written by "semi-trusted" authors, we're trying to proactively mitigate the potential for unbounded execution. To that end, I'm also curious about the big-O complexity of CUE evaluation, although understanding this is not a requirement for us to move forward.

But I don't want to get too abstract with this ticket 😅 Let me describe a bit about our use case:

The tool we're building merges configuration from multiple authors into a single output config blob; that config is expressed in a structured data language that describes specific operations on Kubernetes primitives. CUE makes good sense here, because it enables our code to operate deterministically across authorship boundaries and to run hermetically in our trusted environment. In the current implementation, configuration is written in either YAML or CUE, and its evaluation returns schematized data that describes the discrete operations to be applied during merging. Configuration written in CUE can also accept arguments, whose schema is enforced by the tool through CUE.

In practice, it looks like this:

// one.cue
operations: {
  add_option: foo: arguments.foo
}
arguments: {
  foo: string
}
// two.cue
operations: {
  add_option: bar: *arguments.bar | 42
}
arguments: {
  bar?: int
}
# three.yaml
add_option:
  baz: "static baz value"
remove_option:
  bop: ~
# running the tool
$ ./merge-tool --argument foo="some value"

In the first stage, we collect the CUE and YAML into a module, along with the input arguments passed in by the tool's caller, and then fetch the module dependencies. The declared arguments are used to validate the caller's input. The output of evaluating this module is a series of discrete operations (in the example, three additions and one removal). In the next stage, the tool processes each operation in an unspecified order, and the output is a set of modified Kubernetes objects. We also do some validation at the end of the entire process, mostly as a sanity check for now.
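
For illustration, the first stage might look something like this with the current Go API, assuming the CUE and YAML sources have already been merged into a single module. evalOperations, mergedSrc, and the argument handling here are a sketch, not our actual tool.

package main

import (
    "fmt"

    "cuelang.org/go/cue"
    "cuelang.org/go/cue/cuecontext"
)

// evalOperations unifies the caller-supplied arguments into the merged
// module and decodes the resulting operations for the merge stage.
func evalOperations(mergedSrc []byte, args map[string]string) (map[string]interface{}, error) {
    v := cuecontext.New().CompileBytes(mergedSrc)
    for name, val := range args {
        // CUE checks each value against the declared schema
        // (e.g. foo: string) during unification.
        v = v.FillPath(cue.ParsePath("arguments."+name), val)
    }
    if err := v.Validate(); err != nil {
        return nil, err
    }
    var ops map[string]interface{}
    if err := v.LookupPath(cue.ParsePath("operations")).Decode(&ops); err != nil {
        return nil, err
    }
    return ops, nil
}

func main() {
    src := []byte("operations: add_option: foo: arguments.foo\narguments: foo: string")
    fmt.Println(evalOperations(src, map[string]string{"foo": "some value"}))
}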


Because the CUE code is supplied by various authors, and because we expect to perform this operation on the order of a few million times per day, we want to explicitly limit the runtime of the CUE evaluation step. Given CUE's design characteristics (i.e. not being Turing-complete), I believe most code will safely halt within the expected time frame. Really, though, we're just being defensive against malicious or pathological cases.

Given this context, I wonder if you have any recommendations 😄 Thanks!

verdverm commented 3 years ago

@slushie, here are some issues you will find interesting:

There are users reporting runtimes of multiple minutes, and it's definitely possible to OOM your process; crafting such CUE is not overly difficult. It depends on what you mean by "semi-trusted".

There are some ways to improve runtime if it gets out of hand, but they reduce reusability and/or will likely change as the evaluator improves.

myitcv commented 3 years ago

@slushie

I apologize if I framed this poorly, we're definitely not complaining about the speed of execution -- CUE's great performance is one of the reasons we chose this technology for the current project.

Absolutely no apology needed! As @verdverm pointed out, there are a number of known performance problems. So whilst it would be unfortunate if you had run into one of those, it wouldn't be altogether surprising. The good news is that we have identified the root cause of all of them and @mpvl has optimisations up his sleeve.

On that point, @verdverm, I think it's worth highlighting that whilst some people are reporting long evaluation times, we have identified optimisations, some requiring changes in the evaluation engine, that would make these performance problems disappear. It's still true that performance can be affected by pathological cases of comprehensions over large structs/lists, but that's a somewhat different problem.

@slushie, to highlight one aspect of CUE that might be of interest: cue help filetypes explains how various qualifiers can be used to limit the CUE under evaluation, e.g. cue+data requires concrete input, cue+graph allows references, etc. You might need/want to apply similar constraints in your situation, depending on the level of trust.
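
For example, via the Go API; my assumption (worth verifying) is that load.Instances accepts the same qualifier syntax as the command line, since the CLI passes its file arguments through to it.

package main

import (
    "fmt"

    "cuelang.org/go/cue/cuecontext"
    "cuelang.org/go/cue/load"
)

func main() {
    // "data:" qualifies the files that follow it, forcing input.cue to be
    // treated as concrete data (assumption: qualifiers are handled inside
    // load.Instances just as on the CLI).
    insts := load.Instances([]string{"data:", "input.cue"}, nil)
    if err := insts[0].Err; err != nil {
        fmt.Println("load:", err)
        return
    }
    v := cuecontext.New().BuildInstance(insts[0])
    fmt.Println(v, v.Err())
}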

The current "best" workaround would be to allow execution to continue asynchronously (ie, in a separate goroutine) and then discard the result if the timeout has expired. This requires a couple of extra hoops to set up a second-level goroutine and signalling channel, and it burns CPU, but it generally works because the go process itself is relatively short lived.

This sounds like a reasonable workaround for now, I agree.

However, this also means we need to artificially limit the size of the cue source to prevent resource starvation, rather than actively canceling when resource limits are hit.

When you say "resource limits are hit" do you have something specific in mind?

This is somewhat related to discussions about the performance of CUE, especially in the context of unity, our regression and performance testing setup: it's far easier to measure how complex a configuration is by counting things like the number of unifications and disjunctions than by relying on time, because (at least in our limited experience) time is heavily influenced by the load of the machine and other factors.

mpvl commented 2 years ago

Even with the performance problems fixed, as long as there are comprehensions it is easy to construct something that will take an inordinate amount of time. So supporting cancellable evaluation is a good feature to have either way.
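
To illustrate the point (a sketch, with numbers picked arbitrarily): two chained comprehensions over modest ranges already force a million element evaluations, with no Turing-complete construct in sight, and growing either range multiplies the work.

package main

import (
    "fmt"
    "time"

    "cuelang.org/go/cue"
    "cuelang.org/go/cue/cuecontext"
)

func main() {
    // 1000 x 1000 = 1e6 elements from two chained for clauses.
    const src = `
import "list"

out: [for i in list.Range(0, 1000, 1) for j in list.Range(0, 1000, 1) {x: i, y: j}]
`
    start := time.Now()
    v := cuecontext.New().CompileString(src)
    // Force full evaluation; compilation alone may be lazy.
    err := v.Validate(cue.Concrete(true))
    fmt.Println(err, time.Since(start))
}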