As a follow-up to this issue and the next steps outlined in https://github.com/cue-lang/cue/discussions/2857, we are arranging an Evaluator and Performance Office Hours session on Thursday 29 February 2024 at 1600 UTC. For more details please see https://github.com/cue-lang/cue/discussions/2862.
This is an update on the progress of the development of the new evaluator, a critical component of making performance improvements in CUE.
The Evaluator Roadmap project is gradually being filled out with issues to communicate our progress in this part of the CUE project. For more information on the approach we are following here, please see https://github.com/cue-lang/cue/discussions/2846.
Back in December 2023, on our last community call, we reported that the new evaluator was close to passing all tests, with the exception of disjunctions. The changes to support this work (i.e. everything except disjunctions) have since landed in the main CUE repo. We will be submitting the final key change in this series soon.
Since December, I have been making good progress with disjunctions. We originally planned to have a simple implementation for disjunctions to speed up the roll out. That turned out to be problematic. The latest implementation (for the new evaluator), although still conceptually simpler than the previous one, took a bit more effort than anticipated. Things are going well now, though.
We now plan to submit the changes to support disjunctions in the new evaluator in an incremental fashion.
A bit more about next steps.
https://github.com/cue-lang/cue/issues/2853 is largely done with the exception of reclaiming memory (see https://github.com/cue-lang/cue/issues/2887) and some esoteric bugs.
We first need to land https://review.gerrithub.io/c/cue-lang/cue/+/1174341, the main change implementing the new evaluator. This is being tracked in https://github.com/cue-lang/cue/issues/2884.
Once the biggest bugs with disjunctions are ironed out (which we are doing via tests and code that is part of Unity), we plan to allow users to play with the new evaluator, enabling it via a CUE_EXPERIMENT. This is being tracked in https://github.com/cue-lang/cue/issues/2886.
Note that this initial experimental version of the new evaluator will not include memory reuse or any of the secondary optimizations, such as structure sharing (see https://github.com/cue-lang/cue/issues/2854). So even though users should see an improvement in the big-O performance of the core evaluator, it might be that we initially see performance get worse in some instances. With such a significant change in implementation, this is somewhat to be expected because the implementations of the evaluator (old/current and new) have very different designs and therefore characteristics.
Once support for enabling the CUE_EXPERIMENT is released, the next step is to enable buffer reuse. We have disabled memory reclamation as doing so simplifies, and thus speeds up, development. Enabling it will give a critical boost. For instance, the old closedness algorithm would accumulate oodles of auxiliary structures that could not be reclaimed (see https://github.com/cue-lang/cue/issues/2853). The new evaluator allows this memory to be reclaimed after any small evaluation increment. Until this is enabled, however, users will not see much benefit from this new algorithm. This is tracked in https://github.com/cue-lang/cue/issues/2887.
We also need to ensure that error messages in the new evaluator are at least on par with the old evaluator. This is tracked in https://github.com/cue-lang/cue/issues/2890. The new evaluator enables more structured capturing of error situations, and therefore better error reporting. Improving error messages is tracked via https://github.com/cue-lang/cue/issues/2891.
After this is done, the real fun starts. The new evaluator has been designed with several additional performance improvements in mind. Each of those can give substantial constant speedup, or even additional big-O improvements. These can all be rolled out incrementally. We will be creating issues as part of the Evaluator Roadmap for each of these.
This update, being the first of its kind, necessarily includes background and context on where we find ourselves today. Future updates (which we aim to post at least fortnightly) will be shorter. In future updates we will provide shorter bullets under the headings of the sub-issues.
Here is a list of the active changes related to the evaluator and performance work, referenced either directly or indirectly above.
Development on the new evaluator is progressing steadily. The main focus remains implementing the new evaluator as the basis for performance and other improvements (#2884) and making the new evaluator available via a CUE_EXPERIMENT (#2886). As a reminder, we track the main areas of work on the evaluator via the Evaluator Roadmap.
At the time of our last update, 87% of all evaluator test files were run and free of P0 or P1 bugs, while 19 files were excluded from the tests entirely.
Now, 94% of these tests pass, with only 3 files excluded from tests. Moreover, where previously many files had many errors, the 23 (out of 346) files that still have errors now have at most a few each.
(As a reminder, we categorize bugs/errors according to their severity: P0 being critical, through to P3.)
We also added some new tests. Here is a new benchmark that we will add:
#def: {
	{} | {c: string}
	{} | {c: string}
	{} | {c: string}
	{} | {c: string}
	{} | {c: string}
	{} | {c: string}
	{} | {c: string}
	{} | {c: string}
	{} | {c: string}
	{} | {a: string}
}
x: #def
x: c: "foo"
The old evaluator completed this with this many core operations:
Unifications: 2590
Conjuncts: 13409
Disjuncts: 4674
Incrementally adding lines showed that this test displayed exponential behavior.
You can collect stats for your own runs using the CUE_STATS_FILE environment variable. This feature is subject to change.
The new evaluator accomplishes the same result with far fewer operations:
Unifications: 28
Conjuncts: 267
Disjuncts: 76
The new evaluator displays linear growth in the number of operations when adding a repeated line. Clearly an improvement.
What made this possible is the capability of the new evaluator to much more accurately detect duplicates, even when a value is only partially evaluated.
That said, there is still some work to be done. This new detection algorithm needs tuning. Some of the cycle detection algorithms are also not fully functional yet. This means that for some of the benchmarks the new evaluator performs worse, for now. But all the techniques that worked for the old evaluator will work for the new one, in addition to all the improvements made possible by the new implementation.
There were also some complications that led to some tweaking of the disjunction algorithm. It turned out that the implementation would be much more straightforward if we had a partial implementation of structure sharing. We opted for this approach. This meant a slight setback in the projected time for implementing disjunctions in the new evaluator. On a positive note, however, it means we are now well on our way to implementing structure sharing. We will not need to finish this before launching the new evaluator in experimental form, but we will be able to introduce structure sharing sooner than expected. This is great, as there are some known performance issues that cannot be solved without it.
We intend to launch the experiment when all known P0 and P1 bugs have been fixed. We anticipate being able to do so before the end of March 2024.
And as a reminder, this is before we improve memory management. The new evaluator currently has a high constant overhead. This is known and expected. We do hope, however, that people will see a big reduction in core operations.
Apologies, this update is arriving a week later than planned.
We are getting close to releasing an experimental version of the new evaluator: we are currently at over 95% of the test files passing.
This seems like only a small increase since last time, when we were at 94%. There is a lot going on, though. The number of skipped files, the most severe category of all, went from 13 to 4. The total number of errors in the 13 remaining erroneous files has also been reduced, which is not reflected in this count, and the severity of the errors has diminished. Finally, we have discovered several new errors in the process that we previously forgot to count, which slowed down the reduction a bit.
In some cases we have also deliberately introduced new minor errors. For instance, we realized that some of the handling of structural cycle detection edge cases, which is not yet implemented for the new evaluator, can be handled much more nicely with structure sharing. We are close to supporting structure sharing. So it seemed more prudent to have a less precise, but simpler handling for now, at the expense of having some spurious structural cycles in the meantime. We can use the saved time to roll out structure sharing sooner.
We have also prepared the integration of the new evaluator into the cue command. So far, most of the cue commands, with the notable exception of cue trim, work with the new evaluator.
The main focus remains completing implementation of the new evaluator (#2884) and making the new evaluator available via a CUE_EXPERIMENT (#2886). There are currently 36 outstanding changes that need to be reviewed and submitted for the experimental evaluator to be available in tip. We will update in the coming week when that work is complete.
Contrary to what we said last time, the initial experimental version will contain some outstanding P0 and P1 bugs. The outstanding bugs mainly concern structural cycle detection (most notably when used to implement recursion), let processing, and some smaller issues. We realize that those who do not use these constructs could already benefit from the new evaluator.
Getting early exposure and feedback will ultimately, however, speed up the rollout.
Today, we will enable the new evaluator as an experiment for the cue binary as part of v0.9.0-alpha.2. At a later point, we also plan on enabling it in the API.
The new evaluator has various known bugs that have not yet been fixed. On the flip side, it has more performance improvements than we initially planned to release.
We decided in favor of this tradeoff because, on the one hand, many users will not run into the remaining bugs and could benefit from the performance gains now, while on the other hand, some of the performance improvements will actually help address some of the remaining missing functionality that gives rise to these bugs.
Also, by enabling more users to benefit from performance gains sooner, we hope to also get more feedback on the remaining issues.
So in spite of the remaining bugs, we suggest you give the new evaluator a whirl. In our experience, it is already significantly faster than the old evaluator due to some significant big-O wins.
To enable it on the command line, use the evalv3 option via CUE_EXPERIMENT.
For instance:
CUE_EXPERIMENT=evalv3 cue export
Together with the CUE_STATS_FILE=stats.json option, you can compare the old and new evaluators in terms of the number of core operations. Note, though, that the counts are not entirely comparable: the new evaluator overcounts conjuncts, while the old evaluator overcounts disjuncts.
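For example, to collect stats for each evaluator side by side (the stats file names here are just illustrative):
CUE_STATS_FILE=stats-v2.json cue export
CUE_EXPERIMENT=evalv3 CUE_STATS_FILE=stats-v3.json cue export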
Not all of the core constructs have been ported from the old evaluator to the new one. A noteworthy one is structural cycles. We have seen that the new algorithm gives rise to a simpler and more effective structural cycle handling. Structure sharing plays a role in that.
Instead of porting the rather complex old mechanism to the new evaluator, we have decided to implement an approximation instead for now. Together with structure sharing, this already covers most of the cases. But there are cases where structural cycles are either detected too late or spuriously.
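As a minimal illustration of what a structural cycle is (a textbook example, not a specific failing case from our tests):
// a unifies with {b: a}, which would expand to the infinitely deep
// structure {b: {b: {b: ...}}}; the evaluator has to detect this as a
// structural cycle rather than recurse forever.
a: b: a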
There are known issues with handling of let expressions in comprehensions.
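For illustration, this is the kind of construct involved; the example below is hypothetical rather than a known failing case:
items: [1, 2, 3]
// a let clause inside a list comprehension; out evaluates to [2, 4, 6]
out: [for x in items let y = x*2 {y}]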
There are some remaining issues in closedness detection. It is known where these come from; we just did not want to hold up the release on these issues, as we expect they will not affect most users. This issue may manifest itself as disjuncts that should be eliminated because of closedness constraints but are not.
The new evaluator does not yet report all the locations relevant to an error that were available in the old evaluator. As soon as we are at feature parity, we plan to give error reporting a thorough overhaul, fixing this and more.
The new evaluator is significantly faster on many fronts. In addition to a much more capable disjunction elimination algorithm, it has a partial implementation of structure sharing, which has proven to give some significant wins.
Both the reimplementation of disjunctions and structure sharing have shown significant performance improvements.
This example is similar to the hypothetical example we showed in the last update. It reflects a real-life version of it.
#Secret: $secret: id: string
#secrets: #Secret | {[string]: #secrets}
out: #secrets & {
	FOO: $secret: id: "100"
	ONE: TWO: THREE: $secret: id: "123"
}
#Secret: $secret: _id: string
#secrets: #Secret | {[string]: #secrets}
out: #secrets & {
	FOO: $secret: _id: "100"
	ONE: TWO: THREE: $secret: _id: "123"
}
Here we saw a reduction in the number of operations from
Unifications: 740771
Conjuncts: 3143640
Disjuncts: 254254
to
Unifications: 449
Conjuncts: 22643
Disjuncts: 4560
An example of the kind of improvements brought by structure sharing is this code:
f1: string
f2: f1
f3: f2
f4: f3
f5: f4
f6: f5
…
f997: f996
f998: f997
f999: f998
f1000: f999
Here we reduce the number of conjunct operations from 500501 to 2001, essentially taking this case from O(n^2) to O(n): without structure sharing the conjunct count grows roughly as the quadratic sum 1 + 2 + ... + n (500500 for n = 1000), whereas with sharing it grows roughly linearly in n.
Similarly, a reduced version of the Billion Laughs attack, shown here
f: [g, g, g, g, g, g, g]
g: [h, h, h, h, h, h, h]
h: [i, i, i, i, i, i, i]
i: [j, j, j, j, j, j, j]
j: 1
reduces the number of operations from
Unifications: 3268
Conjuncts: 6529
to
Unifications: 34
Conjuncts: 95
Here the new evaluator is O(n) in the number of input nodes!
Finally, the improvements of the new disjunction algorithm can also compound with structure sharing. This is an example where we have recursion combined with disjunctions, a common way to get into serious performance issues with the old evaluator.
A: #Task
B: #steps: #Script & {mount: [A]}
C: #steps: #Script & {mount: [B]}
#Script: {mount: [...#Task]}
#Task: {
	#ref
	#ref
	_ | {}
	_ | {}
	#steps: #Script
	...
}
#ref: {a: 1} | {b: 2}
The old evaluator clocked in at the following number of operations:
Unifications: 18724
Conjuncts: 100730
Disjuncts: 5373
The new evaluator, with its new disjunction algorithm, only needed
Unifications: 2716
Conjuncts: 48270
Disjuncts: 3272
A reasonable improvement, especially since the new evaluator counts conjuncts more aggressively than the old one.
Adding structure sharing into the mix, however, further reduces this to
Unifications: 53
Conjuncts: 431
Disjuncts: 40
Now we're talking!
There are a few cases where it may be slower than the old evaluator:
The caching of the old evaluator has not been fully ported, and thus in some cases, the new evaluator may be slower.
The constant overhead per node is much higher for the new evaluator. Basically, unlike with the old evaluator, we have not yet done any memory allocation optimizations. Also, it has some significant overhead for gathering debugging information.
The disjunction deduplication algorithm still needs tuning. There is a good chance we still miss opportunities to discard paths in disjunction processing early on. If you have any disjunction-related performance issues, please let us know so that we can further tune this algorithm.
The new evaluator does not yet address all performance issues. Note that the old evaluator does not address these either, so there is no benefit to sticking with the old evaluator here. We mainly want to point out what is still in store. While by no means exhaustive, here are some examples:
Discriminator fields for disjunctions. There is still a lot of gain to be had in making disjunctions run in linear time in many cases (see the sketch below).
Fine-grained structure sharing. We know that this will be critical for configurations with a lot of chained references that are more complex than what is covered by simple structure sharing. Especially tools/flow applications may suffer from this. We had a prototype for the old evaluator that showed large improvements on this front, but due to limitations of the old evaluator we were not able to release it. The new evaluator makes this more feasible, and this is one of the higher-priority performance gains to roll out.
Merging of very large structs. Merging structs is still O(n^2) in the number of fields. This is with a smallish constant overhead, but it may matter for very large structs. We know that this can be done in O(n) time. We plan to implement an O(n) algorithm along with the design of a more principled approach to field ordering.
Memory required for closedness computation is not reclaimed and thus is still expensive. The new evaluator is designed to be able to reclaim and reuse this memory in small, compact chunks, but this has not yet been implemented. Also the tree depth of this algorithm can still be significantly reduced.
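To sketch the discriminator idea mentioned in the first item above (the definitions here are made up for illustration): when each disjunct carries a distinct value for a field such as kind, a single conjunct can in principle select the matching disjunct directly instead of fully evaluating every branch.
#Deployment: {kind: "Deployment", replicas: int}
#Service:    {kind: "Service", port: int}
#Resource: #Deployment | #Service

// kind: "Service" rules out #Deployment immediately, so only the
// #Service disjunct needs to be evaluated in full.
x: #Resource & {kind: "Service", port: 8080}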
We would love for you to give it a spin and send us feedback. The sooner we know about issues, the sooner we can fix them. The new evaluator is not only faster but, we have found, also quite a bit easier to debug. So do not hesitate to reach out and let us help get you up and running faster.
Since the last release, we have fixed quite a few more bugs. Based on popular demand, however, we have taken a little diversion from fixing bugs to work on integrating the new evaluator into the API.
Hooking up the new evaluator to the API may seem like a small step, but it is actually quite involved. The evaluator itself is quite low level. The API provides a layer on top that interprets the low-level evaluation results and presents them to the user in a more convenient manner. It is actually good we have it, as it allows us to reimplement the evaluator without users having to change their API usage!
There are many subtle and less subtle ways in which the new evaluator changes the internal representation. The API needs to take all these changes into account and now handle both representations.
Luckily the API itself has many tests separate from the core evaluator. We are now applying the main testing framework we developed for the evaluator to track deviations in the API as well, where possible, and coming up with other solutions where this is not the case. You can track the progress at issue #3060.
We have already fixed many of the API issues, and work on this is progressing nicely. It even resulted in some minor bug fixes in the core evaluator itself.
After the biggest bugs have been removed from the API support, we plan to release an option to enable the new evaluator.
After this is done, we will resume fixing the outstanding larger bugs in the evaluator. For instance, let evaluations can still be rather buggy.
As usual, we would love to hear your feedback. Please raise issues or get in touch with us via Slack.
The upcoming v0.9.0-alpha.5 release is another big step forward for the new evaluator.
We have fixed some of the major bugs that were outstanding. Based on popular demand, however, our main focus has been to integrate the new evaluator in the Go API. This is not as trivial as it sounds. So far we have rolled out about 40 CLs to accomplish this. The main reason for this is that the underlying data representation of the new evaluator differs from the old evaluator in some important aspects. This is hidden from the user in the API, but the API had to be adjusted to accommodate this.
So far we have squashed most of the couple of dozen regressions that originated from this move and have only two relevant regressions left. Some of the remaining issues include:
Value.Unify: recursive closedness is not always handled correctly.
tools/trim: mostly does not yet work with the new evaluator.
Things like dependency analysis, tools/flow, and various other packages fully pass all tests with the new evaluator.
The new evaluator can be enabled in the Go API by passing options to cue/cuecontext:
ctx := cuecontext.New(cuecontext.EvaluatorVersion(cuecontext.EvalV3))
...
ctx.CompileString("a: 1")
Using EvalV3 will ensure the version is fixed once the new evaluator stabilizes. Use EvalExperiment to always use the latest evaluator experiment.
There is also an option to pass debug flags. This takes the same form as the flags supported by the environment variables. The most notable option is the one to turn off structure sharing, which is enabled by default. If you encounter a bug that you suspect may be related to structure sharing, it can be turned off using the CUE_DEBUG option:
ctx := cuecontext.New(
cuecontext.EvaluatorVersion(cuecontext.EvalV3),
cuecontext.CUE_DEBUG("sharing=false"),
)
...
ctx.CompileString("a: 1")
Note that for the cue command the same can be accomplished by setting the CUE_DEBUG environment variable to sharing=false. Environment variables do not affect API usage: all options need to be set explicitly in the Go code. To use the existing debug environment variables, use cuecontext.CUE_DEBUG(os.Getenv("CUE_DEBUG")).
With respect to progress on the core evaluator itself, there are now no skipped tests left, and just about a dozen high-priority regressions remain. Now that all tests are running, we can see an 83% reduction in the time it takes to run the same tests with the new evaluator compared to the old one. Of course this is just a snapshot, but it gives some indication of what to expect: the new evaluator will be much faster in some cases and roughly equal in others. Keep in mind that we still have a lot of optimizations in store, some quite large as well.
Since the last update, we have been responding to user bugs as well as tackling the remaining known bugs. There have also been a couple of weeks' worth of work on other things, including a new embed proposal, and some investigations into other blockers which, not entirely coincidentally, also impact the new evaluator.
We have removed most bugs from the API support for the new evaluator. There are a few left. The biggest omission is cue trim. The old trim algorithm was designed entirely around the old evaluator and its data structures. Consequently, as the new evaluator often uses different data structures, trim completely stopped working with the new evaluator. Matthew Sackman is now looking into a redesign of the trim algorithm, addressing a whole slew of bugs, while also making it work with the new evaluator.
Another aspect of the new evaluator that has been greatly improved is subsumption. Some of the improvements and bug fixes also made it into the old evaluator, but for the most part these improvements are only supported by the new evaluator. They will be crucial for both optimization and defining backwards compatibility for CUE schemas.
Finally, the number of bugs in the core evaluator has been steadily decreasing as well. Moreover, we are cleaning up the structural cycle detection algorithm of the new evaluator to handle more cases in a simpler fashion. This was one of the major omissions of the new implementation and appears to be a factor in many of the remaining bugs.
We aim for the new evaluator to be usable for most users by the end of August.
After returning from our summer break, we have picked up work on the evaluator again. Since our last update, most of the new evaluator work has gone into addressing user-reported bugs. We have fixed a myriad of performance bugs, semantic bugs, and panics. @cuematthew has already been helpful and continues to ramp up on the internals of the evaluator.
Another source of v3-related bugs was the work related to JSON Schema. We are doing a big overhaul of CUE's JSON Schema support to significantly improve compatibility and coverage. We are ensuring that this new technology also works with v3, which has also provided a nice stress test for the v3 evaluator.
Other work has been focused on reducing the number of closeContext allocations and getting the closedness algorithm right. We have not yet enabled lifetime management, which means that v3 still does an exorbitant number of allocations. We are using this to expose bad performance bugs as much as possible. Our intention is still to get v3 to be much faster than v2 before we add such optimizations.
Another noteworthy development is an internal "TinyCUE". Matthew has been developing a parallel, simpler reference implementation of CUE in order to improve his overall understanding of CUE evaluation. We are investigating whether to keep maintaining this implementation to help ensure quality in the future.
We were hoping to be able to switch the default over to v3 by now. A few things are stopping us from doing so. One big discrepancy is field order: the difference between v3 and v2 is too inconsistent. We want to address this by having a better-defined field order and making field order part of the spec. We will have an offsite in mid-September to sit together and speed through some of these remaining issues.
That said, we have seen several users increase their use of v3, and we are ramping up usage internally to get it to a point where it can be made the default as quickly as possible.
It has been a while since the last update. We have continued our focus on user-reported bugs. The JSON Schema work has also exposed more bugs, which we have addressed. Since last time, we have closed close to 50 issues related to the new evaluator. Part of this was designing a new, simplified cycle algorithm, which is considerably simpler than the old one.
New since last time is the focus on Unity. We are working to remove discrepancies in Unity. So if you want some assurance the new evaluator will work with your code, be sure to add your repo to Unity!
There are about 10 main issues left in making Unity compatible. One of them is actually a closedness bug fix that causes too much backwards incompatibility. Consider this example:
#Def: a: 1
x: (#Def & {b: 2}).b
The old evaluator would allow this, even though #Def does not allow field b. The new evaluator fixed this bug but, unfortunately, it causes a lot of configurations to break. Our intention is to reintroduce the bug to simplify the transition to the new evaluator and fix this at a later point.
With respect to outstanding bugs, the new evaluator is now almost on par with the old one. We are considering flipping the default as soon as Unity passes and some known remaining performance issues are fixed.
With respect to Unity, @cuematthew has completed a first round of the field ordering work. The aim is to make the differences between V2 and V3 easier to compare.
Aside from the remaining performance issues, we have also made some headway with fine-grained structure sharing. This has the potential to greatly improve the performance of tools/flow as well as any configuration that has longish chains of references. We will not focus on fixing the memory management before making the switch: we expect that the outstanding algorithmic improvements will have a much bigger impact on performance than addressing the memory allocations. Of course we will still address this in the future.
Performance is a key goal of the CUE project. Broadly, @mpvl envisages being able to achieve performance improvements of several orders of magnitude (predominantly by means of reductions in time complexity).
Performance is also a key requirement of developer tooling, for example the CUE LSP.
This umbrella issue exists to summarize the main categories of current performance problems that people run into. It will link to sub-issues that capture more specific detail on well-identified categories of performance problems. We will also use this umbrella issue to post high-level updates on performance matters: it will therefore be locked to allow subscribing to updates from @mpvl.
Please subscribe to this issue to receive notifications of those updates.
The sub-issues linked below this umbrella issue will include more detail on each category of performance problem. Each sub-issue will link via tasklists to bug reports of performance issues that are known to suffer from that category of performance problem. High-level updates specific to each category will be posted to the relevant sub-issue for those interested in subscribing.
The main umbrella issue and sub-issues will be part of the Evaluator Roadmap project (see our recent announcement of our new approach to issue/project management). Bug reports linked from sub-issues will not be added to the Evaluator Roadmap project because they are entirely covered by the sub-issue.
The main categories of performance problems are:
For discussion of this issue and all matters related to performance please see https://github.com/cue-lang/cue/discussions/2857.
FAQ
My CUE is slow to evaluate - what should I do? Please raise an issue and we will help to investigate. Part of that investigation will involve working out which of the categories of performance problems affect your code, and linking to your issue from a tasklist in the relevant sub-issue.
Where can we discuss things relating to this umbrella issue?
For discussion about performance issues, please see https://github.com/cue-lang/cue/discussions/2857, or create a new discussion and we will be happy to join the exchange.
Why is my performance bug not part of the Evaluator Project?
Where there are well-identified categories of performance problems, it makes sense to group these as sub-issues under the umbrella performance issue. This helps to keep the project board small, whilst not losing any detail in the original report. The linkage to the original bug reports is managed from sub-issues via tasklists.