elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

ESQL: Refactor STATS substitution optimizer rules #110345

Open alex-spies opened 3 months ago

alex-spies commented 3 months ago

In the substitutions batch of our LogicalPlanOptimizer, there are 4 rules that take an expression like | STATS foo = avg(x*x) + 2 and turn it into a simple aggregation with enclosing EVALs; in this example, the result is (essentially)

| EVAL $$x = x*x
| STATS $$foo_sum = sum($$x), $$foo_count = count($$x)
| EVAL $$foo = $$foo_sum/$$foo_count, foo = $$foo + 2
| KEEP foo
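As a sanity check on the rewrite above, here is a minimal Python sketch that evaluates both forms on a few rows and compares the results. Plain Python lists stand in for ES|QL; the function names `original_plan` and `rewritten_plan` are made up for illustration, and the toy ignores nulls, multi-values and grouping.

```python
# Toy check that the rewritten plan computes the same value as the original
# STATS expression. Nothing here is Elasticsearch code; it only mirrors the
# four ES|QL steps shown above, one comment per step.

def original_plan(rows):
    """| STATS foo = avg(x*x) + 2"""
    squares = [row["x"] * row["x"] for row in rows]
    return {"foo": sum(squares) / len(squares) + 2}


def rewritten_plan(rows):
    # | EVAL $$x = x*x
    rows = [{**row, "$$x": row["x"] * row["x"]} for row in rows]
    # | STATS $$foo_sum = sum($$x), $$foo_count = count($$x)
    stats = {
        "$$foo_sum": sum(row["$$x"] for row in rows),
        "$$foo_count": len(rows),
    }
    # | EVAL $$foo = $$foo_sum/$$foo_count, foo = $$foo + 2
    stats["$$foo"] = stats["$$foo_sum"] / stats["$$foo_count"]
    stats["foo"] = stats["$$foo"] + 2
    # | KEEP foo
    return {"foo": stats["foo"]}


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
assert original_plan(rows) == rewritten_plan(rows)
```

Only `foo` survives the final KEEP, so the synthetic `$$` columns never leak into the result.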

This is becoming complicated and harder to reason about because the substitutions are spread over 4 rules; let's see if we can make do with just 2 rules.

More specifically,

  1. ReplaceStatsNestedExpressionWithEval turns | STATS foo = avg(x*x) + 2 into | EVAL $$x = x*x | STATS foo = avg($$x) + 2.
  2. ReplaceStatsAggExpressionWithEval then turns | STATS foo = avg($$x) + 2 into | STATS $$foo = avg($$x) | EVAL foo = $$foo + 2.
  3. SubstituteSurrogates replaces | STATS $$foo = avg($$x) with | STATS $$foo_sum = sum($$x), $$foo_count = count($$x) | EVAL $$foo = $$foo_sum/$$foo_count.
  4. Then we run ReplaceStatsNestedExpressionWithEval again to account for changes made by TranslateMetricsAggregate.
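To make the interplay of steps 1 to 3 easier to follow, here is a rough Python sketch of the three rewrites on a toy expression tree (nested tuples). It is not the Java implementation: the function names, the synthetic column names (`$$nested0`, `$$agg0`) and the plan representation are all assumptions made for illustration, and step 4 (the re-run after TranslateMetricsAggregate) is not modelled.

```python
# Toy model: expressions are nested tuples such as ("+", ("avg", "x"), 2);
# a plan is a list of (command, payload) pairs. All names are hypothetical.
AGGS = {"avg", "sum", "count"}


def is_agg(expr):
    return isinstance(expr, tuple) and expr[0] in AGGS


def rewrite(expr, fn):
    """Top-down rewrite: use fn's replacement if it returns one, else recurse."""
    replaced = fn(expr)
    if replaced is not None:
        return replaced
    if isinstance(expr, tuple):
        return (expr[0],) + tuple(rewrite(arg, fn) for arg in expr[1:])
    return expr


def pull_nested_into_eval(stats):
    """Step 1 sketch: avg(x*x) becomes EVAL $$n = x*x before STATS avg($$n)."""
    evals = {}

    def fn(expr):
        if is_agg(expr) and isinstance(expr[1], tuple):
            name = f"$$nested{len(evals)}"
            evals[name] = expr[1]
            return (expr[0], name)
        return None

    return evals, {name: rewrite(expr, fn) for name, expr in stats.items()}


def push_agg_expression_into_eval(stats):
    """Step 2 sketch: avg(a) + 2 becomes STATS $$s = avg(a), then EVAL $$s + 2."""
    new_stats, evals, synthetic = {}, {}, {}

    def fn(expr):
        if is_agg(expr):
            if expr not in synthetic:
                synthetic[expr] = f"$$agg{len(synthetic)}"
                new_stats[synthetic[expr]] = expr
            return synthetic[expr]
        return None

    for name, expr in stats.items():
        if is_agg(expr):
            new_stats[name] = expr  # already a plain aggregation
        else:
            evals[name] = rewrite(expr, fn)
    return new_stats, evals


def substitute_surrogates(stats):
    """Step 3 sketch: avg(a) becomes sum(a) and count(a), divided in an EVAL."""
    new_stats, evals = {}, {}
    for name, expr in stats.items():
        if is_agg(expr) and expr[0] == "avg":
            new_stats[f"{name}_sum"] = ("sum", expr[1])
            new_stats[f"{name}_count"] = ("count", expr[1])
            evals[name] = ("/", f"{name}_sum", f"{name}_count")
        else:
            new_stats[name] = expr
    return new_stats, evals


def optimize(stats):
    """Chain the three sketches and assemble EVAL | STATS | EVAL | KEEP."""
    outputs = list(stats)
    before, stats = pull_nested_into_eval(stats)
    stats, after_agg = push_agg_expression_into_eval(stats)
    stats, after_surrogate = substitute_surrogates(stats)
    plan = [("EVAL", before)] if before else []
    plan.append(("STATS", stats))
    after = {**after_surrogate, **after_agg}  # surrogate EVALs must come first
    if after:
        plan.append(("EVAL", after))
    plan.append(("KEEP", outputs))
    return plan


plan = optimize({"foo": ("+", ("avg", ("*", "x", "x")), 2)})
for command, payload in plan:
    print(command, payload)
```

Even in this toy, the EVAL after the aggregation is assembled from two different rules, and the order in which their outputs are merged matters, which is the kind of coupling the issue describes.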

It makes sense to have one rule that pulls nested expressions out of agg functions into an EVAL before the aggregation (ReplaceStatsNestedExpressionWithEval) and one that creates EVALs after the aggregation (ReplaceStatsAggExpressionWithEval).

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)

astefan commented 2 months ago

What SubstituteSurrogates does now is OK, but considering https://github.com/elastic/elasticsearch/issues/100634 this rule should be executed multiple times instead. PropagateEvalFoldables (when enabled for aggregates as well) should cover cases where the foldable expression is not inside the aggregate, for example eval x = [5,6,7] | stats max(x). However, by the time PropagateEvalFoldables is executed, the SubstituteSurrogates rule is no longer executed.