Open astefan opened 3 weeks ago
Pinging @elastic/es-analytical-engine (Team:Analytics)
I wrote a CSV test that reproduces the first of these using only a ROW
command to eliminate any data dependency:
```
count distinct bug
ROW salary = 5.2
| STATS cd1=count_distinct(salary, 1000), cd2=count_distinct(salary, 3000 - 1000 + 1000), cd3=count_distinct(salary, 3000)
;

cd1:long|cd2:long|cd3:long
1 |1 |1
;
```
Description
```
from employees | stats cd1=count_distinct(salary, 3000), cd2=count_distinct(salary, 3000 + 1000 - 1000), cd3=count_distinct(salary, 1000)
```
fails with
The problem is visible on the data nodes, which know that the sink they need to write to should have three channels (one for each aggregation). On the coordinator node, however (in the `AggregateMapper`), only two intermediate attributes are created: after constant folding, the first two `count_distinct`s are identical, so they share one attribute. At that point in the code (`AggregateMapper`), unexpectedly(?) "duplicated" aggregations are present. The logical optimizer does have a rule that deduplicates identical aggregates, `ReplaceStatsAggExpressionWithEval`, but it runs before any folding takes place, so `count_distinct(salary, 3000)` and `count_distinct(salary, 3000 + 1000 - 1000)` do not yet look identical to it.

```
from employees | stats m = median(salary_change), p50 = percentile(salary_change, 50), count = count(salary_change)
```
fails with
For this one, the intermediate attributes are deduplicated in the `AggregateMapper`, and `median` and `percentile` are considered identical because `median` is rewritten as `percentile(salary_change, 50)`. At the logical optimizer level, `ReplaceStatsAggExpressionWithEval` again cannot deduplicate the aggregations, because `SubstituteSurrogates` (the rule that performs the `median` → `percentile` rewrite) runs after it.
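By analogy with the first case, the median/percentile failure should also be reproducible with a ROW-only CSV test that removes any data dependency. This is an untested sketch; the test name and the expected values are assumptions (for a single row, both the median and the 50th percentile equal the input value):

```
median percentile dedup bug
ROW salary_change = 1.5
| STATS m = median(salary_change), p50 = percentile(salary_change, 50), count = count(salary_change)
;

m:double|p50:double|count:long
1.5 |1.5 |1
;
```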