apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.42k stars 1.27k forks source link

PERCENTILEEST throws NPE in MultiStage Engine #13305

Closed MeihanLi closed 2 months ago

MeihanLi commented 4 months ago

we found in Pinot 1.1, PERCENTILEEST throws NPE in MultiStage Engine. Does anyone know if this is already fixed in master?

example query:

WITH sub AS (select deviceOS, count(1) as total_count from userAttributes group by 1 limit 100000),
subx AS (select deviceOS, count(1) as total_count from userAttributes group by 1 limit 100000),
union_table AS (select deviceOS, total_count from sub UNION ALL select deviceOS, total_count from subx)
select deviceOS, percentile(total_count, 50) from union_table group by deviceOS 
image
Jackie-Jiang commented 4 months ago

Yes, it should be fixed within #13282

ankitsultana commented 3 months ago

Issue still exists. I am looking into it.

ankitsultana commented 3 months ago

@Jackie-Jiang : looks like current master still can't handle aggregations with literals in some cases. Repro is below.

I took a quick look and see the following issues:

I see a bunch of attempts at fixing this: #13282, #13217 and #11105.

I can take a look at this as well but wondering if you folks are already working on this and have any learnings to share. One potential solution I wanted to play with was:

Identify literals that Calcite converts to a ref and remove them. Wherever the refs are used, we should replace it with a literal. This should be done before exchange and other optimizations are run.

SET useMultistageEngine=true;

with
  teamOne as (
    select playerName, teamID, sum(runs) as sum_of_runs from baseballStats where teamID = 'SFN' group by playerName, teamID
  ),
  teamTwo as (
    select playerName, teamID, sum(runs) as sum_of_runs from baseballStats where teamID = 'CHN' group by playerName, teamID
  ),
  all as (
    select playerName, teamID, sum_of_runs from teamOne union all select playerName, teamID, sum_of_runs from teamTwo
  )
  select 
    teamID, 
    -- sum(sum_of_runs) 
    percentile(sum_of_runs, 50) 
  from all
  group by teamID

You can also try adding this hint: /*+ aggOptions(is_skip_leaf_stage_group_by='true') */ which will still throw an error because the rexliteral gets converted to a ref by Calcite (before optimization).

Jackie-Jiang commented 3 months ago

@ankitsultana Take a look at PinotAggregateExchangeNodeInsertRule, where we should already replace all ref to literal to literal. Maybe there are some corner cases not handled (e.g. the ref is not in a project)