Open gyorfimi opened 5 months ago
could you attach the explain plan result for this query.
This is actually a problem b/c the mv column is not exactly semantically "relational algebra" it is a short-handed representation of saying "it should behave as an array (with some extra op) when xxx and should behave as a set (with some extra op) when yyy"
so there's really no convention or compatibility fix.
for this specific query
select id, mvc from (
select id, arraytomv(mv) as mvc from mytable group by id, mvc) embd
where mvc = 'a'
the where clause is not because it is not suppose to be pushed into the inner query, but the question on whether we should applied it as a filter before/after the group-by.
agree we definitely need some redesign to make sure this behavior is clear. even if it means some of the existing syntax wont make sense. for example: in this case, UNNEST must be explicitly specify and mvc should be treated as normal ARRAY type. then the confusion should go away
Multistage engine and multivalue column semantics incompatibilities
In Pinot filtering on a multivalue column with
=
operator actually a contains operator, aggregation to a multivalue column is actually an unnest operation.It works well in 1.0 engine even with filtering the result (multivalue column) in the
HAVING
clause.However, in 2.0 engine, this irregular behavior is not compatible with relation algebra semantics. That's why one should use ArrayToMv function, which tries to break the border between "normal" field semantics and multivalue field semantics.
In most cases it works well, but there are some cases when it doesn't work as expected.
Example:
(assuming that
mvc
is a multivalue column)If there is a row in
mytable
withmvc
containinga
andb
, the result will contain two rows with the sameid
andmvc
=a
andb
respectively (even though thewhere
clause is applied).The main cause of this behavior is that relation algebra transformation pushes down the
where
clause to the inner query like this:select id, mvc from (select id, arraytomv(mv) as mvc from mytable where arraytomv(mv) = 'a' group by id, mvc)
In normal field semantics, this query will return only one row with
mvc
=a
. However, in multivalue field semantics, it will return two rows withmvc
=a
andb
respectively.So a complete rework of multivalue operators is needed. Multivalue is actually a set, not a value, so for example an
UNNEST(...)
operator is suggested (Calcite supports it), and for filtermv = SOME(...)
is suggested (Calcie supports it too).For fast fix of this issue, it should be prevented to push down filter on the multivalue-column group by query to the inner query.