apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0
13.51k stars 3.7k forks source link

useApproximateCountDistinct implicitly makes all aggregates using distinct unplannable #16715

Open kgyrtkirk opened 4 months ago

kgyrtkirk commented 4 months ago

quidem test:

!set useApproximateCountDistinct false
!use druidtest://?numMergeBuffers=3
!set outputformat mysql

select sum(distinct added) from wikipedia;
+---------+
| EXPR$0  |
+---------+
| 6455074 |
+---------+
(1 row)

!ok

!set useApproximateCountDistinct true
!use druidtest://?numMergeBuffers=3
select sum(distinct added) from wikipedia;
[...]
Missing conversion is LogicalAggregate[convention: NONE -> DRUID]
There is 1 empty subset: rel#105:RelSubset#2.DRUID.[], the relevant part of the original plan is as follows
103:LogicalAggregate(group=[{}], EXPR$0=[SUM(DISTINCT $0)])
  101:LogicalProject(subset=[rel#102:RelSubset#1.NONE.[]], added=[$18])
    74:LogicalTableScan(subset=[rel#100:RelSubset#0.NONE.[]], table=[[druid, wikipedia]])
[...]
QueryInterruptedException{msg=Query could not be planned. A possible reason is [Aggregation [SUM(DISTINCT $18)] is not supported2], code=Unknown exception, class=org.apache.druid.error.DruidException, host=null}
    at org.apache.druid.query.QueryInterruptedException.wrapIfNeeded(QueryInterruptedException.java:113)
[...]
kgyrtkirk commented 4 months ago

ideally we should first put a patch on this hole to explain the situation to the user; possibly by: adding a check to DruidSqlValidtor or something like that ; to ensure that when approx is enabled only count aggregates could have isDistinct enabled; and describe to disable approx

AlbericByte commented 4 months ago

@kgyrtkirk does you mean if there is DISTINCT inside COUNT and useApproximateCountDistinct = true,then throw a CalciteContextException in SqlValidatorImpl, and describe the message like:" we should disable approx" the in the CalciteContextException.

is that your expected result?

kgyrtkirk commented 4 months ago

something like that; if we have useApproximateCountDistinct only COUNT(DISTINCT x) will work as expecteded other AGG(DISTINCT ...) will produce the above misleading exception about not able to convert the plan...which will kinda leave the user without much clue :D

I think fix could be to rewrite the count(distinct) to the sketch functions explicitly at compilation time and leave the distinct conversion rules enabled at all time

but giving a better error with a hint could give a chance to the users bumping into this to use the feature which is already present :)

AlbericByte commented 4 months ago

@kgyrtkirk next time you can assign the task me, but this is good to learn as well.

kgyrtkirk commented 4 months ago

@AlbericByte: sorry, I haven't seen a PR for it / further comments....next time I'll ask you about it.

right now that PR only restrict it for window functions

However it turned out that the situation is a bit more complicated than I was expecting when I've filed this issue - we can't just ban distinct usage for non-count(distinct) cases - as there are aggregators like string_agg which could do the distinct aggregation by themselves while translating to the native later...

We were talking about introducing some annotation markers to enable the compiler to verify this and restrict it that way.

But I'll reach out to you next time to avoid such a misunderstanding(s) - thank you for letting me know that you was not expecting the above!