apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.5k stars 1.29k forks source link

DISTINCTCOUNTMV returns incorrect value when field is included in select/group by #12429

Open aw381246 opened 8 months ago

aw381246 commented 8 months ago

If the field passed into the distinctcountmv function is included in the list of select / group by fields, it will return incorrect counts.

In the case below, the result for each row should be a distinctcount = 1 instead of 2.

If the array has n number of items, the distinctcount will return n instead of 1

image image

Jackie-Jiang commented 7 months ago

This is similar to #12230, and this behavior is expected because of how pinot execute queries. We have built VALUE_IN to work around this problem

aw381246 commented 7 months ago

@Jackie-Jiang, if I understand VALUE_IN, I could specify a specific "host" in the query above, and I'd only get the first row, and the distinct count would = 1 right? But that doesn't solve the problem of returning every host with the correct distinctcountmv? In other words, if "1.27.12.151" is in an MV field with either 0 or 100 other hosts, the distinctcountmv should return 1 either way.

Jackie-Jiang commented 7 months ago

I don't fully follow. What do you want to achieve with the query?

aw381246 commented 7 months ago

If you group by the same column that is in the distinctcount, every row in the result should have a distinctcount of 1. The only reason pinot is returning a value greater than 1 is because the column is an MV column that has 2 values in this case. Pinot unnests the MV column when you group by it, but the distinctcount seems to be happening before the unnest instead of after the unnest.

Here's a simple example I created in SQL Server: image

Jackie-Jiang commented 7 months ago

Pinot MV semantic works as following: