DISTINCTCOUNT and percentile doesn't utilize indexes to filter data

meshari343 commented 4 weeks ago

hello, I have noticed that DISTINCTCOUNT and percentile doesn't utilize indexes to filter data.

below are the queries and their plan, based on the quick start airline dataset:

1- DISTINCTCOUNT query: select DISTINCTCOUNT(Cancelled) from airlineStats where Origin = 'ATL'

plan: "resultTable": { "dataSchema": { "columnNames": [ "Operator", "Operator_Id", "Parent_Id" ], "columnDataTypes": [ "STRING", "INT", "INT" ] }, "rows": [ [ "BROKER_REDUCE(limit:10)", 1, 0 ], [ "COMBINE_AGGREGATE", 2, 1 ], [ "PLAN_START(numSegmentsForThisPlan:31)", -1, -1 ], [ "AGGREGATE(aggregations:distinctCount(Cancelled))", 3, 2 ], [ "PROJECT(Cancelled)", 4, 3 ], [ "DOC_ID_SET", 5, 4 ], [ "FILTER_FULL_SCAN(operator:EQ,predicate:Origin = 'ATL')", 6, 5 ] ] }

query response stats: "requestId": "93259971000000019", "brokerId": "Broker_172.17.0.2_8000", "exceptions": [], "numServersQueried": 1, "numServersResponded": 1, "numSegmentsQueried": 31, "numSegmentsProcessed": 31, "numSegmentsMatched": 31, "numConsumingSegmentsQueried": 0, "numConsumingSegmentsProcessed": 0, "numConsumingSegmentsMatched": 0, "numDocsScanned": 629, "numEntriesScannedInFilter": 9746, "numEntriesScannedPostFilter": 629, "numGroupsLimitReached": false, "totalDocs": 9746, "timeUsedMs": 10, "offlineThreadCpuTimeNs": 0, "realtimeThreadCpuTimeNs": 0, "offlineSystemActivitiesCpuTimeNs": 0, "realtimeSystemActivitiesCpuTimeNs": 0, "offlineResponseSerializationCpuTimeNs": 0, "realtimeResponseSerializationCpuTimeNs": 0, "offlineTotalCpuTimeNs": 0, "realtimeTotalCpuTimeNs": 0, "brokerReduceTimeMs": 0, "segmentStatistics": [], "traceInfo": {}, "minConsumingFreshnessTimeMs": 0, "numRowsResultSet": 1, "numSegmentsPrunedByBroker": 0, "numSegmentsPrunedByServer": 0, "numSegmentsPrunedInvalid": 0, "numSegmentsPrunedByLimit": 0, "numSegmentsPrunedByValue": 0, "explainPlanNumEmptyFilterSegments": 0, "explainPlanNumMatchAllFilterSegments": 0

2-percentile query: select percentile(AirTime, 50) from airlineStats where Origin = 'ATL'

plan: "resultTable": { "dataSchema": { "columnNames": [ "Operator", "Operator_Id", "Parent_Id" ], "columnDataTypes": [ "STRING", "INT", "INT" ] }, "rows": [ [ "BROKER_REDUCE(limit:10)", 1, 0 ], [ "COMBINE_AGGREGATE", 2, 1 ], [ "PLAN_START(numSegmentsForThisPlan:31)", -1, -1 ], [ "AGGREGATE(aggregations:percentile(AirTime))", 3, 2 ], [ "PROJECT(AirTime)", 4, 3 ], [ "DOC_ID_SET", 5, 4 ], [ "FILTER_FULL_SCAN(operator:EQ,predicate:Origin = 'ATL')", 6, 5 ] ] }

query response stats: "requestId": "93259971000000018", "brokerId": "Broker_172.17.0.2_8000", "exceptions": [], "numServersQueried": 1, "numServersResponded": 1, "numSegmentsQueried": 31, "numSegmentsProcessed": 31, "numSegmentsMatched": 31, "numConsumingSegmentsQueried": 0, "numConsumingSegmentsProcessed": 0, "numConsumingSegmentsMatched": 0, "numDocsScanned": 629, "numEntriesScannedInFilter": 9746, "numEntriesScannedPostFilter": 629, "numGroupsLimitReached": false, "totalDocs": 9746, "timeUsedMs": 10, "offlineThreadCpuTimeNs": 0, "realtimeThreadCpuTimeNs": 0, "offlineSystemActivitiesCpuTimeNs": 0, "realtimeSystemActivitiesCpuTimeNs": 0, "offlineResponseSerializationCpuTimeNs": 0, "realtimeResponseSerializationCpuTimeNs": 0, "offlineTotalCpuTimeNs": 0, "realtimeTotalCpuTimeNs": 0, "brokerReduceTimeMs": 0, "segmentStatistics": [], "traceInfo": {}, "minConsumingFreshnessTimeMs": 0, "numRowsResultSet": 1, "numSegmentsPrunedByBroker": 0, "numSegmentsPrunedByServer": 0, "numSegmentsPrunedInvalid": 0, "numSegmentsPrunedByLimit": 0, "numSegmentsPrunedByValue": 0, "explainPlanNumEmptyFilterSegments": 0, "explainPlanNumMatchAllFilterSegments": 0

Jackie-Jiang commented 2 weeks ago

Have you configured inverted index on Origin column?

meshari343 commented 2 weeks ago

Yes, if another column that’s not included in ST split had an index it will be utilized, otherwise no matter what index is used on the filter if it’s used in ST no other index will be utilized for distinct count, or percentile with filters.

Jackie-Jiang commented 2 weeks ago

I see. Inverted index cannot be applied on top of star-tree index. The reason of the scanning is that there is not enough doc to reach the maxLeafRecords, thus star-tree stops splitting. You may try setting the maxLeafRecords to 1 so that star-tree will always fully split on all dimensions to avoid scanning

meshari343 commented 2 weeks ago

I did that but it still will do a full_filter_scan.

Is there any index that can be used with star tree index? or will star tree block the usage of any additional index, if it’s I think this is an important fact to document.

Jackie-Jiang commented 1 week ago

Can you share the star-tree index config? There should be no scan if the dimension is properly split. The split dimension is sorted, thus binary search is applied instead of scanning. I noticed that the query is doing DISTINCT_COUNT and PERCENTILE, which are not supported as star-tree aggregate because the size is unbounded

mayankshriv commented 1 week ago

@Jackie-Jiang This seems like a bug we should fix. There is no way to use ST index in this case but it also prevents inv index from being created/used.

meshari343 commented 1 week ago

Hello, just wanted to confirm something, if the filter was used in the ST split, but the aggregation was not specified in ST config and the operation is supported will the ST index be used as a filter or not?

Jackie-Jiang commented 5 days ago

There is no way to use ST index in this case but it also prevents inv index from being created/used.

@mayankshriv Inverted index is being created/used. What is missing is inverted index within the ST index. When ST index is NOT utilized, regular inverted index is picked up.

Hello, just wanted to confirm something, if the filter was used in the ST split, but the aggregation was not specified in ST config and the operation is supported will the ST index be used as a filter or not?

@meshari343 No. ST index can be used only when the aggregation is defined in the ST config.

apache / pinot

DISTINCTCOUNT and percentile doesn't utilize indexes to filter data #13833