Open meshari343 opened 4 weeks ago
Have you configured inverted index on Origin
column?
Yes, if another column that’s not included in ST split had an index it will be utilized, otherwise no matter what index is used on the filter if it’s used in ST no other index will be utilized for distinct count, or percentile with filters.
I see. Inverted index cannot be applied on top of star-tree index. The reason of the scanning is that there is not enough doc to reach the maxLeafRecords
, thus star-tree stops splitting. You may try setting the maxLeafRecords
to 1 so that star-tree will always fully split on all dimensions to avoid scanning
I did that but it still will do a full_filter_scan.
Is there any index that can be used with star tree index? or will star tree block the usage of any additional index, if it’s I think this is an important fact to document.
Can you share the star-tree index config? There should be no scan if the dimension is properly split. The split dimension is sorted, thus binary search is applied instead of scanning.
I noticed that the query is doing DISTINCT_COUNT
and PERCENTILE
, which are not supported as star-tree aggregate because the size is unbounded
@Jackie-Jiang This seems like a bug we should fix. There is no way to use ST index in this case but it also prevents inv index from being created/used.
Hello, just wanted to confirm something, if the filter was used in the ST split, but the aggregation was not specified in ST config and the operation is supported will the ST index be used as a filter or not?
There is no way to use ST index in this case but it also prevents inv index from being created/used.
@mayankshriv Inverted index is being created/used. What is missing is inverted index within the ST index. When ST index is NOT utilized, regular inverted index is picked up.
Hello, just wanted to confirm something, if the filter was used in the ST split, but the aggregation was not specified in ST config and the operation is supported will the ST index be used as a filter or not?
@meshari343 No. ST index can be used only when the aggregation is defined in the ST config.
hello, I have noticed that DISTINCTCOUNT and percentile doesn't utilize indexes to filter data.
below are the queries and their plan, based on the quick start airline dataset:
1- DISTINCTCOUNT query: select DISTINCTCOUNT(Cancelled) from airlineStats where Origin = 'ATL'
plan: "resultTable": { "dataSchema": { "columnNames": [ "Operator", "Operator_Id", "Parent_Id" ], "columnDataTypes": [ "STRING", "INT", "INT" ] }, "rows": [ [ "BROKER_REDUCE(limit:10)", 1, 0 ], [ "COMBINE_AGGREGATE", 2, 1 ], [ "PLAN_START(numSegmentsForThisPlan:31)", -1, -1 ], [ "AGGREGATE(aggregations:distinctCount(Cancelled))", 3, 2 ], [ "PROJECT(Cancelled)", 4, 3 ], [ "DOC_ID_SET", 5, 4 ], [ "FILTER_FULL_SCAN(operator:EQ,predicate:Origin = 'ATL')", 6, 5 ] ] }
query response stats: "requestId": "93259971000000019", "brokerId": "Broker_172.17.0.2_8000", "exceptions": [], "numServersQueried": 1, "numServersResponded": 1, "numSegmentsQueried": 31, "numSegmentsProcessed": 31, "numSegmentsMatched": 31, "numConsumingSegmentsQueried": 0, "numConsumingSegmentsProcessed": 0, "numConsumingSegmentsMatched": 0, "numDocsScanned": 629, "numEntriesScannedInFilter": 9746, "numEntriesScannedPostFilter": 629, "numGroupsLimitReached": false, "totalDocs": 9746, "timeUsedMs": 10, "offlineThreadCpuTimeNs": 0, "realtimeThreadCpuTimeNs": 0, "offlineSystemActivitiesCpuTimeNs": 0, "realtimeSystemActivitiesCpuTimeNs": 0, "offlineResponseSerializationCpuTimeNs": 0, "realtimeResponseSerializationCpuTimeNs": 0, "offlineTotalCpuTimeNs": 0, "realtimeTotalCpuTimeNs": 0, "brokerReduceTimeMs": 0, "segmentStatistics": [], "traceInfo": {}, "minConsumingFreshnessTimeMs": 0, "numRowsResultSet": 1, "numSegmentsPrunedByBroker": 0, "numSegmentsPrunedByServer": 0, "numSegmentsPrunedInvalid": 0, "numSegmentsPrunedByLimit": 0, "numSegmentsPrunedByValue": 0, "explainPlanNumEmptyFilterSegments": 0, "explainPlanNumMatchAllFilterSegments": 0
2-percentile query: select percentile(AirTime, 50) from airlineStats where Origin = 'ATL'
plan: "resultTable": { "dataSchema": { "columnNames": [ "Operator", "Operator_Id", "Parent_Id" ], "columnDataTypes": [ "STRING", "INT", "INT" ] }, "rows": [ [ "BROKER_REDUCE(limit:10)", 1, 0 ], [ "COMBINE_AGGREGATE", 2, 1 ], [ "PLAN_START(numSegmentsForThisPlan:31)", -1, -1 ], [ "AGGREGATE(aggregations:percentile(AirTime))", 3, 2 ], [ "PROJECT(AirTime)", 4, 3 ], [ "DOC_ID_SET", 5, 4 ], [ "FILTER_FULL_SCAN(operator:EQ,predicate:Origin = 'ATL')", 6, 5 ] ] }
query response stats: "requestId": "93259971000000018", "brokerId": "Broker_172.17.0.2_8000", "exceptions": [], "numServersQueried": 1, "numServersResponded": 1, "numSegmentsQueried": 31, "numSegmentsProcessed": 31, "numSegmentsMatched": 31, "numConsumingSegmentsQueried": 0, "numConsumingSegmentsProcessed": 0, "numConsumingSegmentsMatched": 0, "numDocsScanned": 629, "numEntriesScannedInFilter": 9746, "numEntriesScannedPostFilter": 629, "numGroupsLimitReached": false, "totalDocs": 9746, "timeUsedMs": 10, "offlineThreadCpuTimeNs": 0, "realtimeThreadCpuTimeNs": 0, "offlineSystemActivitiesCpuTimeNs": 0, "realtimeSystemActivitiesCpuTimeNs": 0, "offlineResponseSerializationCpuTimeNs": 0, "realtimeResponseSerializationCpuTimeNs": 0, "offlineTotalCpuTimeNs": 0, "realtimeTotalCpuTimeNs": 0, "brokerReduceTimeMs": 0, "segmentStatistics": [], "traceInfo": {}, "minConsumingFreshnessTimeMs": 0, "numRowsResultSet": 1, "numSegmentsPrunedByBroker": 0, "numSegmentsPrunedByServer": 0, "numSegmentsPrunedInvalid": 0, "numSegmentsPrunedByLimit": 0, "numSegmentsPrunedByValue": 0, "explainPlanNumEmptyFilterSegments": 0, "explainPlanNumMatchAllFilterSegments": 0