Open superhawk610 opened 2 years ago
I'm facing this problem as well, and I'm not being able to find a work around it (using SQL). By doing a query that is like the following:
SELECT
LOOKUP(id, 'lookup_name') AS foo,
GROUPING(LOOKUP(id, 'lookup_name')) AS __grouping__,
COUNT(*) AS "count"
FROM
datasource
GROUP BY GROUPING SET((LOOKUP(id, 'lookup_name')), ())
You get all the resulting rows with 1
as grouping, which makes it impossible to identify the different grouping sets.
TL;DR: For native queries, this can be solved by defining the dimensions that are not of type default as virtual columns, and treating them as regular dimensions. For SQL I wasn't able to find a direct solution.
I got deeper into the problem and found the following:
The GROUPING
aggregator creates the grouping key here. This keyDimensions
set is created here upon aggregation to check against the groupings
. This is created by extracting the dimensions with the getDimension
function. For lookup dimensions, for example, this refferences the dimension being passed into the lookup. As stated on the comments on that code, using outputName
at the moment of aggregation is not possible, because aggregators are not aware of the output name of the output column names:
// KeyDimensionNames are the input column names of dimensions. Its required since aggregators are not aware of the
// output column names.
// As we exclude certain dimensions from the result row, the value for any grouping_id aggregators have to change
// to reflect the new grouping dimensions, that aggregation is being done upon. We will mark the indices which have
// grouping aggregators and update the value for each row at those indices.
Why does it fail? Because the values inside the keyDimensions
set do not correspond to the values that are added to the groupings
list on the grouping aggregator. These only coincide for actual dimensions of the table, not transformations.
Nevertheless, there is a way out of the rabbit hole. As shown here, the way that the GROUPING
SQL function is converted into a Native query, is by extracting the dimension or virtual column that is being grouped. As shown in the code:
if (expression.isDirectColumnAccess()) {
return expression.getDirectColumn();
}
String virtualColumn = virtualColumnRegistry.getOrCreateVirtualColumnForExpression(
expression,
node.getType()
);
return virtualColumn;
So it would seem that there would be no problem if one would be using virtual columns. Running an EXPLAIN PLAN FOR
the query in the previous comment, shows that that SQL query is converted (roughly) to the following:
{
"queryType": "groupBy",
"dataSource": {
"type": "table",
"name": "datasource"
},
"intervals": {
"type": "intervals",
"intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"]
},
"granularity": {
"type": "all"
},
"dimensions": [
{
"type": "extraction",
"dimension": "id",
"outputName": "foo",
"outputType": "STRING",
"extractionFn": {
"type": "registeredLookup",
"lookup": "lookup"
}
}
],
"aggregations": [
{
"type": "grouping",
"name": "__grouping__",
"groupings": [
"v0"
]
},
{
"type": "count",
"name": "count"
}
],
"postAggregations": [],
"subtotalsSpec": [
[
"foo"
],
[]
]
}
Note that on the GROUPING
aggregator, the query is referring to a v0
virtual column that is not added to the query. This is created on the virtualColumnRegistry.getOrCreateVirtualColumnForExpression
on the snippet above, but for some reason that virtual column is not effectively added to the query. Later, when checking wether the corresponding dimension is being grouped on aggregator resolver, the v0
virtual column is not found and that's why we get a 1
instead of a 0
for those cases. Nevertheless, if we tweek that query to include the virtual column and address it as such, then the function works as intended.
{
"queryType": "groupBy",
"dataSource": {
"type": "table",
"name": "datasource"
},
"intervals": {
"type": "intervals",
"intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"]
},
"granularity": {
"type": "all"
},
"virtualColumns": [
{
"type": "expression",
"name": "foo",
"expression": "LOOKUP(id, 'lookup')",
"outputType": "STRING"
}
]
"dimensions": [
{
"type": "default",
"dimension": "foo"
}
],
"aggregations": [
{
"type": "grouping",
"name": "__grouping__",
"groupings": [
"foo"
]
},
{
"type": "count",
"name": "count"
}
],
"postAggregations": [],
"subtotalsSpec": [
[
"foo"
],
[]
]
}
So it looks like a possible solution for the SQL is understanding why the virtual column v0
that is being "created" is not on the final native query, and making use of it.
Affected Version
2021.11.1-iap
Description
When providing multiple sub-groups to
subtotalsSpec
, the Druid docs recommend using thegrouping
aggregator to differentiate between results. Thegrouping
aggregator reports, for a given list of dimensions, whether or not that dimension is used in a given sub-groups totals.subtotalsSpec
, however, allows providing anyoutputName
, not just dimensions. Take this example:I would expect this query to return results that look like this:
However,
__grouping__
is0b11
for all 4 results; since the "dimensions"a
andb
aren't used in any result (they're not dimensions, they're the output name for lookups). If I provideid
to thegrouping
aggregator, its corresponding bit in the output will correctly be0
for all rows, since it's used to generate both thea
andb
values, but this isn't helpful as I cannot differentiate which results are grouped bya
, and which are grouped byb
.I propose that the
grouping
aggregator allow specifyingoutputName
instead of dimension name, to align 1:1 with howsubtotalsSpec
works and allow for differentiating between output sub-groups in cases like that illustrated above.