apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0
13.46k stars 3.7k forks source link

`grouping` aggregator should use output name instead of dimension name #12616

Open superhawk610 opened 2 years ago

superhawk610 commented 2 years ago

Affected Version

2021.11.1-iap

Description

When providing multiple sub-groups to subtotalsSpec, the Druid docs recommend using the grouping aggregator to differentiate between results. The grouping aggregator reports, for a given list of dimensions, whether or not that dimension is used in a given sub-groups totals. subtotalsSpec, however, allows providing any outputName, not just dimensions. Take this example:

{
  "queryType": "groupBy",
  "granularity": "all",
  // .. (snip) ..
  "subtotalsSpec": [["a"], ["b"]],
  "aggregations": [
    {
      "type": "grouping",
      "name": "__grouping__",
      "groupings": ["a", "b"]
    }
  ],
  "dimensions": [
    {
      "type": "lookup",
      "dimension": "id",
      "outputName": "a",
      "lookup": {
        "type": "map",
        "map": { "1": "foo", "2": "foo", "3": "bar" }
      }
    },
    {
      "type": "lookup",
      "dimension": "id",
      "outputName": "b",
      "lookup": {
        "type": "map",
        // importantly, `id=2` is in a different sub-group depending on whether
        // we're grouping by `a` or `b` (even though the base dimension, `id`,
        // is the same in each case)
        "map": { "1": "X", "2": "Y", "3": "Z" }
      }
    }
  ]
}

I would expect this query to return results that look like this:

[
  // omitting the timestamp/version/event wrapper, but you get the idea
  {
    "__grouping__": 0b01,
    "a": "foo",
    "b": null,
    "views": 2 // some metric, doesn't really matter
  },
  {
    "__grouping__": 0b01,
    "a": "bar",
    "b": null,
    "views": 1
  },
  {
    "__grouping__": 0b10,
    "a": null,
    "b": "X",
    "views": 1
  },
  {
    "__grouping__": 0b10,
    "a": null,
    "b": "Y",
    "views": 2
  }
]

However, __grouping__ is 0b11 for all 4 results; since the "dimensions" a and b aren't used in any result (they're not dimensions, they're the output name for lookups). If I provide id to the grouping aggregator, its corresponding bit in the output will correctly be 0 for all rows, since it's used to generate both the a and b values, but this isn't helpful as I cannot differentiate which results are grouped by a, and which are grouped by b.

I propose that the grouping aggregator allow specifying outputName instead of dimension name, to align 1:1 with how subtotalsSpec works and allow for differentiating between output sub-groups in cases like that illustrated above.

gonzalezzfelipe commented 2 years ago

I'm facing this problem as well, and I'm not being able to find a work around it (using SQL). By doing a query that is like the following:

SELECT
  LOOKUP(id, 'lookup_name') AS foo,
  GROUPING(LOOKUP(id, 'lookup_name')) AS __grouping__,
  COUNT(*) AS "count"
FROM
  datasource
GROUP BY GROUPING SET((LOOKUP(id, 'lookup_name')), ())

You get all the resulting rows with 1 as grouping, which makes it impossible to identify the different grouping sets.

gonzalezzfelipe commented 2 years ago

TL;DR: For native queries, this can be solved by defining the dimensions that are not of type default as virtual columns, and treating them as regular dimensions. For SQL I wasn't able to find a direct solution.

I got deeper into the problem and found the following:

The GROUPING aggregator creates the grouping key here. This keyDimensions set is created here upon aggregation to check against the groupings. This is created by extracting the dimensions with the getDimension function. For lookup dimensions, for example, this refferences the dimension being passed into the lookup. As stated on the comments on that code, using outputName at the moment of aggregation is not possible, because aggregators are not aware of the output name of the output column names:

      // KeyDimensionNames are the input column names of dimensions. Its required since aggregators are not aware of the
      //  output column names.
      //  As we exclude certain dimensions from the result row, the value for any grouping_id aggregators have to change
      //  to reflect the new grouping dimensions, that aggregation is being done upon. We will mark the indices which have
      //  grouping aggregators and update the value for each row at those indices.

Why does it fail? Because the values inside the keyDimensions set do not correspond to the values that are added to the groupings list on the grouping aggregator. These only coincide for actual dimensions of the table, not transformations.

Nevertheless, there is a way out of the rabbit hole. As shown here, the way that the GROUPING SQL function is converted into a Native query, is by extracting the dimension or virtual column that is being grouped. As shown in the code:

    if (expression.isDirectColumnAccess()) {
      return expression.getDirectColumn();
    }

    String virtualColumn = virtualColumnRegistry.getOrCreateVirtualColumnForExpression(
        expression,
        node.getType()
    );
    return virtualColumn;

So it would seem that there would be no problem if one would be using virtual columns. Running an EXPLAIN PLAN FOR the query in the previous comment, shows that that SQL query is converted (roughly) to the following:

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "table",
    "name": "datasource"
  },
  "intervals": {
    "type": "intervals",
    "intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"]
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "extraction",
      "dimension": "id",
      "outputName": "foo",
      "outputType": "STRING",
      "extractionFn": {
        "type": "registeredLookup",
        "lookup": "lookup"
      }
    }
  ],
  "aggregations": [
    {
      "type": "grouping",
      "name": "__grouping__",
      "groupings": [
        "v0"
      ]
    },
    {
      "type": "count",
      "name": "count"
    }
  ],
  "postAggregations": [],
  "subtotalsSpec": [
    [
      "foo"
    ],
    []
  ]
}

Note that on the GROUPING aggregator, the query is referring to a v0 virtual column that is not added to the query. This is created on the virtualColumnRegistry.getOrCreateVirtualColumnForExpression on the snippet above, but for some reason that virtual column is not effectively added to the query. Later, when checking wether the corresponding dimension is being grouped on aggregator resolver, the v0 virtual column is not found and that's why we get a 1 instead of a 0 for those cases. Nevertheless, if we tweek that query to include the virtual column and address it as such, then the function works as intended.

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "table",
    "name": "datasource"
  },
  "intervals": {
    "type": "intervals",
    "intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"]
  },
  "granularity": {
    "type": "all"
  },
  "virtualColumns": [
    {
      "type": "expression",
      "name": "foo",
      "expression": "LOOKUP(id, 'lookup')",
      "outputType": "STRING"
    }
  ]
  "dimensions": [
    {
      "type": "default",
      "dimension": "foo"
    }
  ],
  "aggregations": [
    {
      "type": "grouping",
      "name": "__grouping__",
      "groupings": [
        "foo"
      ]
    },
    {
      "type": "count",
      "name": "count"
    }
  ],
  "postAggregations": [],
  "subtotalsSpec": [
    [
      "foo"
    ],
    []
  ]
}

So it looks like a possible solution for the SQL is understanding why the virtual column v0 that is being "created" is not on the final native query, and making use of it.