cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.02k stars 3.79k forks source link

GROUP BY support for jsonb array values #35848

Closed JorritSalverda closed 3 years ago

JorritSalverda commented 5 years ago

I'm trying to do a group by query using values in a JSONB array. What I'm trying to figure out what features still need to be implemented to do what's show below and what alternative to use in the meantime.

CREATE TABLE computed_pipelines (
  id INT PRIMARY KEY DEFAULT unique_rowid(),
  labels JSONB NULL
);

INSERT INTO computed_pipelines
  (labels),
  ('[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]');
INSERT INTO computed_pipelines
  (labels),
  ('[{"key": "group", "value": "a"}, {"key": "group", "value": "c"}]');
INSERT INTO computed_pipelines
  (labels),
  ('[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]');

SELECT
  l->>'key' as key,
  l->>'value' as value,
  COUNT(DISTINCT c.id) as nr_computed_pipelines
FROM
  computed_pipelines c,
  jsonb_array_elements (c.labels) l
GROUP BY
  key,
  value
HAVING
  nr_computed_pipelines > 1
ORDER BY
  nr_computed_pipelines DESC,
  key,
  value;

Describe the solution you'd like

I'd like the select query above to return the following result:

  key   | value | nr_computed_pipelines
+-------+-------+-----------------------+
  group | a     | 3
  group | b     | 2

Instead it returns no data source matches prefix: c which is due to correlated subqueries not being supported yet, see https://github.com/cockroachdb/cockroach/issues/24676

Describe alternatives you've considered

Just retrieving the values from the array without grouping works fine with

SELECT
  jsonb_array_elements (c.labels)->>'key' as key,
  jsonb_array_elements (c.labels)->>'value' as value
FROM
  computed_pipelines c;

As soon as I try

SELECT
  jsonb_array_elements (c.labels)->>'key' as key,
  jsonb_array_elements (c.labels)->>'value' as value
FROM
  computed_pipelines c
GROUP BY
  key,
  value;

It returns column "key" does not exist. I think (partially) due to the fact that using aliases in group by, etc isn't supported yet. See https://github.com/cockroachdb/cockroach/issues/28059.

And if I try the following to work around this

SELECT
  jsonb_array_elements (c.labels)->>'key' as key,
  jsonb_array_elements (c.labels)->>'value' as value
FROM
  computed_pipelines c
GROUP BY
  jsonb_array_elements (c.labels)->>'key',
  jsonb_array_elements (c.labels)->>'value';

It fails with jsonb_array_elements(): generator functions are not allowed in GROUP BY. On this topic I found closed issue https://github.com/cockroachdb/cockroach/issues/10520.

An alternative I though about as well is to store the result of the non-grouped query in a temporary table, but for that https://github.com/cockroachdb/cockroach/issues/5807 needs to be implemented.

Are there any other features that need to be implemented before the original select query is possible? And what are the alternatives currently? Right now I don't group, count and sort in my query, but have to do it in code instead.

awoods187 commented 5 years ago

I just confirmed that this does not work on master either. The inserts above didn't work for me until I modified them:

INSERT
INTO
    computed_pipelines (labels)
VALUES
    (
        '[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]'
    );
INSERT
INTO
    computed_pipelines (labels)
VALUES
    (
        '[{"key": "group", "value": "a"}, {"key": "group", "value": "c"}]'
    );
INSERT
INTO
    computed_pipelines (labels)
VALUES
    (
        '[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]'
    );

What's happening here @justinj ?

JorritSalverda commented 5 years ago

Oh sorry, I didn't test the insert query, just did them from the top of my head. I removed the incorrect braces. But can probably be merged into the following:

INSERT INTO
  computed_pipelines (labels)
VALUES
  (
    '[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]',
    '[{"key": "group", "value": "a"}, {"key": "group", "value": "c"}]',
    '[{"key": "group", "value": "a"}, {"key": "group", "value": "b"}]'
  );

But the inserts themselves are not really the issue here, just included them to illustrate my point :)

justinj commented 5 years ago

The problem is what @knz describes in the linked issue which is more about the implicit lateral syntax than it is correlated subqueries.

Here's a somewhat gnarly query that I think does what you want:

SELECT
    key, value, nr_computed_pipelines
FROM
    (
        SELECT
            key, value, count(DISTINCT id) AS nr_computed_pipelines
        FROM
            (
                SELECT
                    l->>'key' AS key, l->>'value' AS value, id
                FROM
                    (SELECT *, jsonb_array_elements(labels) AS l FROM computed_pipelines)
            )
        GROUP BY
            key, value
    )
WHERE
    nr_computed_pipelines > 1
ORDER BY
    nr_computed_pipelines DESC, key, value

You're right that #28059 is also a blocker to making this query a bit nicer.

awoods187 commented 5 years ago

Here is the lateral issue for tracking purposes https://github.com/cockroachdb/cockroach/issues/24560

JorritSalverda commented 5 years ago

Cool the nasty query works! For making sure i'm actually unnesting an array I need to check the jsonb type, so the most inner select query becomes:

(SELECT id, jsonb_array_elements(labels) AS l FROM computed_pipelines where jsonb_typeof(labels) = 'array')

Should we keep this open until the dependencies are implemented? Or close it so it doesn't bloat your backlog?

knz commented 5 years ago

I think at the very least we should integrate this query into the QA / unit test suite when the dependencies are implemented. I'm not sure how to best do this -- should the issue remain open so that the future implementor can look at it? @jordanlewis please advise.

github-actions[bot] commented 3 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

jordanlewis commented 3 years ago

The limitation here has been lifted. The OP's query still doesn't work, but that's because you're not allowed to reference an aliased group by column in a having clause even in other db's like postgres.