cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com
Other
29.57k stars 3.71k forks source link

sql: support JSONPath #22513

Open justinj opened 6 years ago

justinj commented 6 years ago

Spec is available here. There's a good blog post covering the feature in Postgres here. I have an incomplete and non-Cockroach-compatible implementation here which includes a goyacc parser, which might be helpful.

I think this could potentially make a decent intern project, in terms of size and scope. It is not a feature of Postgres 10 but will be a feature of Postgres 11.

Jira issue: CRDB-5852 Epic: CRDB-12464

gz#18507

brendesp commented 3 years ago

This is a feature we use in PostgreSQL, that we will need to migrate to CockroachDB.

theory commented 2 years ago

Support for JSON path is essential for wildcard searches. Here's an example using movies.json, converted to hold one object per line via this command:

perl -i -pe 's/\]\},/]}\n/g' movies.json

Load it up into Postgres:

CREATE TABLE MOVIES (id SERIAL PRIMARY KEY, movie JSONB NOT NULL);
\copy movies(movie) from PROGRAM 'sed "s|\\\\|\\\\\\\\|g" movies.json';
CREATE INDEX ON movies USING GIN (movie jsonb_path_ops);

If I want a list of all movies where Carrie Fisher is a cast member:

try=# select id from movies where movie @@ '$.cast[*] == "Carrie Fisher"';
  id   
-------
 47778
 48063
 48369
 48510
 48667
 48921
 49013
 49510
 49799
 50103
 50330
 50555
 50781
 51001
 55509
 57340
(16 rows)

This is because the cast field is an array, and cast members can appear anywhere in the array. This is simply not possible using @>, because it supports no wildcard search.

The closest equivalent I can get in Cockroach is:

CREATE TABLE MOVIES (id SERIAL PRIMARY KEY, movie JSONB NOT NULL);
\copy movies(movie) from 'movies.json';
CREATE INDEX ON movies USING GIN (movie);

A couple of differences here from the Postgres example:

No big deal, but as there are no wildcards, we can search only one at a time:

select id from movies where movie->'cast'->>0 = 'Carrie Fisher';
         id         
--------------------
 772585567540477953

Now we can work around this by putting the array in an ANY() expression; this returns the expected 16 rows:

try=> select id from movies where 'Carrie Fisher' = ANY(ARRAY(SELECT jsonb_array_elements_text(movie->'cast')));
         id         
--------------------
 772585566846091265
 772585567098601473
 772585567355633665
 772585567540477953
 772585567631114241
 772585567923175425
 772585568023576577
 772585568487866369
 772585568738705409
 772585569013727233
 772585569180876801
 772585569342947329
 772585569525989377
 772585569765097473
 772585573808635905
 772585575588462593

However it cannot use the GIN index:

try=> explain analyze select id from movies where 'Carrie Fisher' = ANY(ARRAY(SELECT jsonb_array_elements_text(movie->'cast')));
                                               info                                                
---------------------------------------------------------------------------------------------------
 planning time: 9ms
 execution time: 35.9s
 distribution: local
 vectorized: true
 rows read from KV: 30,395 (5.3 MiB)
 cumulative time spent in KV: 3.4s
 maximum memory usage: 309 MiB
 network usage: 0 B (0 messages)

 • filter
 │ nodes: n1
 │ actual row count: 16
 │ estimated row count: 10,090
 │ filter: 'Carrie Fisher' = ANY COALESCE(array_agg, ARRAY[])
 │
 └── • render
     │ nodes: n1
     │ actual row count: 30,395
     │ estimated row count: 30,269
     │
     └── • group
         │ nodes: n1
         │ actual row count: 30,395
         │ estimated row count: 30,269
         │ group by: id
         │
         └── • apply join (left outer)
             │ nodes: n1
             │ actual row count: 86,511
             │ estimated row count: 303,950
             │
             └── • scan
                   nodes: n1
                   actual row count: 30,395
                   KV time: 3.4s
                   KV contention time: 0µs
                   KV rows read: 30,395
                   KV bytes read: 5.3 MiB
                   estimated row count: 30,395 (100% of the table; stats collected 31 minutes ago)
                   table: movies@primary
                   spans: FULL SCAN
(41 rows)

Time: 36067.055 ms (00:36.067)

Contrast that table scan with this plan from Postgres using jsonpath:

try=# explain analyse select id from movies where movie @@ '$.cast[*] == "Carrie Fisher"';                                                               QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on movies  (cost=12.02..23.46 rows=3 width=4) (actual time=0.054..0.176 rows=16 loops=1)
   Recheck Cond: (movie @@ '($."cast"[*] == "Carrie Fisher")'::jsonpath)
   Heap Blocks: exact=16
   ->  Bitmap Index Scan on movies_movie_idx1  (cost=0.00..12.02 rows=3 width=0) (actual time=0.033..0.033 rows=16 loops=1)
         Index Cond: (movie @@ '($."cast"[*] == "Carrie Fisher")'::jsonpath)
 Planning Time: 1.109 ms
 Execution Time: 0.253 ms
theory commented 2 years ago

Aaaaaand I just figured out that @> does the trick:

try=> select id from movies where movie @> '{"cast":["Carrie Fisher"]}';
         id         
--------------------
 772585566846091265
 772585567098601473
 772585567355633665
 772585567540477953
 772585567631114241
 772585567923175425
 772585568023576577
 772585568487866369
 772585568738705409
 772585569013727233
 772585569180876801
 772585569342947329
 772585569525989377
 772585569765097473
 772585573808635905
 772585575588462593
(16 rows)

Time: 132.415 ms
try=> explain analyze select id from movies where movie @> '{"cast":["Carrie Fisher"]}';
                                    info                                     
-----------------------------------------------------------------------------
 planning time: 14ms
 execution time: 15ms
 distribution: local
 vectorized: true
 rows read from KV: 16 (992 B)
 cumulative time spent in KV: 10ms
 maximum memory usage: 20 KiB
 network usage: 0 B (0 messages)

 • scan
   nodes: n1
   actual row count: 16
   KV time: 10ms
   KV contention time: 0µs
   KV rows read: 16
   KV bytes read: 992 B
   estimated row count: 0 (<0.01% of the table; stats collected 2 hours ago)
   table: movies@movies_movie_idx
   spans: 1 span
(19 rows)

Time: 38.905 ms

So much for the array wildcard argument!