Open justinj opened 6 years ago
This is a feature we use in PostgreSQL, that we will need to migrate to CockroachDB.
Support for JSON path is essential for wildcard searches. Here's an example using movies.json, converted to hold one object per line via this command:
perl -i -pe 's/\]\},/]}\n/g' movies.json
Load it up into Postgres:
CREATE TABLE MOVIES (id SERIAL PRIMARY KEY, movie JSONB NOT NULL);
\copy movies(movie) from PROGRAM 'sed "s|\\\\|\\\\\\\\|g" movies.json';
CREATE INDEX ON movies USING GIN (movie jsonb_path_ops);
If I want a list of all movies where Carrie Fisher is a cast member:
try=# select id from movies where movie @@ '$.cast[*] == "Carrie Fisher"';
id
-------
47778
48063
48369
48510
48667
48921
49013
49510
49799
50103
50330
50555
50781
51001
55509
57340
(16 rows)
This is because the cast
field is an array, and cast members can appear anywhere in the array. This is simply not possible using @>
, because it supports no wildcard search.
The closest equivalent I can get in Cockroach is:
CREATE TABLE MOVIES (id SERIAL PRIMARY KEY, movie JSONB NOT NULL);
\copy movies(movie) from 'movies.json';
CREATE INDEX ON movies USING GIN (movie);
A couple of differences here from the Postgres example:
jsonb_path_ops
jsonb indexing (yet)No big deal, but as there are no wildcards, we can search only one at a time:
select id from movies where movie->'cast'->>0 = 'Carrie Fisher';
id
--------------------
772585567540477953
Now we can work around this by putting the array in an ANY()
expression; this returns the expected 16 rows:
try=> select id from movies where 'Carrie Fisher' = ANY(ARRAY(SELECT jsonb_array_elements_text(movie->'cast')));
id
--------------------
772585566846091265
772585567098601473
772585567355633665
772585567540477953
772585567631114241
772585567923175425
772585568023576577
772585568487866369
772585568738705409
772585569013727233
772585569180876801
772585569342947329
772585569525989377
772585569765097473
772585573808635905
772585575588462593
However it cannot use the GIN index:
try=> explain analyze select id from movies where 'Carrie Fisher' = ANY(ARRAY(SELECT jsonb_array_elements_text(movie->'cast')));
info
---------------------------------------------------------------------------------------------------
planning time: 9ms
execution time: 35.9s
distribution: local
vectorized: true
rows read from KV: 30,395 (5.3 MiB)
cumulative time spent in KV: 3.4s
maximum memory usage: 309 MiB
network usage: 0 B (0 messages)
• filter
│ nodes: n1
│ actual row count: 16
│ estimated row count: 10,090
│ filter: 'Carrie Fisher' = ANY COALESCE(array_agg, ARRAY[])
│
└── • render
│ nodes: n1
│ actual row count: 30,395
│ estimated row count: 30,269
│
└── • group
│ nodes: n1
│ actual row count: 30,395
│ estimated row count: 30,269
│ group by: id
│
└── • apply join (left outer)
│ nodes: n1
│ actual row count: 86,511
│ estimated row count: 303,950
│
└── • scan
nodes: n1
actual row count: 30,395
KV time: 3.4s
KV contention time: 0µs
KV rows read: 30,395
KV bytes read: 5.3 MiB
estimated row count: 30,395 (100% of the table; stats collected 31 minutes ago)
table: movies@primary
spans: FULL SCAN
(41 rows)
Time: 36067.055 ms (00:36.067)
Contrast that table scan with this plan from Postgres using jsonpath:
try=# explain analyse select id from movies where movie @@ '$.cast[*] == "Carrie Fisher"'; QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on movies (cost=12.02..23.46 rows=3 width=4) (actual time=0.054..0.176 rows=16 loops=1)
Recheck Cond: (movie @@ '($."cast"[*] == "Carrie Fisher")'::jsonpath)
Heap Blocks: exact=16
-> Bitmap Index Scan on movies_movie_idx1 (cost=0.00..12.02 rows=3 width=0) (actual time=0.033..0.033 rows=16 loops=1)
Index Cond: (movie @@ '($."cast"[*] == "Carrie Fisher")'::jsonpath)
Planning Time: 1.109 ms
Execution Time: 0.253 ms
Aaaaaand I just figured out that @>
does the trick:
try=> select id from movies where movie @> '{"cast":["Carrie Fisher"]}';
id
--------------------
772585566846091265
772585567098601473
772585567355633665
772585567540477953
772585567631114241
772585567923175425
772585568023576577
772585568487866369
772585568738705409
772585569013727233
772585569180876801
772585569342947329
772585569525989377
772585569765097473
772585573808635905
772585575588462593
(16 rows)
Time: 132.415 ms
try=> explain analyze select id from movies where movie @> '{"cast":["Carrie Fisher"]}';
info
-----------------------------------------------------------------------------
planning time: 14ms
execution time: 15ms
distribution: local
vectorized: true
rows read from KV: 16 (992 B)
cumulative time spent in KV: 10ms
maximum memory usage: 20 KiB
network usage: 0 B (0 messages)
• scan
nodes: n1
actual row count: 16
KV time: 10ms
KV contention time: 0µs
KV rows read: 16
KV bytes read: 992 B
estimated row count: 0 (<0.01% of the table; stats collected 2 hours ago)
table: movies@movies_movie_idx
spans: 1 span
(19 rows)
Time: 38.905 ms
So much for the array wildcard argument!
Spec is available here. There's a good blog post covering the feature in Postgres here. I have an incomplete and non-Cockroach-compatible implementation here which includes a goyacc parser, which might be helpful.
I think this could potentially make a decent intern project, in terms of size and scope. It is not a feature of Postgres 10 but will be a feature of Postgres 11.
Jira issue: CRDB-5852 Epic: CRDB-12464
gz#18507