crate / crate

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.
https://cratedb.com/database
Apache License 2.0
4.13k stars 568 forks source link

Support for individual indexing of OBJECTs in arrays (nested type) #11863

Open proddata opened 3 years ago

proddata commented 3 years ago

Use case:

As a user I want to be able to have information stored in arrays, but still be able to make a selection on logical combinations of individual object properties (e.g. I want to select all engineers that know "SQL" with a level of at least "3"):

CREATE TABLE engineers ( 
        id TEXT, 
        skills ARRAY(OBJECT AS( 
            type TEXT, 
            level INT 
        )) 
    ); 

INSERT INTO engineers (id,skills) VALUES ('123',[{"type" = 'Java', "level" = 4},{"type" = 'SQL', "level" = 2}]);
INSERT INTO engineers (id,skills) VALUES ('124',[{"type" = 'Python', "level" = 5},{"type" = 'SQL', "level" = 4}]);  

SELECT * FROM engineers;                                                                                                                                                                 
+-----+---------------------------------------------------------------+
| id  | skills                                                        |
+-----+---------------------------------------------------------------+
| 124 | [{"level": 5, "type": "Python"}, {"level": 4, "type": "SQL"}] |
| 123 | [{"level": 4, "type": "Java"}, {"level": 2, "type": "SQL"}]   |
+-----+---------------------------------------------------------------+

Example 1 - Array comparsion

To find all engineers that know "SQL" and have skill with a level of at least "3", today one can individually filter on each column using array comparisons:

SELECT id, skills 
    FROM engineers 
    WHERE 'SQL' = ANY(skills['type']) and 3 <= ANY(skills['level']);                                                                                                                          
+-----+---------------------------------------------------------------+
| id  | skills                                                        |
+-----+---------------------------------------------------------------+
| 124 | [{"level": 5, "type": "Python"}, {"level": 4, "type": "SQL"}] |
| 123 | [{"level": 4, "type": "Java"}, {"level": 2, "type": "SQL"}]   |
+-----+---------------------------------------------------------------+
--> returns all engineers with SQL knowledge and any skill of level 3 and up

Example 2 - Object comparison and one can use an object comparison, to find engineers that have a SQL skill of e.g. exactly level 2

SELECT id, skills  
    FROM engineers 
    WHERE {"type" = 'SQL', "level" = 2 } = ANY(skills);                                                                                                                                      
+-----+-------------------------------------------------------------+
| id  | skills                                                      |
+-----+-------------------------------------------------------------+
| 123 | [{"level": 4, "type": "Java"}, {"level": 2, "type": "SQL"}] |
+-----+-------------------------------------------------------------+
SELECT 1 row in set (0.025 sec)
--> returns all engineers with SQL knowledge of exactly  level 2

which however is using basically a table scan and should only be used together with another index.

Example 3 - UNNEST + sub-SELECTs

To find all engineers that know "SQL" with a level of at least "3" one would need to use unnest and sub-selects:

WITH skills AS (SELECT id, UNNEST(skills) AS unnested_skills FROM engineers), 
         sql_skills AS (SELECT id FROM skills WHERE unnested_skills['type'] = 'SQL' and unnested_skills['level'] >=3) 
    SELECT * 
    FROM engineers 
    WHERE id IN (SELECT id FROM sql_skills);                                                                                                                                                 
+-----+---------------------------------------------------------------+
| id  | skills                                                        |
+-----+---------------------------------------------------------------+
| 124 | [{"level": 5, "type": "Python"}, {"level": 4, "type": "SQL"}] |
+-----+---------------------------------------------------------------+

Too make it faster, one should probably also add a filter on the original array columns:

WITH skills AS (SELECT id, UNNEST(skills) AS unnested_skills FROM engineers WHERE 'SQL' = ANY(skills['type']) and 3 <= ANY(skills['level'])), 
         sql_skills AS (SELECT id FROM skills WHERE unnested_skills['type'] = 'SQL' and unnested_skills['level'] >=3) 
    SELECT * 
    FROM engineers 
    WHERE id IN (SELECT id FROM sql_skills);                                                                                                                                                 
+-----+---------------------------------------------------------------+
| id  | skills                                                        |
+-----+---------------------------------------------------------------+
| 124 | [{"level": 5, "type": "Python"}, {"level": 4, "type": "SQL"}] |
+-----+---------------------------------------------------------------+

which makes the query itself rather complex.


As of now, one could only reduce the search space by using ANY on the individual columns, but can't actually represent a logical combination in the selection. On top of that one needs to use a UNNEST which makes querying rather expensive.

Feature description: see use case / t.b.d.

Maybe something like https://www.monterail.com/blog/how-to-index-objects-elasticsearch

mfussenegger commented 3 years ago

If I understand this correctly it sounds a bit like a use-case for jsonpath support: https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH

proddata commented 3 years ago

If I understand this correctly it sounds a bit like a use-case for jsonpath support: https://www.postgresql.org/docs/12/functions-json.html#FUNCTIONS-SQLJSON-PATH

I think it is/could be more than "just" an operator, but rather also a different indexing strategy.

Afaik right now the index points on e.g. skills['type'] to a document id. In order to quickly select on a specific element in an array, the index would need to point to an element within the array of a document.

e.g.

id skills
1 [ { "type": "SQL" , "level": 4}]
2 [ { "type": "SQL" , "level": 2}, { "type": "Java" , "level": 4}]

Right now the index on ['type'] and ['level'] point to the document, but they also could point to the individual array element.