apache / age

Graph database optimized for fast analysis and real-time data processing. It is provided as an extension to PostgreSQL.
https://age.apache.org
Apache License 2.0
3.06k stars 409 forks source link

Research alternative of graphid #1021

Closed rafsun42 closed 8 months ago

rafsun42 commented 1 year ago

Approach 1 - Using a trimmed and indexed version of label table for join

The query

Cypher query that extracts label ID (see the QPT just below):

SELECT * FROM cypher('imdb',
$$
    EXPLAIN MATCH (p:Person{primaryName:'Christian Bale'})-[e:IS_IN]->(:Title) RETURN e
$$) as (a agtype);
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Gather  (cost=457881.41..742804.15 rows=105 width=32)
   Workers Planned: 2
   ->  Parallel Hash Join  (cost=456881.41..741793.65 rows=44 width=32)
         Hash Cond: (e.start_id = p.id)
         ->  Parallel Seq Scan on "IS_IN" e  (cost=0.00..284748.30 rows=43610 width=29)
               Filter: ((_extract_label_id(end_id))::integer = 3)
         ->  Parallel Hash  (cost=456815.59..456815.59 rows=5266 width=8)
               ->  Parallel Seq Scan on "Person" p  (cost=0.00..456815.59 rows=5266 width=8)
                     Filter: (properties @> agtype_build_map('primaryName'::text, '"Christian Bale"'::agtype))
(9 rows)

My goal is to replace the use of _extract_label_id in the Filter node. This line is filtering out edges where end node is not a Title.

Building the solution that does not extract label ID

My solution is to create a trimmed and indexed table of Title table. Calling it Title_hash. It has only ID column and indexed by hash method.

CREATE SCHEMA imdb_dev;
CREATE TABLE imdb_dev."Title_hash" (id graphid); -- id will be BigSerial when implemented in C
CREATE INDEX imdb_dev_title_hash_id ON imdb_dev."Title_hash" USING hash (id);
INSERT INTO imdb_dev."Title_hash" SELECT id FROM imdb."Title"; -- load the new table

Query on the new solution

The SQL query that uses the new table: (It is equivalent to the above cypher query.)

EXPLAIN
SELECT *
FROM imdb."IS_IN" e
    JOIN imdb."Person" p
    ON e.start_id = p.id
WHERE
    (p.properties @> agtype_build_map('primaryName'::text, '"Christian Bale"'::agtype))
    AND
    e.end_id IN (SELECT id FROM imdb_dev."Title_hash")
;
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Gather  (cost=457881.41..745430.91 rows=20933 width=238)
   Workers Planned: 2
   ->  Nested Loop Semi Join  (cost=456881.41..742337.61 rows=8722 width=238)
         ->  Parallel Hash Join  (cost=456881.41..730763.53 rows=8722 width=238)
               Hash Cond: (e.start_id = p.id)
               ->  Parallel Seq Scan on "IS_IN" e  (cost=0.00..241138.20 rows=8722020 width=29)
               ->  Parallel Hash  (cost=456815.59..456815.59 rows=5266 width=209)
                     ->  Parallel Seq Scan on "Person" p  (cost=0.00..456815.59 rows=5266 width=209)
                           Filter: (properties @> agtype_build_map('primaryName'::text, '"Christian Bale"'::agtype))
         ->  Index Scan using imdb_dev_title_hash_id on "Title_hash"  (cost=0.00..1.32 rows=1 width=8)
               Index Cond: (id = e.end_id)
(11 rows)

Rationale

Because Title_hash has less data per row and it is indexed, joining it would be faster than joining with the original Title table. The cost of these query is almost similar.

panosfol commented 1 year ago

I was searching the get_label_name() function and this idea of a junction table also works there. I would need to modify the function in 2 ways:

After that the function will continue its flow as normal since the only difference is how we obtain the label_id.

@rafsun42 Should I go along with these changes by making a temporary table via SQL commands to test my code?

panosfol commented 1 year ago

Im in the process of a creating a junction table to hold just the id and the label_id of a given vertex. The table is created at the creation of the graph under the <graph_name> schema. I have successfully created the table with its columns but from what I understand every vertex that is inserted in the graph should also be inserted in the junction table, and Im having a little trouble with the inheritance system. If I understand it correctly the inheritance system is the reason that when a vertex is added to a label table, its also added to the ag_label_vertex table. Therefore I believe that the right course of action is for the junction table to have the ag_label_vertex table as a parent and then to alter the table through the postgres API to remove the unnecessary columns.

The problem with that is that I dont have access to the ag_label_vertex table at the creation of the graph, so I can pass it as a list for the parents argument to the function that creates the junction table. Is this the correct way to tackle this? How should i proceed?

rafsun42 commented 1 year ago

@panosfol I am not sure if that's how postgres inheritance work. A row is not added to both child and parent table. Postgres has a documentation on it.

panosfol commented 1 year ago

@rafsun42 According to the documentation it doesn't and I got a bit confused there. I can't pinpoint exactly how in the code the vertex is inserted in both the label_table and the ag_label_vertex at the same time, I assumed it had to do with inheritance.

rafsun42 commented 1 year ago

@panosfol Postgres' FROM ONLY clause may give you some hint.

panosfol commented 1 year ago

@rafsun42 ok thank you I will look into that!

rafsun42 commented 1 year ago

@panosfol The documentation I was referring to before states:

Inheritance does not automatically propagate data from INSERT or COPY commands to other tables in the inheritance hierarchy.

That is the reason I was concerned if inheritance would work well or not.

So, we have to do this without using inheritance. Create the junction table independently with foreign key constraint on both columns. Insert into both vertex table and junction table.

Any pros and cons of not using inheritance?

panosfol commented 1 year ago

@rafsun42 well a pretty obvious con is that we have to insert all the entries in the junction table so its going to double the time of inserting and updating. Other than that I don't see another issue. The benefit is that it would work with changing the get_label_name function (and maybe it would work better with other changes).

Ill push a commit fixing the creation of the junction table and then Ill look into how we can insert the vertices in the junction table also. Ill look into the create_vertex function.

panosfol commented 1 year ago

@rafsun42 So there are 2 issues that im facing now. First there isn't a function in the cache that takes just the OID and returns the graph_name, only the other way around. I have for now hard coded the graph name that Im using so I can test things.

The other issue is that the server crashes when I try to insert the tuple in the junction table. I went to the create_vertex function in cypher_create.c:476 and using the function create_entity_result_rel_info() I'm successfully creating the ResultRelInfo for the junction table. After that Im duplicating the insert_entity_tuple and Im using the same elemTupleSlot and the ResultRelInfo from the junction table. But the server crashes because the elemTupleSlot was created with 3 columns in mind and the junction table only has 2. I tried to copying the contents of the node->elemTupleSlot to a new variable but I get segmentation fault. It seems that the tuple being created for that specific table is causing problems.

I will look into it more but it seems its a bit more complicated than I thought. Ill try to have it ready by tomorrow.

panosfol commented 1 year ago

@rafsun42 I have found a Postgres function that returns the Namespacename given the OID so one problem is solved. But i dont think its the right direction to just insert the tuple twice with no prep, the tuple holds crucial info about the table that is being insered to, its not as simple as just choosing to insert it in another table.

panosfol commented 1 year ago

@rafsun42 Ok so on further inspection the match clause also wont work without some changes because the patter is different if i create a new vertex for the junction table. I need to know how to move forward. Should we create the new vertex and make the necessary changes or try to just insert the tuple at the execution stage? Inserting the tuple by itself is complicated and Im not even sure if its doable to be honest