apache / age

Graph database optimized for fast analysis and real-time data processing. It is provided as an extension to PostgreSQL.
https://age.apache.org
Apache License 2.0

Apache AGE takes a long time to create graph node #1287

Closed I33Buckler closed 5 months ago

I33Buckler commented 1 year ago

I am in the process of creating a graph node from a database table that has 7.6 million rows. The process runs a function that loads the table into the graph with loops. On the initial run, the process was still running after nearly 24 hours.
To test the loop I ran it without loading into the graph; that took nearly 5 minutes of execution. The code for creating the graph node is below. Uncommenting the section with the PERFORM statement results in the long-running process. I am wondering if something is missing from the process.

--create graph 
--select * from drop_graph('transport_network',true);
--SELECT create_graph('transport_network');

-- function to test for existence of graph node 
drop function if exists node_check;
create or replace function node_check (graph_name text,node_name text)
returns text
LANGUAGE plpgsql
as $node_check$
DECLARE 
    table_exists boolean;
    node_status text;
    test text;
begin
drop table if exists table_test;
-- Check for existence of  node 
execute format('create temp table table_test as SELECT COUNT(*) as existence FROM information_schema.tables WHERE lower(table_schema) = lower(%1$s) 
                                    AND lower(table_type) = ''base table'' and lower(table_name) = lower(%2$s);',
                format('''%s''',graph_name),format('''%s''',node_name));
-- create or empty node
select existence != 0 from table_test into table_exists;  
if table_exists then 
node_status = 'Node exists';
execute format('select * from cypher(''%1$s'',$$match (v:%2$s) detach delete v$$) as (v agtype);',graph_name,node_name);
else 
node_status = 'Node created';
execute format('select create_vlabel(%1$s,%2$s);',
                format('''%s''',graph_name),format('''%s''',node_name));
end if;
drop table if exists table_test;
return node_status;
END;
$node_check$;

-- Function for creating graph node from existing table 
drop function if exists create_stop_times;
CREATE OR REPLACE FUNCTION create_stop_times(graph_name text, tc_date date, trip_id text, stop_id text, 
                        arrival_time text, arrival_time_sec bigint, departure_time text, departure_time_sec bigint,
                        stop_sequence bigint, pickup_type int, drop_off_type int)
returns text 
LANGUAGE plpgsql
VOLATILE
as $stop_times$
declare 
    nodename text := graph_name || '.' || 'stop_times';
BEGIN
execute 
format ('select * from 
            cypher(''%1$s'', 
                $$match (t:trips),(s:stops)
                         where t.id=%2$s and s.id=%3$s and t.tc_date=%11$s and t.tc_date=s.tc_date
                    create(t)<-
                    [:PART_OF_TRIP]-
                    (st:stop_times {tc_date: %11$s, arrival_time: %4$s, arrival_time_seconds: %5$s, departure_time: %6$s, departure_time_seconds: %7$s, 
                            stop_sequence: %8$s, pickup_type: %9$s, drop_off_type: %10$s})-
                    [:LOCATED_AT]->(s)$$) as (st agtype);',
            quote_ident(graph_name),quote_ident(trip_id),quote_ident(stop_id),
                    quote_ident(arrival_time),to_char(arrival_time_sec,'999999'),
                    quote_ident(departure_time),to_char(departure_time_sec,'999999'),
                    to_char(stop_sequence,'9999'),to_char(pickup_type,'9'),to_char(drop_off_type,'9'),
                    format('"%s"',to_char(tc_date,'yyyy-mm-dd')));
return nodename;
END
$stop_times$
;

-- Check for node. Reset if exists 
select node_check('transport_network','stop_times');

-- Initialise tables 
drop table if exists stop_times, trip_ranks;

-- Stop times for processing by trip
create temp table stop_times as 
select 
dense_rank() over (order by tc_date,trip_id) as trip_rank,
row_number() over (partition by tc_date,trip_id order by stop_sequence) as row_num,* from allservices.stop_times ;
create index on stop_times (trip_rank);
create index on stop_times (row_num);
create unique index on stop_times (trip_rank,row_num);
create unique index on stop_times (trip_rank,stop_sequence);

-- individual trips for processing stop times
create temp table trip_ranks as select distinct trip_rank from stop_times;
create unique index on trip_ranks (trip_rank);

-- define table for processing in loop 
drop table if exists stop_times_loop;
create temp table stop_times_loop (like stop_times);

do $$
declare temprow record;
        triprank record;
        graph_name text:='transport_network';
        row_counter integer := 0 ;
        rank_counter integer := 0;
begin
-- Loop through trips ranked by TC_Date
for triprank in select trip_rank from trip_ranks order by trip_rank
loop 
rank_counter := rank_counter+1;
row_counter := 0; -- reset row counter for looping through rows in trip by date
for temprow in select row_num from stop_times where trip_rank=rank_counter order by row_num
loop
row_counter := row_counter+1; 
-- Insert trip by date and associated rows for stop sequence on trip
truncate table stop_times_loop;
insert into stop_times_loop select * from stop_times where trip_rank=rank_counter and row_num=row_counter;
--perform create_stop_times(graph_name,
--                                  a.tc_date,a.trip_id,a.stop_id,
--                                  a.arrival_time,a.arrival_time_sec,
--                                  a.departure_time,a.departure_time_sec,
--                                  a.row_num,a.pickup_type,a.drop_off_type)
--      from stop_times_loop a
--;
end loop;
end loop;
end; $$
;
rafsun42 commented 1 year ago

@I33Buckler AGE has utilities to load graph from CSV. Can it be used instead of plpgsql functions? https://age.apache.org/age-manual/master/intro/agload.html

Also, are you using the latest master? A patch on detach delete was merged recently.
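For reference, the agload utilities linked above can also be invoked server-side from SQL, which avoids a `cypher()` round trip per row. A minimal sketch, assuming the table has already been exported to CSV; the file path and CSV layout here are assumptions, so check the agload manual page for the required columns:

```sql
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

-- Export first, e.g. from psql (path is hypothetical):
-- \copy (SELECT * FROM allservices.stop_times) TO '/tmp/stop_times.csv' CSV HEADER

-- Bulk-load all rows as :stop_times vertices in a single call.
SELECT load_labels_from_file('transport_network', 'stop_times',
                             '/tmp/stop_times.csv');
```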

I33Buckler commented 1 year ago

> @I33Buckler AGE has utilities to load graph from CSV. Can it be used instead of plpgsql functions? https://age.apache.org/age-manual/master/intro/agload.html

It seems rather limiting, for a recommended database extension, to process the table by exporting to CSV and processing it outside the database. I expect to be able to process the database directly. I found this page a useful reference: https://stackoverflow.com/questions/75178525/is-it-possible-to-create-a-graph-in-age-using-existing-table-in-the-database The reason for the looping is that when I processed the table as a direct read, the function completed without reading the entire table into the graph. Note this was developed on smaller tables. The next smaller table to Stop_Times has about 35,000 rows.

> Also, are you using the latest master? A patch on detach delete was merged recently.

Thank you for the reminder. I have pulled the updates from the repository and rebuilt the module. The reported version is 1.4.0. Note that detach delete is only called once, to initialise the graph node before loading in the contents for the loop. This was added for removing nodes that were created as a result of testing specific sections of the table AllServices.Stop_Times. The process is still taking a long time to run.

jrgemignani commented 1 year ago

@I33Buckler Your create_stop_times function is O(N^3) and that is inside a doubly nested loop that also does SELECTs from tables for each level. This can lead to as much as O(N^5) (possibly more) runtime for the whole process.

Even without considering the 2 external loops which could be O(N^2), just the O(N^3) for create_stop_times is bad enough to make what you are doing only viable for small datasets.

My suggestion is that you need to rewrite your code to make it more efficient. At this moment, this does not appear to be an Apache AGE issue.

If you have additional questions or need some guidance, please do ask and we will try to help where possible.
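As a starting point, the doubly nested loop in the DO block above can be collapsed into a single ordered pass over the temp table. This is a sketch only: it keeps the per-row create_stop_times call, so the MATCH inside that function remains the dominant cost, but the repeated SELECTs per rank/row and the truncate/insert of stop_times_loop disappear.

```sql
DO $$
DECLARE
    r record;
BEGIN
    -- One ordered scan replaces the nested loops over trip_ranks
    -- and stop_times and the staging table stop_times_loop.
    FOR r IN SELECT * FROM stop_times ORDER BY trip_rank, row_num LOOP
        PERFORM create_stop_times('transport_network',
                                  r.tc_date, r.trip_id, r.stop_id,
                                  r.arrival_time, r.arrival_time_sec,
                                  r.departure_time, r.departure_time_sec,
                                  r.row_num, r.pickup_type, r.drop_off_type);
    END LOOP;
END $$;
```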

I33Buckler commented 1 year ago

> @I33Buckler Your create_stop_times function is O(N^3) and that is inside a doubly nested loop that also does SELECTs from tables for each level. This can lead to as much as O(N^5) (possibly more) runtime for the whole process.
>
> Even without considering the 2 external loops which could be O(N^2), just the O(N^3) for create_stop_times is bad enough to make what you are doing only viable for small datasets.
>
> My suggestion is that you need to rewrite your code to make it more efficient. At this moment, this does not appear to be an Apache AGE issue.
>
> If you have additional questions or need some guidance, please do ask and we will try to help where possible.

Thank you for the response. Very helpful. One of the reasons for looping was that running the query on the table without looping resulted in it failing, with the following messages in the PostgreSQL log. Note that the process is running directly on the server.

2023-10-26 23:13:45.675 AEDT [44900] LOG:  checkpoint starting: time
2023-10-26 23:14:59.530 AEDT [44900] LOG:  checkpoint complete: wrote 725 buffers (0.1%); 0 WAL file(s) added, 0 removed, 1 recycled; write=72.978 s, sync=0.509 s, total=73.855 s; sync files=79, longest=0.089 s, average=0.007 s; distance=9734 kB, estimate=23911 kB
2023-10-26 23:16:36.209 AEDT [45561] matthew@tcanalysis LOG:  could not receive data from client: Connection reset by peer
2023-10-26 23:16:36.210 AEDT [45558] matthew@tcanalysis LOG:  could not receive data from client: Connection reset by peer
2023-10-26 23:16:36.210 AEDT [45559] matthew@tcanalysis LOG:  could not receive data from client: Connection reset by peer
2023-10-26 23:16:36.210 AEDT [45572] matthew@tcanalysis LOG:  could not receive data from client: Connection reset by peer
2023-10-26 23:16:36.212 AEDT [45560] matthew@tcanalysis LOG:  could not receive data from client: Connection reset by peer
2023-10-26 23:16:36.212 AEDT [45562] matthew@tcanalysis LOG:  could not receive data from client: Connection reset by peer
2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis LOG:  could not send data to client: Broken pipe
2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis STATEMENT:  select create_stop_times('transport_network',
                                        a.tc_date,a.trip_id,a.stop_id,
                                        a.arrival_time,a.arrival_time_sec,
                                        a.departure_time,a.departure_time_sec,
                                        a.row_num,a.pickup_type,a.drop_off_type)
    from stop_times a
    ;
2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis FATAL:  connection to client lost
2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis STATEMENT:  select create_stop_times('transport_network',
                                        a.tc_date,a.trip_id,a.stop_id,
                                        a.arrival_time,a.arrival_time_sec,
                                        a.departure_time,a.departure_time_sec,
                                        a.row_num,a.pickup_type,a.drop_off_type)
    from stop_times a
    ;
2023-10-26 23:18:45.540 AEDT [44900] LOG:  checkpoint starting: time

I've partitioned the table by tc_date and process it by looping through each tc_date, which hopefully will improve the timing. The result is 23 separate steps, one for each tc_date.

I33Buckler commented 1 year ago

A factor I've noticed with regard to the performance difference between joining the PostgreSQL tables in the database and joining the nodes is the difference in estimated cost. EXPLAIN on the native query below returns a total cost of 296852.63.

select st.* from 
allservices.stops s,
allservices.trips t,
allservices.stop_times st 
where s.tc_date=t.tc_date and t.tc_date=st.tc_date
        and s.stop_id=st.stop_id and t.trip_id=st.trip_id 
    ;

Whereas EXPLAIN on the graph query returns a total cost of 2115202.70:

CREATE OR REPLACE FUNCTION create_stop_times(graph_name text, tc_date date, trip_id text, stop_id text, 
                        arrival_time text, arrival_time_sec bigint, departure_time text, departure_time_sec bigint,
                        stop_sequence bigint, pickup_type int, drop_off_type int)
returns text 
LANGUAGE plpgsql
VOLATILE
as $stop_times$
declare 
    nodename text := graph_name || '.' || 'stop_times';
BEGIN
execute 
format ('select * from 
            cypher(''%1$s'', 
                $$match (t:trips),(s:stops)
                         where t.id=%2$s and s.id=%3$s and t.tc_date=%11$s and t.tc_date=s.tc_date
                    create(t)<-
                    [:PART_OF_TRIP]-
                    (st:stop_times {tc_date: %11$s, arrival_time: %4$s, arrival_time_seconds: %5$s, departure_time: %6$s, departure_time_seconds: %7$s, 
                            stop_sequence: %8$s, pickup_type: %9$s, drop_off_type: %10$s})-
                    [:LOCATED_AT]->(s)$$) as (st agtype);',
            quote_ident(graph_name),quote_ident(trip_id),quote_ident(stop_id),
                    quote_ident(arrival_time),to_char(arrival_time_sec,'999999'),
                    quote_ident(departure_time),to_char(departure_time_sec,'999999'),
                    to_char(stop_sequence,'9999'),to_char(pickup_type,'9'),to_char(drop_off_type,'9'),
                    format('"%s"',to_char(tc_date,'yyyy-mm-dd')));
return nodename;
END
$stop_times$
;
select create_stop_times('transport_network',a.tc_date,a.trip_id,a.stop_id,a.arrival_time,a.arrival_time_sec,a.departure_time,a.departure_time_sec,a.row_num,a.pickup_type,a.drop_off_type)
from stop_times a
;

This is about a 7:1 difference in cost between the native SQL and the graph query.

jrgemignani commented 1 year ago

@I33Buckler If there is a cypher command causing a crash, or failing, we can try to look into that but, we will need more information about what is crashing or failing. The log information, while nice to show there might be an issue, isn't enough to help.

2023-10-26 23:16:36.212 AEDT [45562] matthew@tcanalysis LOG: could not receive data from client: Connection reset by peer
2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis LOG: could not send data to client: Broken pipe

As for the estimated cost difference between the two queries, remember that you are comparing two very different queries with very different runtimes. For example, the SQL query doesn't even write to another table and one can't just assume that a Cypher query only has to write to just one table to save data.

I know it's easier to expect them to work similarly because the languages appear similar but, that's about as similar as they get.

I33Buckler commented 1 year ago

> @I33Buckler If there is a cypher command causing a crash, or failing, we can try to look into that but, we will need more information about what is crashing or failing. The log information, while nice to show there might be an issue, isn't enough to help.
>
> 2023-10-26 23:16:36.212 AEDT [45562] matthew@tcanalysis LOG: could not receive data from client: Connection reset by peer
> 2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis LOG: could not send data to client: Broken pipe
>
> As for the estimated cost difference between the two queries, remember that you are comparing two very different queries with very different runtimes. For example, the SQL query doesn't even write to another table and one can't just assume that a Cypher query only has to write to just one table to save data.
>
> I know it's easier to expect them to work similarly because the languages appear similar but, that's about as similar as they get.

Thanks for the clarification regarding the cypher query process. Unfortunately the log is the only information I could find at the time of the query crash. So far the process is still running.

jrgemignani commented 1 year ago

@I33Buckler If you are using that function, with the O(N^3) runtime, or greater, I wouldn't wait as it could take a very long time. You need to find a way to write a more efficient query to do what you are doing.

I33Buckler commented 1 year ago

> @I33Buckler If you are using that function, with the O(N^3) runtime, or greater, I wouldn't wait as it could take a very long time. You need to find a way to write a more efficient query to do what you are doing.

Thank you. Hence the question. I modified the example described at "Is it possible to create a graph in AGE using existing table in the database?" to load directly from a table, as shown in the code below. The problem was that running this process as-is, without segmentation, resulted in an unexpected crash of the query. I tested the process by segmenting at tc_date, which created 23 segments, and ran a couple of tests. Each test ran successfully, unlike the full loading of the table. Hence the decision to loop the process over tc_date, processing each segment separately. Until I can determine the cause of the crash and a possible solution, segmentation is the best response. Any suggestions on improving the process below are welcome.

-- Function to load database row into graph schema 
CREATE OR REPLACE FUNCTION create_stop_times(graph_name text, tc_date date, trip_id text, stop_id text, 
                        arrival_time text, arrival_time_sec bigint, departure_time text, departure_time_sec bigint,
                        stop_sequence bigint, pickup_type int, drop_off_type int)
returns text 
LANGUAGE plpgsql
VOLATILE
as $stop_times$
declare 
    nodename text := graph_name || '.' || 'stop_times';
BEGIN
execute 
format ('select * from 
            cypher(''%1$s'', 
                $$match (t:trips),(s:stops)
                         where t.id=%2$s and s.id=%3$s and t.tc_date=%11$s and t.tc_date=s.tc_date
                    create(t)<-
                    [:PART_OF_TRIP]-
                    (st:stop_times {tc_date: %11$s, arrival_time: %4$s, arrival_time_seconds: %5$s, departure_time: %6$s, departure_time_seconds: %7$s, 
                            stop_sequence: %8$s, pickup_type: %9$s, drop_off_type: %10$s})-
                    [:LOCATED_AT]->(s)$$) as (st agtype);',
            quote_ident(graph_name),quote_ident(trip_id),quote_ident(stop_id),
                    quote_ident(arrival_time),to_char(arrival_time_sec,'999999'),
                    quote_ident(departure_time),to_char(departure_time_sec,'999999'),
                    to_char(stop_sequence,'9999'),to_char(pickup_type,'9'),to_char(drop_off_type,'9'),
                    format('"%s"',to_char(tc_date,'yyyy-mm-dd')));
return nodename;
END
$stop_times$
;
-- Create table with row number for referencing during step through
create temp table stop_times as select dense_rank() over (order by tc_date,trip_id) as trip_rank,
row_number() over (partition by tc_date,trip_id order by stop_sequence) as row_num,* from allservices.stop_times
order by row_num;

create index on stop_times(tc_date);
create index on stop_times(trip_id);
create index on stop_times(stop_id);
create index on stop_times (trip_rank);
create index on stop_times (row_num);
create unique index on stop_times (trip_rank,row_num);
create unique index on stop_times (trip_rank,stop_sequence);

-- Load table into graph schema
do $$
declare graphname varchar(32);
        graph_start timestamp;
        graph_end timestamp;
        duration time;
begin
graph_start := clock_timestamp();
graph_end := clock_timestamp();
raise notice 'Start of query: %',graph_start;

perform create_stop_times('transport_network',
a.tc_date,a.trip_id,a.stop_id,
a.arrival_time,a.arrival_time_sec,
a.departure_time,a.departure_time_sec,
a.row_num,a.pickup_type,a.drop_off_type)
from stop_times a;
graph_end := clock_timestamp();

raise notice 'End of query: %',graph_end;
raise notice 'Duration of graph query: %', graph_end-graph_start;
end;$$;
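One further avenue worth testing: each create_stop_times call MATCHes on t.id and s.id, which scans the trips and stops label tables in full unless they are indexed. A hedged sketch follows; the schema and label-table names are assumptions based on the graph name, so verify them in your catalog before running.

```sql
-- AGE stores each label in its own table under a schema named after the
-- graph. A GIN index on the agtype properties column may let the per-row
-- MATCH on t.id / s.id avoid sequential scans of trips and stops.
CREATE INDEX ON transport_network.trips USING gin (properties);
CREATE INDEX ON transport_network.stops USING gin (properties);
ANALYZE transport_network.trips;
ANALYZE transport_network.stops;
```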
jrgemignani commented 12 months ago

@I33Buckler I'd be careful with StackOverflow answers that have few, if any, responses. That one, in particular, is not very helpful in my opinion.

I feel the following might be of use. I don't know how much experience or expertise you have with graphs or your level of CS understanding, so please don't take offense if I repeat or state the obvious...

A graph in the context of Apache AGE is a computer science abstract data type: https://en.wikipedia.org/wiki/Graph_(abstract_data_type) Apache AGE specifically implements a digraph (directional graph) with labels and properties for each component.

I'm going to give you the following links because, on the surface, it appears that what you are doing might fall into this category of problems.

https://en.wikipedia.org/wiki/Computational_complexity_theory

Specifically, ones like this. https://en.wikipedia.org/wiki/Travelling_salesman_problem

If it does, you're going to need to carefully look at your desired result and possibly restrict it to make it more manageable. Again, I only have a general idea of what you are doing.

Hope this is helpful.

I33Buckler commented 12 months ago

> @I33Buckler I'd be careful with StackOverflow answers that have few, if any, responses. That one, in particular, is not very helpful in my opinion.

Thanks. I referred to it as it was the only answer to the question as to how to load into Apache AGE from an existing table as most other examples involved loading a single row.

> I feel the following might be of use, I don't know how much experience or expertise you have with graphs or your level of CS understanding, so please don't take offense if I repeat or state the obvious...

Not much experience with graphs or CS so explanations are welcome. Always an opportunity to learn.

> A graph in the context of Apache AGE is a computer science abstract data type: https://en.wikipedia.org/wiki/Graph_(abstract_data_type) Apache AGE specifically implements a digraph (directional graph) with labels and properties for each component.
>
> I'm going to give you the following links because, on the surface, it appears that what you are doing might fall into this category of problems.
>
> https://en.wikipedia.org/wiki/Computational_complexity_theory

Thank you.

> Specifically, ones like this. https://en.wikipedia.org/wiki/Travelling_salesman_problem
>
> If it does, you're going to need to carefully look at your desired result and possibly restrict it to make it more manageable. Again, I only have a general idea of what you are doing.
>
> Hope this is helpful.

This is rather helpful. Thank you for your suggestions. The current process completed the creation of the graph node STOP_TIMES and vertices after 12 days. Will look into your suggestions for consideration of how I can improve the process.

The intention of this exercise is to explore the network data identifying transport deserts and comparison of changes as part of the process. It is a self directed exercise which is a work in progress.

I33Buckler commented 11 months ago

@jrgemignani Is it possible or are there plans to enable partitioning a graph node? That would help greatly with processing. At the moment the only solution obvious to me is to create a separate node in the graph for each partition in the database.

jrgemignani commented 11 months ago

> @jrgemignani Is it possible or are there plans to enable partitioning a graph node? That would help greatly with processing. At the moment the only solution obvious to me is to create a separate node in the graph for each partition in the database.

I'm not exactly sure what you mean by partitioning a node.

I33Buckler commented 11 months ago

> @jrgemignani Is it possible or are there plans to enable partitioning a graph node? That would help greatly with processing. At the moment the only solution obvious to me is to create a separate node in the graph for each partition in the database.
>
> I'm not exactly sure what you mean by partitioning a node.

I am wondering if it would be possible to implement something like table partitioning for the underlying table in the graph, as described in the PostgreSQL documentation chapter on table partitioning (5.11). I have done this to the table AllServices.Stop_Times described earlier, which has considerably improved querying of the table. Partitioning enables querying the individual tables by addressing the parent table as a single table, with the criteria that define the partition.

jrgemignani commented 11 months ago

@rafsun42 Thoughts?

rafsun42 commented 11 months ago

@I33Buckler

> I have done this to the table AllServices.Stop_Times

Could you please elaborate how you did the partitioning? Is AllServices.Stop_Times an AGE table?

I33Buckler commented 11 months ago

> @I33Buckler
>
> > I have done this to the table AllServices.Stop_Times
>
> Could you please elaborate how you did the partitioning? Is AllServices.Stop_Times an AGE table?

AllServices.Stop_Times is not an AGE table. It was mentioned to note that partitioning works natively in PostgreSQL, whereas creating the AGE table from the native table requires the partitions to be combined into a single table, which eliminates the query performance improvement available in the partitioned native table. Each partition is created by a variation on the following query, which creates a partition of records where tc_date='2020-04-06'. Note that the table allservices.stop_times is defined as a general table statement from which the partition inherits attributes in this query.

CREATE TABLE allservices.stop_times_20200406 (
    CONSTRAINT stop_times_ck_20200406 CHECK ((tc_date = '2020-04-06'::date)),
    CONSTRAINT stop_times_pk_20200406 PRIMARY KEY (tc_date, trip_id, stop_sequence)
)
INHERITS (allservices.stop_times);
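As an aside, the INHERITS approach above is the older table-inheritance style of partitioning; PostgreSQL 10+ offers declarative partitioning, which gives the planner partition pruning without per-table CHECK constraints. A sketch with hypothetical names mirroring the example:

```sql
-- Declarative equivalent: a parent partitioned by tc_date, with one
-- list partition per service date (table names here are illustrative).
CREATE TABLE allservices.stop_times_part (
    LIKE allservices.stop_times
) PARTITION BY LIST (tc_date);

CREATE TABLE allservices.stop_times_p20200406
    PARTITION OF allservices.stop_times_part
    FOR VALUES IN ('2020-04-06');
```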
rafsun42 commented 11 months ago

@I33Buckler Currently, AGE tables are partitioned by their label name. Further partitioning a label's table may be useful. It may speed up (some) MATCH queries more than CREATE queries, if that's what you are referring to.

I33Buckler commented 11 months ago

> @I33Buckler Currently, AGE tables are partitioned by their label name. Further partitioning a label's table may be useful. It may speed up (some) MATCH queries more than CREATE queries, if that's what you are referring to.

Thank you for your response. Speeding up MATCH queries would be helpful, as would minimising the amount of the table loaded into memory. In addition, the intention is to be able to compare layers without having to construct 23 individual graphs, one for each layer.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 45 days with no activity. Remove "Abondoned" label or comment or this will be closed in 7 days.

github-actions[bot] commented 5 months ago

This issue was closed because it has been stalled for further 7 days with no activity.