Closed: I33Buckler closed this issue 5 months ago.
@I33Buckler AGE has utilities to load graph from CSV. Can it be used instead of plpgsql functions? https://age.apache.org/age-manual/master/intro/agload.html
Also, are you using the latest master? A patch on detach delete was merged recently.
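For reference, the CSV loaders mentioned above can be called from plain SQL. A minimal sketch, assuming the function names documented on the linked agload page, the graph name used in this thread, and a placeholder file path:

```sql
-- Sketch only: function names are taken from the linked AGE manual page;
-- the CSV path is a placeholder for a file on the server's filesystem.
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

SELECT create_graph('transport_network');

-- Load one vertex label from a CSV file.
SELECT load_labels_from_file('transport_network', 'stop_times',
                             '/path/to/stop_times.csv');
```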
> @I33Buckler AGE has utilities to load graph from CSV. Can it be used instead of plpgsql functions? https://age.apache.org/age-manual/master/intro/agload.html
It is rather limiting for a database extension to require exporting a table to CSV and processing it outside the database; I would expect to be able to process the tables directly. I found this page a useful reference: https://stackoverflow.com/questions/75178525/is-it-possible-to-create-a-graph-in-age-using-existing-table-in-the-database The reason for the looping is that when I processed a table as a direct read, the function completed without reading the entire table into the graph. Note that this was developed on smaller tables; the next smaller table after Stop_Times has about 35,000 rows.
> Also, are you using the latest master? A patch on detach delete was merged recently.

Thank you for the reminder. I have pulled the updates from the repository and rebuilt the module; the reported version is 1.4.0. Note that detach delete is only called once, to initialise the graph node before loading the contents in the loop. It was added to remove nodes that were created while testing specific sections of the table AllServices.Stop_Times. The process is still taking a long time to run.
@I33Buckler Your create_stop_times function is O(N^3), and it sits inside a doubly nested loop that also does SELECTs from tables at each level. This can lead to as much as O(N^5) (possibly more) runtime for the whole process.
Even without considering the 2 external loops, which could be O(N^2), just the O(N^3) for create_stop_times is bad enough to make what you are doing viable only for small datasets.
My suggestion is that you need to rewrite your code to make it more efficient. At this moment, this does not appear to be an Apache AGE issue.
If you have additional questions or need some guidance, please do ask and we will try to help where possible.
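One concrete way to attack the per-row MATCH cost would be an expression index on the matched property, so each lookup becomes an index scan rather than a scan of every vertex. This is only a sketch, under the assumption that AGE stores each label in its own table with an agtype `properties` column; verify the table names, the `agtype_access_operator` function, and whether your AGE version's planner actually uses such indexes for Cypher WHERE clauses before relying on it:

```sql
-- Hypothetical sketch: index the 'id' property on each label table so
-- MATCH ... WHERE t.id = ... can use an index instead of scanning every
-- vertex for every inserted row. Names assume AGE's default per-label tables.
CREATE INDEX trips_id_idx ON transport_network.trips
    (ag_catalog.agtype_access_operator(properties, '"id"'::agtype));
CREATE INDEX stops_id_idx ON transport_network.stops
    (ag_catalog.agtype_access_operator(properties, '"id"'::agtype));
ANALYZE transport_network.trips;
ANALYZE transport_network.stops;
```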
> @I33Buckler Your create_stop_times function is O(N^3) and that is inside a doubly nested loop that also does SELECTs from tables for each level. This can lead to as much as O(N^5) (possibly more) runtime for the whole process. Even without considering the 2 external loops which could be O(N^2), just the O(N^3) for create_stop_times is bad enough to make what you are doing only viable for small datasets. My suggestion is that you need to rewrite your code to make it more efficient. At this moment, this does not appear to be an Apache AGE issue.
> If you have additional questions or need some guidance, please do ask and we will try to help where possible.
Thank you for the response. Very helpful. One of the reasons for looping was that running the query on the table without looping resulted in it failing, with the following messages in the PostgreSQL log. Note that the process is running directly on the server.
2023-10-26 23:13:45.675 AEDT [44900] LOG: checkpoint starting: time
2023-10-26 23:14:59.530 AEDT [44900] LOG: checkpoint complete: wrote 725 buffers (0.1%); 0 WAL file(s) added, 0 removed, 1 recycled; write=72.978 s, sync=0.509 s, total=73.855 s; sync files=79, longest=0.089 s, average=0.007 s; distance=9734 kB, estimate=23911 kB
2023-10-26 23:16:36.209 AEDT [45561] matthew@tcanalysis LOG: could not receive data from client: Connection reset by peer
2023-10-26 23:16:36.210 AEDT [45558] matthew@tcanalysis LOG: could not receive data from client: Connection reset by peer
2023-10-26 23:16:36.210 AEDT [45559] matthew@tcanalysis LOG: could not receive data from client: Connection reset by peer
2023-10-26 23:16:36.210 AEDT [45572] matthew@tcanalysis LOG: could not receive data from client: Connection reset by peer
2023-10-26 23:16:36.212 AEDT [45560] matthew@tcanalysis LOG: could not receive data from client: Connection reset by peer
2023-10-26 23:16:36.212 AEDT [45562] matthew@tcanalysis LOG: could not receive data from client: Connection reset by peer
2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis LOG: could not send data to client: Broken pipe
2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis STATEMENT: select create_stop_times('transport_network',
a.tc_date,a.trip_id,a.stop_id,
a.arrival_time,a.arrival_time_sec,
a.departure_time,a.departure_time_sec,
a.row_num,a.pickup_type,a.drop_off_type)
from stop_times a
;
2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis FATAL: connection to client lost
2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis STATEMENT: select create_stop_times('transport_network',
a.tc_date,a.trip_id,a.stop_id,
a.arrival_time,a.arrival_time_sec,
a.departure_time,a.departure_time_sec,
a.row_num,a.pickup_type,a.drop_off_type)
from stop_times a
;
2023-10-26 23:18:45.540 AEDT [44900] LOG: checkpoint starting: time
I've partitioned the table by tc_date, processing by looping through each tc_date, which hopefully will improve the timing. The result is 23 separate steps, one for each tc_date.
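The per-date segmentation described above can be sketched as a single plpgsql loop. The names are taken from the code shown later in this thread; this is illustrative, not the exact script used:

```sql
-- Sketch: run the row-loading function once per tc_date segment,
-- rather than over the whole stop_times table in one statement.
DO $$
DECLARE
    d date;
BEGIN
    FOR d IN SELECT DISTINCT tc_date FROM stop_times ORDER BY tc_date LOOP
        RAISE NOTICE 'Loading segment for tc_date %', d;
        PERFORM create_stop_times('transport_network',
                                  a.tc_date, a.trip_id, a.stop_id,
                                  a.arrival_time, a.arrival_time_sec,
                                  a.departure_time, a.departure_time_sec,
                                  a.row_num, a.pickup_type, a.drop_off_type)
        FROM stop_times a
        WHERE a.tc_date = d;
    END LOOP;
END;
$$;
```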
A factor I've noticed with regard to the performance difference between joining the PostgreSQL tables in the database and joining the nodes is the difference in estimated cost. The native query below returns a total cost of 296852.63 from EXPLAIN.
select st.* from
allservices.stops s,
allservices.trips t,
allservices.stop_times st
where s.tc_date=t.tc_date and t.tc_date=st.tc_date
and s.stop_id=st.stop_id and t.trip_id=st.trip_id
;
Whereas the graph query below returns a total cost of 2115202.70 from EXPLAIN:
CREATE OR REPLACE FUNCTION create_stop_times(graph_name text, tc_date date, trip_id text, stop_id text,
arrival_time text, arrival_time_sec bigint, departure_time text, departure_time_sec bigint,
stop_sequence bigint, pickup_type int, drop_off_type int)
returns text
LANGUAGE plpgsql
VOLATILE
as $stop_times$
declare
nodename text := graph_name || '.' || 'stop_times';
BEGIN
execute
format ('select * from
cypher(''%1$s'',
$$match (t:trips),(s:stops)
where t.id=%2$s and s.id=%3$s and t.tc_date=%11$s and t.tc_date=s.tc_date
create(t)<-
[:PART_OF_TRIP]-
(st:stop_times {tc_date: %11$s, arrival_time: %4$s, arrival_time_seconds: %5$s, departure_time: %6$s, departure_time_seconds: %7$s,
stop_sequence: %8$s, pickup_type: %9$s, drop_off_type: %10$s})-
[:LOCATED_AT]->(s)$$) as (st agtype);',
quote_ident(graph_name),quote_ident(trip_id),quote_ident(stop_id),
quote_ident(arrival_time),to_char(arrival_time_sec,'999999'),
quote_ident(departure_time),to_char(departure_time_sec,'999999'),
to_char(stop_sequence,'9999'),to_char(pickup_type,'9'),to_char(drop_off_type,'9'),
format('"%s"',to_char(tc_date,'yyyy-mm-dd')));
return nodename;
END
$stop_times$
;
select create_stop_times('transport_network',a.tc_date,a.trip_id,a.stop_id,a.arrival_time,a.arrival_time_sec,a.departure_time,a.departure_time_sec,a.row_num,a.pickup_type,a.drop_off_type)
from stop_times a
;
This is about a 7:1 difference in cost between the native SQL and the graph query.
@I33Buckler If there is a cypher command causing a crash, or failing, we can try to look into that but, we will need more information about what is crashing or failing. The log information, while nice to show there might be an issue, isn't enough to help.
> 2023-10-26 23:16:36.212 AEDT [45562] matthew@tcanalysis LOG: could not receive data from client: Connection reset by peer
> 2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis LOG: could not send data to client: Broken pipe
As for the estimated cost difference between the two queries, remember that you are comparing two very different queries with very different runtimes. For example, the SQL query doesn't even write to another table and one can't just assume that a Cypher query only has to write to just one table to save data.
I know it's easier to expect them to work similarly because the languages appear similar but, that's about as similar as they get.
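For what it's worth, a cypher() call can be fed to EXPLAIN like any other set-returning function call, which is presumably how the cost figures above were obtained. A minimal sketch against the graph used in this thread:

```sql
-- Planner estimate for a graph-side query; compare with EXPLAIN on the plain
-- SQL join. Costs are estimates in arbitrary planner units, not milliseconds.
EXPLAIN
SELECT * FROM cypher('transport_network', $$
    MATCH (st:stop_times)-[:LOCATED_AT]->(s:stops)
    RETURN st
$$) AS (st agtype);
```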
> @I33Buckler If there is a cypher command causing a crash, or failing, we can try to look into that but, we will need more information about what is crashing or failing. The log information, while nice to show there might be an issue, isn't enough to help.
> 2023-10-26 23:16:36.212 AEDT [45562] matthew@tcanalysis LOG: could not receive data from client: Connection reset by peer
> 2023-10-26 23:17:20.903 AEDT [45099] matthew@tcanalysis LOG: could not send data to client: Broken pipe
> As for the estimated cost difference between the two queries, remember that you are comparing two very different queries with very different runtimes. For example, the SQL query doesn't even write to another table and one can't just assume that a Cypher query only has to write to just one table to save data.
> I know it's easier to expect them to work similarly because the languages appear similar but, that's about as similar as they get.
Thanks for the clarification regarding the cypher query process. Unfortunately, the log is the only information I could find at the time of the query crash. So far, the process is still running.
@I33Buckler If you are using that function, with the O(N^3) runtime, or greater, I wouldn't wait as it could take a very long time. You need to find a way to write a more efficient query to do what you are doing.
> @I33Buckler If you are using that function, with the O(N^3) runtime, or greater, I wouldn't wait as it could take a very long time. You need to find a way to write a more efficient query to do what you are doing.
Thank you. Hence the question. I modified the example described in "Is it possible to create a graph in AGE using an existing table in the database?" to load directly from a table, as shown in the code below. The problem was that running this process as is, without segmentation, resulted in an unexpected crash of the query. I tested the process by segmenting on tc_date, which created 23 segments, and ran a couple of tests. Each test ran successfully, unlike the full load of the table. Hence the decision to loop the process over tc_date, processing each segment separately. Until I can determine the cause of the crash and a possible solution, segmentation is the best response. Any suggestions on improving the process below are welcome.
-- Function to load database row into graph schema
CREATE OR REPLACE FUNCTION create_stop_times(graph_name text, tc_date date, trip_id text, stop_id text,
arrival_time text, arrival_time_sec bigint, departure_time text, departure_time_sec bigint,
stop_sequence bigint, pickup_type int, drop_off_type int)
returns text
LANGUAGE plpgsql
VOLATILE
as $stop_times$
declare
nodename text := graph_name || '.' || 'stop_times';
BEGIN
execute
format ('select * from
cypher(''%1$s'',
$$match (t:trips),(s:stops)
where t.id=%2$s and s.id=%3$s and t.tc_date=%11$s and t.tc_date=s.tc_date
create(t)<-
[:PART_OF_TRIP]-
(st:stop_times {tc_date: %11$s, arrival_time: %4$s, arrival_time_seconds: %5$s, departure_time: %6$s, departure_time_seconds: %7$s,
stop_sequence: %8$s, pickup_type: %9$s, drop_off_type: %10$s})-
[:LOCATED_AT]->(s)$$) as (st agtype);',
quote_ident(graph_name),quote_ident(trip_id),quote_ident(stop_id),
quote_ident(arrival_time),to_char(arrival_time_sec,'999999'),
quote_ident(departure_time),to_char(departure_time_sec,'999999'),
to_char(stop_sequence,'9999'),to_char(pickup_type,'9'),to_char(drop_off_type,'9'),
format('"%s"',to_char(tc_date,'yyyy-mm-dd')));
return nodename;
END
$stop_times$
;
-- Create table with row number for referencing during step through
create temp table stop_times as select dense_rank() over (order by tc_date,trip_id) as trip_rank,
row_number() over (partition by tc_date,trip_id order by stop_sequence) as row_num,* from allservices.stop_times
order by row_num;
create index on stop_times(tc_date);
create index on stop_times(trip_id);
create index on stop_times(stop_id);
create index on stop_times (trip_rank);
create index on stop_times (row_num);
create unique index on stop_times (trip_rank,row_num);
create unique index on stop_times (trip_rank,stop_sequence);
-- Load table into graph schema
do $$
declare graphname varchar(32);
graph_start timestamp;
graph_end timestamp;
duration time;
begin
graph_start := clock_timestamp();
graph_end := clock_timestamp();
raise notice 'Start of query: %',graph_start;
perform create_stop_times('transport_network',
a.tc_date,a.trip_id,a.stop_id,
a.arrival_time,a.arrival_time_sec,
a.departure_time,a.departure_time_sec,
a.row_num,a.pickup_type,a.drop_off_type)
from stop_times a;
graph_end := clock_timestamp();
raise notice 'End of query: %',graph_end;
raise notice 'Duration of graph query: %', graph_end-graph_start;
end;$$;
@I33Buckler I'd be careful with StackOverflow answers that have few, if any, responses. That one, in particular, is not very helpful in my opinion.
I feel the following might be of use. I don't know how much experience or expertise you have with graphs, or your level of CS understanding, so please don't take offense if I repeat or state the obvious.
A graph, in the context of Apache AGE, is a computer science abstract data type: https://en.wikipedia.org/wiki/Graph_(abstract_data_type). Apache AGE specifically implements a digraph (directed graph) with labels and properties for each component.
I'm going to give you the following links because, on the surface, it appears that what you are doing might fall into this category of problems.
https://en.wikipedia.org/wiki/Computational_complexity_theory
Specifically, ones like this. https://en.wikipedia.org/wiki/Travelling_salesman_problem
If it does, you're going to need to carefully look at your desired result and possibly restrict it to make it more manageable. Again, I only have a general idea of what you are doing.
Hope this is helpful.
> @I33Buckler I'd be careful with StackOverflow answers that have few, if any, responses. That one, in particular, is not very helpful in my opinion.
Thanks. I referred to it because it was the only answer to the question of how to load into Apache AGE from an existing table; most other examples involved loading a single row.
> I feel the following might be of use. I don't know how much experience or expertise you have with graphs, or your level of CS understanding, so please don't take offense if I repeat or state the obvious.
Not much experience with graphs or CS so explanations are welcome. Always an opportunity to learn.
> A graph, in the context of Apache AGE, is a computer science abstract data type: https://en.wikipedia.org/wiki/Graph_(abstract_data_type). Apache AGE specifically implements a digraph (directed graph) with labels and properties for each component.
> I'm going to give you the following links because, on the surface, it appears that what you are doing might fall into this category of problems.
> https://en.wikipedia.org/wiki/Computational_complexity_theory
Thank you.
> Specifically, ones like this: https://en.wikipedia.org/wiki/Travelling_salesman_problem
> If it does, you're going to need to carefully look at your desired result and possibly restrict it to make it more manageable. Again, I only have a general idea of what you are doing.
> Hope this is helpful.
This is rather helpful. Thank you for your suggestions. The current process completed the creation of the graph node STOP_TIMES and its vertices after 12 days. I will look into your suggestions to see how I can improve the process.
The intention of this exercise is to explore the network data identifying transport deserts and comparison of changes as part of the process. It is a self directed exercise which is a work in progress.
@jrgemignani Is it possible or are there plans to enable partitioning a graph node? That would help greatly with processing. At the moment the only solution obvious to me is to create a separate node in the graph for each partition in the database.
> @jrgemignani Is it possible or are there plans to enable partitioning a graph node? That would help greatly with processing. At the moment the only solution obvious to me is to create a separate node in the graph for each partition in the database.
I'm not exactly sure what you mean by partitioning a node.
> @jrgemignani Is it possible or are there plans to enable partitioning a graph node? That would help greatly with processing. At the moment the only solution obvious to me is to create a separate node in the graph for each partition in the database.
> I'm not exactly sure what you mean by partitioning a node.
I am wondering if it would be possible to implement something like table partitioning for the underlying table in the graph, as described in section 5.11, Table Partitioning, of the PostgreSQL documentation. I have done this to the table AllServices.Stop_Times described earlier, which has considerably improved querying of the table. Partitioning enables querying the individual tables by addressing the parent table as a single table, with the criteria that define each partition.
@rafsun42 Thoughts?
@I33Buckler
> I have done this to the table AllServices.Stop_Times

Could you please elaborate how you did the partitioning? Is AllServices.Stop_Times an AGE table?
> Could you please elaborate how you did the partitioning? Is AllServices.Stop_Times an AGE table?
AllServices.Stop_Times is not an AGE table. It was mentioned to note that partitioning works natively in PostgreSQL, whereas creating the AGE table from the native table requires the partitions to be combined into a single table, which eliminates the query performance improvement available in the partitioned native table. Each partition is created by a variation on the following query, which creates a partition of records where tc_date = '2020-04-06'. Note that the table allservices.stop_times is defined by a general table statement, from which the partition inherits attributes in this query.
CREATE TABLE allservices.stop_times_20200406 (
CONSTRAINT stop_times_ck_20200406 CHECK ((tc_date = '2020-04-06'::date)),
CONSTRAINT stop_times_pk_20200406 PRIMARY KEY (tc_date, trip_id, stop_sequence)
)
INHERITS (allservices.stop_times);
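As an aside, the same layout can be expressed with declarative partitioning (PostgreSQL 10 and later), which lets the planner prune partitions on tc_date automatically. The table names and abbreviated column list below are placeholders, not the actual AllServices schema:

```sql
-- Hypothetical declarative-partitioning equivalent; note that the partition
-- key (tc_date) must be part of any primary key on the parent table.
CREATE TABLE allservices.stop_times_decl (
    tc_date       date   NOT NULL,
    trip_id       text   NOT NULL,
    stop_sequence bigint NOT NULL,
    -- ... remaining stop_times columns ...
    PRIMARY KEY (tc_date, trip_id, stop_sequence)
) PARTITION BY LIST (tc_date);

CREATE TABLE allservices.stop_times_decl_20200406
    PARTITION OF allservices.stop_times_decl
    FOR VALUES IN ('2020-04-06');
```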
@I33Buckler Currently, AGE tables are partitioned by their label name. Further partitioning a label's table may be useful. It may speed up (some) MATCH queries more than CREATE queries, if that's what you are referring to.
> @I33Buckler Currently, AGE tables are partitioned by their label name. Further partitioning a label's table may be useful. It may speed up (some) MATCH queries more than CREATE queries, if that's what you are referring to.
Thank you for your response. Speeding up MATCH queries would be helpful, as would minimising the amount of the table loaded into memory. In addition, the intention is to be able to compare layers without having to construct 23 individual graphs, one for each layer.
This issue is stale because it has been open 45 days with no activity. Remove "Abondoned" label or comment or this will be closed in 7 days.
This issue was closed because it has been stalled for further 7 days with no activity.
I am in the process of creating a graph node from a database table that has 7.6 million rows. The process runs a function that loads the table into the graph with loops. On the initial run, the process was still running after nearly 24 hours.
To test the loop, I ran it without loading into the graph; the result was nearly 5 minutes of execution. The code for creating the graph node is below. Removing the comments from the section with the PERFORM statement results in the long-running process. I am wondering if there is something missing from the process.