Closed Dtenwolde closed 2 months ago
We can materialize the results of the shortest path function, which saves computation time if the function is referenced multiple times in the same query. The trade-off is that we must connect the source and destination IDs within the Common Table Expression (CTE) to the main part of the query when referencing other columns. In the query below, we want to return the first and last names for both the source and destination. These columns are referenced in the main query, while the shortest paths for these pairs are computed in the CTE. We "connect" these by performing a join on the source rowid
and destination rowid
.
WITH shortest_path_p AS MATERIALIZED (
SELECT shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid) AS path,
p1.rowid AS src_id,
p2.rowid AS dst_id
FROM Person p1
JOIN Person p2 ON p1.rowid <> p2.rowid
CROSS JOIN (SELECT count(cte1.temp) * 0 AS temp FROM cte1) __x
WHERE p1.firstName = 'John' AND __x.temp * 0 = 0
)
SELECT path,
p1.firstName || ' ' || p1.lastName AS src_name,
p2.firstName || ' ' || p2.lastName AS dst_name,
len(path) // 2,
path[1:-:2] AS vertices,
path[2:-:2] AS edges
FROM shortest_path_p
JOIN Person p1 ON p1.rowid = src_id
JOIN Person p2 ON p2.rowid = dst_id
WHERE p1.firstName = 'John';
Running a query starting from any person named 'John' with no lower or upper limit on the number of hops.
Old Method:
CTE Method:
Only referencing the shortest path function once shows that the performance is equal between the old and new method. The shortest path UDF dominates the running time.
Old Method:
CTE Method:
WITH cte1 AS (
SELECT CREATE_CSR_EDGE(
0,
(SELECT count(p.id) FROM Person p),
CAST (
(SELECT sum(CREATE_CSR_VERTEX(
0,
(SELECT count(p.id) FROM Person p),
sub.dense_id,
sub.cnt)
)
FROM (
SELECT p.rowid AS dense_id, count(knows.Person1Id) AS cnt
FROM Person p
LEFT JOIN Person_knows_Person knows ON knows.Person1Id = p.id
GROUP BY p.rowid) sub
)
AS BIGINT),
p1.rowid,
p2.rowid,
knows.rowid
) AS temp
FROM Person_knows_Person knows
JOIN Person p1 ON p1.id = knows.Person1Id
JOIN Person p2 ON p2.id = knows.Person2Id
)
SELECT shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid) AS path,
p1.firstName || ' ' || p1.lastName AS src_name,
p2.firstName || ' ' || p2.lastName AS dst_name,
len(shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid)) // 2 AS path_length,
shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid)[1:-:2] AS vertices,
shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid)[2:-:2] AS edges
FROM Person p1, Person p2, (SELECT count(cte1.temp) * 0 AS temp FROM cte1) __x
WHERE p1.firstName = 'John' AND __x.temp * 0 = 0;
WITH cte1 AS (
SELECT CREATE_CSR_EDGE(
0,
(SELECT count(p.id) FROM Person p),
CAST (
(SELECT sum(CREATE_CSR_VERTEX(
0,
(SELECT count(p.id) FROM Person p),
sub.dense_id,
sub.cnt)
)
FROM (
SELECT p.rowid AS dense_id, count(knows.Person1Id) AS cnt
FROM Person p
LEFT JOIN Person_knows_Person knows ON knows.Person1Id = p.id
GROUP BY p.rowid) sub
)
AS BIGINT),
p1.rowid,
p2.rowid,
knows.rowid
) AS temp
FROM Person_knows_Person knows
JOIN Person p1 ON p1.id = knows.Person1Id
JOIN Person p2 ON p2.id = knows.Person2Id
), shortest_path_p AS MATERIALIZED (
SELECT shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid) AS path,
p1.rowid AS src_id,
p2.rowid AS dst_id
FROM Person p1
JOIN Person p2 ON p1.rowid <> p2.rowid
CROSS JOIN (SELECT count(cte1.temp) * 0 AS temp FROM cte1) __x
WHERE p1.firstName = 'John' AND __x.temp * 0 = 0
)
SELECT path,
p1.firstName || ' ' || p1.lastName AS src_name,
p2.firstName || ' ' || p2.lastName AS dst_name,
len(path) // 2,
path[1:-:2] AS vertices,
path[2:-:2] AS edges
FROM shortest_path_p
JOIN Person p1 ON p1.rowid = src_id
JOIN Person p2 ON p2.rowid = dst_id
WHERE p1.firstName = 'John';
Currently, for every function call related to path-finding (path_length, element_id, vertices, edges), we re-execute the shortest path (or iterativelength) function. Moving this to a separate CTE will enable us to execute the function only once, from which the result can be referenced many times.
The query is not completely correct but it proves the point: