cwida / duckpgq-extension

DuckDB extension that adds support for SQL/PGQ
https://duckpgq.notion.site/b8ac652667964f958bfada1c3e53f1bb?v=3b47a8d44bdf4e0c8b503bf23f1b76f2
MIT License
72 stars 7 forks source link

Move path functions to CTE #131

Closed Dtenwolde closed 2 months ago

Dtenwolde commented 2 months ago

Currently, for every function call related to path-finding (path_length, element_id, vertices, edges), we re-execute the shortest path (or iterativelength) function. Moving this to a separate CTE will enable us to execute the function only once, from which the result can be referenced many times.

The query is not completely correct but it proves the point:


WITH cte1 AS (CSR_CREATION_QUERY)
SELECT //(len(shortestpath(0, (SELECT count(a.id) FROM Student AS a), a.rowid, b.rowid)), 2) AS "path_length(p)",
       shortestpath(0, (SELECT count(a.id) FROM Student AS a), a.rowid, b.rowid) AS "element_id(p)",
       a."name" AS "name",
       b."name" AS b_name
FROM (
  SELECT multiply(0, count(cte1."temp")) AS "temp"
  FROM cte1
) AS __x,
Student AS b,
Student AS a
WHERE (add(__x."temp", iterativelength(0, (SELECT count(a.id) FROM Student AS a), a.rowid, b.rowid)) BETWEEN 1 AND 3)
) AS study
Dtenwolde commented 2 months ago

Explanation

We can materialize the results of the shortest path function, which saves computation time if the function is referenced multiple times in the same query. The trade-off is that we must connect the source and destination IDs within the Common Table Expression (CTE) to the main part of the query when referencing other columns. In the query below, we want to return the first and last names for both the source and destination. These columns are referenced in the main query, while the shortest paths for these pairs are computed in the CTE. We "connect" these by performing a join on the source rowid and destination rowid.

Shortest Path CTE Query

WITH shortest_path_p AS MATERIALIZED (
    SELECT shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid) AS path, 
           p1.rowid AS src_id, 
           p2.rowid AS dst_id
    FROM Person p1
    JOIN Person p2 ON p1.rowid <> p2.rowid
    CROSS JOIN (SELECT count(cte1.temp) * 0 AS temp FROM cte1) __x
    WHERE p1.firstName = 'John' AND __x.temp * 0 = 0
)
SELECT path, 
       p1.firstName || ' ' || p1.lastName AS src_name, 
       p2.firstName || ' ' || p2.lastName AS dst_name,
       len(path) // 2,
       path[1:-:2] AS vertices,
       path[2:-:2] AS edges
FROM shortest_path_p
JOIN Person p1 ON p1.rowid = src_id
JOIN Person p2 ON p2.rowid = dst_id
WHERE p1.firstName = 'John';

Performance Comparison

Initial Results for the Worst Case Scenario

Running a query starting from any person named 'John' with no lower or upper limit on the number of hops.

Old Method:

CTE Method:

Initial results for Best Case Scenario

Only referencing the shortest path function once shows that the performance is equal between the old and new method. The shortest path UDF dominates the running time.

Old Method:

CTE Method:

Queries

Old Method Query

WITH cte1 AS (
    SELECT CREATE_CSR_EDGE(
        0,
        (SELECT count(p.id) FROM Person p),
        CAST (
            (SELECT sum(CREATE_CSR_VERTEX(
                        0,
                        (SELECT count(p.id) FROM Person p),
                        sub.dense_id,
                        sub.cnt)
                        )
            FROM (
                SELECT p.rowid AS dense_id, count(knows.Person1Id) AS cnt
                FROM Person p
                LEFT JOIN Person_knows_Person knows ON knows.Person1Id = p.id
                GROUP BY p.rowid) sub
            )
        AS BIGINT),
        p1.rowid,
        p2.rowid,
        knows.rowid
    ) AS temp
    FROM Person_knows_Person knows
    JOIN Person p1 ON p1.id = knows.Person1Id
    JOIN Person p2 ON p2.id = knows.Person2Id
)
SELECT shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid) AS path,
       p1.firstName || ' ' || p1.lastName AS src_name, 
       p2.firstName || ' ' || p2.lastName AS dst_name,
       len(shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid)) // 2 AS path_length,
       shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid)[1:-:2] AS vertices,
       shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid)[2:-:2] AS edges
FROM Person p1, Person p2, (SELECT count(cte1.temp) * 0 AS temp FROM cte1) __x
WHERE p1.firstName = 'John' AND __x.temp * 0 = 0;

CTE Method Query

WITH cte1 AS (
    SELECT CREATE_CSR_EDGE(
        0,
        (SELECT count(p.id) FROM Person p),
        CAST (
            (SELECT sum(CREATE_CSR_VERTEX(
                        0,
                        (SELECT count(p.id) FROM Person p),
                        sub.dense_id,
                        sub.cnt)
                        )
            FROM (
                SELECT p.rowid AS dense_id, count(knows.Person1Id) AS cnt
                FROM Person p
                LEFT JOIN Person_knows_Person knows ON knows.Person1Id = p.id
                GROUP BY p.rowid) sub
            )
        AS BIGINT),
        p1.rowid,
        p2.rowid,
        knows.rowid
    ) AS temp
    FROM Person_knows_Person knows
    JOIN Person p1 ON p1.id = knows.Person1Id
    JOIN Person p2 ON p2.id = knows.Person2Id
), shortest_path_p AS MATERIALIZED (
    SELECT shortestpath(0, (SELECT count(*) FROM Person), p1.rowid, p2.rowid) AS path, 
           p1.rowid AS src_id, 
           p2.rowid AS dst_id
    FROM Person p1
    JOIN Person p2 ON p1.rowid <> p2.rowid
    CROSS JOIN (SELECT count(cte1.temp) * 0 AS temp FROM cte1) __x
    WHERE p1.firstName = 'John' AND __x.temp * 0 = 0
)
SELECT path, 
       p1.firstName || ' ' || p1.lastName AS src_name, 
       p2.firstName || ' ' || p2.lastName AS dst_name,
       len(path) // 2,
       path[1:-:2] AS vertices,
       path[2:-:2] AS edges
FROM shortest_path_p
JOIN Person p1 ON p1.rowid = src_id
JOIN Person p2 ON p2.rowid = dst_id
WHERE p1.firstName = 'John';