cwida / duckpgq-extension

DuckDB extension that adds support for SQL/PGQ
https://duckpgq.notion.site/b8ac652667964f958bfada1c3e53f1bb?v=3b47a8d44bdf4e0c8b503bf23f1b76f2
MIT License

Implement PathFindingOperator #93

Open Dtenwolde opened 7 months ago

Dtenwolde commented 7 months ago

This issue tracks progress on the PathFindingOperator.

Work is happening on the pathfindingoperator branches: https://github.com/cwida/duckpgq-extension/tree/pathfindingoperator and https://github.com/cwida/duckdb-pgq/tree/pathfindingoperator (make sure to be on that branch in both repositories).

The idea is to create a path-finding operator with two sinks, similar to the IEJoin. We insert it in this function, in place of the iterativelength() UDF. Because the binding phase generates a logical query plan, a physical path-finding operator cannot be inserted there yet, but the two sinks need to be created here. One side holds the (src, dst) pairs (the tasks) and the other side holds the CSR. Importantly, the CSR side is built without the CREATE_CSR_EDGE() UDF, because that work will be done in one of the sinks of the new operator.

This can include optimizations such as https://github.com/cwida/duckpgq-extension/issues/23

Plan for now:

  1. Get the CSR as the first sink to this new operator.
  2. Get the (src,dst)-pairs as the second sink.
  3. Implement the path-finding algorithm.
  4. Look into how to parallelize.
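To make steps 1 to 3 concrete, here is a minimal Python sketch of what the operator would conceptually do once both sinks are filled: build a CSR from the edges, then answer each (src,dst) pair with a BFS. This is only an illustration, not the operator code. The function names and the toy graph are made up, vertices are assumed to already have dense ids (like the rowid used in the query), and the mapping of the two build phases to CREATE_CSR_VERTEX / CREATE_CSR_EDGE is a loose analogy.

```python
from collections import deque


def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, targets) from (src, dst) edges on dense ids."""
    # Phase 1: count outgoing edges per vertex (loosely the CREATE_CSR_VERTEX step).
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # A prefix sum turns the per-vertex counts into start offsets.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Phase 2: place each edge target in its slot (loosely the CREATE_CSR_EDGE step).
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def shortest_length(offsets, targets, src, dst):
    """Unweighted shortest path length via BFS, or None if dst is unreachable."""
    if src == dst:
        return 0
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for i in range(offsets[v], offsets[v + 1]):
            w = targets[i]
            if w not in dist:
                dist[w] = dist[v] + 1
                if w == dst:
                    return dist[w]
                queue.append(w)
    return None


# Hypothetical toy graph: 0 -> 1 -> 2 -> 3, plus a shortcut 0 -> 2.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
offsets, targets = build_csr(4, edges)

# The second sink: the (src, dst) tasks.
pairs = [(0, 3), (1, 3), (3, 0)]
lengths = [shortest_length(offsets, targets, s, d) for s, d in pairs]
```

Each pair is an independent task over a read-only CSR, which is what should make step 4 (parallelization) feasible: the pairs can be split across threads once the CSR is complete.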

Potential optimizations:

  1. Duplicate (src,dst)-pairs: execute each distinct pair only once, and later expand the result again to get a row for every input pair.
  2. The same src occurring many times: collapse these into one src, then fully explore the graph from it once.
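A rough Python sketch of both optimizations combined, assuming a plain adjacency dict instead of the CSR to keep it short. The names (lengths_for_pairs, bfs_distances) and the toy data are hypothetical and only meant to show the shape of the idea.

```python
from collections import defaultdict, deque


def bfs_distances(adj, src):
    """Hop distance from src to every reachable vertex."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def lengths_for_pairs(adj, pairs):
    """Answer every (src, dst) pair while running one BFS per distinct src."""
    # Collapse duplicate pairs and group the wanted destinations by source.
    dsts_by_src = defaultdict(set)
    for src, dst in pairs:
        dsts_by_src[src].add(dst)
    # One full exploration per distinct source.
    unique_result = {}
    for src, dsts in dsts_by_src.items():
        dist = bfs_distances(adj, src)
        for dst in dsts:
            unique_result[(src, dst)] = dist.get(dst)  # None if unreachable
    # Expand again: one output row per input row, in input order.
    rows = [(src, dst, unique_result[(src, dst)]) for src, dst in pairs]
    return rows, len(dsts_by_src)


# Hypothetical toy data: four input rows, but only two distinct sources.
adj = {0: [1, 2], 1: [2], 2: [3]}
pairs = [(0, 3), (0, 3), (0, 1), (1, 3)]
rows, bfs_runs = lengths_for_pairs(adj, pairs)
```

Here four input rows need only two searches. The trade-off is that a full exploration per source can do more work than an early-exit search when a source has a single nearby destination.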

An example query showing what we have so far (initial idea):

SELECT *
FROM pairs AS p
WHERE p.src BETWEEN (
    SELECT csr_id
    FROM (
        SELECT
            0 AS csr_id,
            (SELECT count(a.id) FROM Student a),
            CAST((
                SELECT sum(CREATE_CSR_VERTEX(
                    0,
                    (SELECT count(a.id) FROM Student a),
                    sub.dense_id,
                    sub.cnt))
                FROM (
                    SELECT a.rowid AS dense_id, count(k.src) AS cnt
                    FROM Student a
                    LEFT JOIN Knows k ON k.src = a.id
                    GROUP BY a.rowid
                ) sub
            ) AS BIGINT),
            a.rowid,
            c.rowid,
            k.rowid
        FROM Knows k
        JOIN Student a ON a.id = k.src
        JOIN Student c ON c.id = k.dst
    )
) AND p.dst;

Dtenwolde commented 7 months ago

TODO: Figure out how to include the lower and upper bounds in the query.

Dtenwolde commented 6 months ago

In the current implementation, the following query first computes all shortest paths and only then filters the result down to the requested pairs:

-FROM GRAPH_TABLE (pg
    MATCH p = ANY SHORTEST (a:Person)-[k:Knows]->{2,3}(b:Person)
    WHERE (a.id, b.id) in (SELECT (src, dst) FROM pairs)
    COLUMNS (a.id AS id1, b.id AS id2, element_id(p))
    ) tmp
    ORDER BY tmp.id1, tmp.id2;

Ideally, it should first apply the filter on the pairs and only then run the shortest-path function. This could become an optimization rule if we can detect this pattern.
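To illustrate why the order matters, here is a small Python sketch comparing the two plans on a toy graph. It ignores the {2,3} bounds and the path itself (it only computes lengths), and all names are hypothetical. The point is only that both orders return the same rows while the filter-first plan starts far fewer searches.

```python
from collections import deque


def bfs_distances(adj, src):
    """Hop distance from src to every reachable vertex."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def compute_then_filter(adj, vertices, pairs):
    """Behaviour described above: search from every vertex, filter afterwards."""
    wanted = set(pairs)
    all_rows = {(s, d): n for s in vertices for d, n in bfs_distances(adj, s).items()}
    return {p: n for p, n in all_rows.items() if p in wanted}, len(vertices)


def filter_then_compute(adj, pairs):
    """Desired behaviour: restrict to the sources in pairs before searching."""
    wanted = set(pairs)
    sources = {s for s, _ in wanted}
    rows = {}
    for s in sources:
        dist = bfs_distances(adj, s)
        rows.update({(s, d): dist[d] for s2, d in wanted if s2 == s and d in dist})
    return rows, len(sources)


# Hypothetical toy data: 4 vertices, but only 2 sources appear in pairs.
adj = {0: [1, 2], 1: [2], 2: [3]}
vertices = [0, 1, 2, 3]
pairs = [(0, 3), (1, 3)]

late_rows, late_searches = compute_then_filter(adj, vertices, pairs)
early_rows, early_searches = filter_then_compute(adj, pairs)
```

On a real graph the gap would be the number of Person vertices versus the number of distinct sources in pairs.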

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.