cwida / duckpgq-extension

DuckDB extension that adds support for SQL/PGQ
https://duckpgq.notion.site/b8ac652667964f958bfada1c3e53f1bb?v=3b47a8d44bdf4e0c8b503bf23f1b76f2
MIT License
72 stars 7 forks source link

Weakly connected component #137

Closed Dtenwolde closed 2 months ago

Dtenwolde commented 2 months ago

Fixes #138

This PR adds the table function weakly_connected_component.

from weakly_connected_component(snb, person, knows);

The table function returns two columns, the primary key of the source table and the componentId. The componentId is equal to the smallest rowid of the connected component.

The function works on an undirected graph, traversed using a CSR data structure and MS-BFS until exhaustion. Then, all the nodes that are reachable from a source are assigned the lowest ID (rowid) in the connected component.

Timings

Reported numbers are the mean runtime over 5 runs only measuring the weakly connected component function. Excluding data loading. Measured on a 2021 Macbook Pro M1.

Scale Factor DuckPGQ Timing NetworkX Timing Speedup (NetworkX / DuckPGQ) Neo4j Timing Speedup (Neo4j / DuckPGQ)
SF1 0.093s 0.022s 0.24x 0.200s 2.15x
SF3 0.227s 0.074s 0.33x 0.349s 1.54x
SF10 0.768s 0.370s 0.48x 0.970s 1.26x

DuckPGQ

-CREATE PROPERTY GRAPH snb
  VERTEX TABLES (
 Person
 )
  EDGE TABLES (
  Person_knows_person     SOURCE KEY (Person1Id) REFERENCES Person (id)
                              DESTINATION KEY (Person2Id) REFERENCES Person (id)
                             LABEL Knows);

from weakly_connected_component(snb, person, knows);

NetworkX

def calculate_cc(G: nx.Graph) -> pd.DataFrame:
    # works on undirected graph
    connected_components = [c for c in sorted(nx.connected_components(G), key=len, reverse=True)]
    # Prepare lists to hold the IDs and their corresponding componentId
    ids = []
    component_ids = []

    for component in connected_components:
        # Determine the componentId as the minimum node ID in the component
        component_id = min(component)

        # Add each node in the component to the lists
        for node in component:
            ids.append(int(node))
            component_ids.append(int(component_id))

    # Create a DataFrame from the lists
    wcc_df = pd.DataFrame({
        'id': ids,
        'componentId': component_ids
    })

    return wcc_df

Neo4j

CALL gds.graph.project(
    'myGraph',
    'Person',
    'KNOWS'
)

CALL gds.wcc.stream('myGraph')
YIELD nodeId, componentId
RETURN gds.util.asNode(nodeId).id AS id, componentId
ORDER BY componentId, id