cwida / duckpgq-extension

DuckDB extension that adds support for SQL/PGQ
https://duckpgq.notion.site/b8ac652667964f958bfada1c3e53f1bb?v=3b47a8d44bdf4e0c8b503bf23f1b76f2
MIT License
86 stars 7 forks source link

PageRank table function #144

Closed Dtenwolde closed 2 months ago

Dtenwolde commented 2 months ago

Fixes #143

This PR introduces the PageRank table function, similar to WCC and LCC.

from pagerank(pg, node_table, edge_table)

Will return a table with the ID and PageRank of the ID. It uses the CSR data structure, similar to the other graph algorithms and iteratively calculates the PageRank. The current implementation is mostly single threaded. The first vector to execute the function will obtain a lock and run until converged. The next vectors do not have to perform any calculation but only get the rank for the ID. If this turns out to be bottleneck in the future, we might need to look into a parallel solution, but for now this suffices.

The default parameters are: Damping factor: 0.85 Tolerance: 1e-6 Initial rank: 1/number of vertices A future PR might add the option to allow users to change these options either through a SET or as arguments for the table function pagerank(pg, node, edge, damping_factor, tolerance)

There are slight discrepancies when comparing the output to NetworkX and Neo4j, but they are generally close. For example SF1 between NetworkX and DuckPGQ:

Mean Absolute Error (MAE): 7.867113585890136e-05
Mean Squared Error (MSE): 5.77749701647327e-08
Root Mean Squared Error (RMSE): 0.00024036424477183102
Maximum Absolute Error: 0.008648239988167099
SF NetworkX Time (s) DuckPGQ Time (s) Speedup (NetworkX / DuckPGQ)
1 0.28 0.044 6.36
3 0.99 0.053 18.68
10 4.26 0.151 28.21