gchq / Gaffer

A large-scale entity and relation database supporting aggregation of properties
Apache License 2.0

PageRank implementation #1438

Open t616178 opened 6 years ago

t616178 commented 6 years ago

Provide an implementation of the PageRank algorithm using GraphFrames to demonstrate use of the GraphFrames API as a Gaffer operation.

t616178 commented 6 years ago

We should revert to GraphFrames 0.4.0 to see whether the older version of the library resolves the performance issues found in 0.5.0.

p013570 commented 6 years ago

We should create a new module within the library module: graph-analytic-library. The PageRank operation should go in there.

The PageRank operation should have a generic input type and a generic output type.

The PageRank operation should not have a dependency on Spark or GraphFrames, but should allow an operation handler to implement it in any way it wants. That would make the input and output types unpredictable, so in addition to PageRank we probably need GraphFramesPageRank, so that users can be sure what the input and output types are.

Within the spark module we will also need to add a new module: spark-graph-analytic-library. Then add a GraphFramesPageRank operation in there that extends PageRank<GraphFrame, GraphFrame>. The handler for the GraphFrames operation can also go in this module. Perhaps we should implement two handlers, one for PageRank and one for GraphFramesPageRank. The handler could do:

if (!(operation.getInput() instanceof GraphFrame)) { /* extract a graph frame first */ }
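To make the proposal concrete, here is a minimal sketch of the operation hierarchy and the input check. It is not Gaffer code: `GraphFrame` is a hypothetical stand-in for `org.graphframes.GraphFrame` so the example compiles without Spark, and `PageRankHandler`, `doOperation`, and the string "descriptions" are placeholders for illustration only. It uses a single handler for both operations, which is just one of the options discussed above.

```java
// Hypothetical stand-in for org.graphframes.GraphFrame so the sketch compiles without Spark.
final class GraphFrame {
    final String description;

    GraphFrame(String description) {
        this.description = description;
    }
}

// Technology-neutral operation: no Spark or GraphFrames types appear here.
// In this sketch O only documents the expected result type.
class PageRank<I, O> {
    private I input;

    I getInput() {
        return input;
    }

    void setInput(I input) {
        this.input = input;
    }
}

// Spark-specific operation that fixes both type parameters to GraphFrame.
class GraphFramesPageRank extends PageRank<GraphFrame, GraphFrame> {
}

// One handler serving both operations: it converts the input only when needed.
class PageRankHandler {
    GraphFrame doOperation(PageRank<?, ?> operation) {
        Object input = operation.getInput();
        GraphFrame graphFrame;
        if (!(input instanceof GraphFrame)) {
            // Extract a graph frame first (placeholder for reading elements from the store).
            graphFrame = new GraphFrame("extracted:" + input);
        } else {
            graphFrame = (GraphFrame) input;
        }
        // Placeholder for the real call into the GraphFrames PageRank API.
        return new GraphFrame("pagerank(" + graphFrame.description + ")");
    }
}

class PageRankSketch {
    public static void main(String[] args) {
        PageRankHandler handler = new PageRankHandler();

        GraphFramesPageRank typed = new GraphFramesPageRank();
        typed.setInput(new GraphFrame("g"));
        System.out.println(handler.doOperation(typed).description);   // prints pagerank(g)

        PageRank<String, GraphFrame> generic = new PageRank<>();
        generic.setInput("elements");
        System.out.println(handler.doOperation(generic).description); // prints pagerank(extracted:elements)
    }
}
```

With this shape, a caller of `GraphFramesPageRank` gets compile-time guarantees about the input and output types, while the generic `PageRank` leaves both open to whatever the handler supports.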

Then a Gaffer system could have multiple versions of PageRank using different technologies all available at the same time via different operations. But, if a user doesn't care how the operation is implemented they could just use the top level PageRank operation.
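As a rough illustration of several PageRank implementations coexisting, the sketch below maps each operation class to its own handler. `OperationHandlerRegistry`, `addHandler`, and `execute` are hypothetical names invented for this example and are not Gaffer's actual store API; the operations are empty placeholders and the handlers simply return a label.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Minimal placeholder operations; the real ones would carry input and settings.
class PageRank {
}

class GraphFramesPageRank extends PageRank {
}

// Hypothetical registry: maps an operation class to the handler that executes it.
class OperationHandlerRegistry {
    private final Map<Class<? extends PageRank>, Function<PageRank, String>> handlers = new HashMap<>();

    void addHandler(Class<? extends PageRank> operationClass, Function<PageRank, String> handler) {
        handlers.put(operationClass, handler);
    }

    String execute(PageRank operation) {
        // Look up by the exact operation class so each operation keeps its own implementation.
        Function<PageRank, String> handler = handlers.get(operation.getClass());
        if (handler == null) {
            throw new UnsupportedOperationException(
                    "No handler registered for " + operation.getClass().getSimpleName());
        }
        return handler.apply(operation);
    }
}

class RegistrySketch {
    public static void main(String[] args) {
        OperationHandlerRegistry registry = new OperationHandlerRegistry();
        // The store chooses which technology backs the top-level operation.
        registry.addHandler(PageRank.class, op -> "store default implementation");
        registry.addHandler(GraphFramesPageRank.class, op -> "GraphFrames implementation");

        System.out.println(registry.execute(new PageRank()));            // prints store default implementation
        System.out.println(registry.execute(new GraphFramesPageRank())); // prints GraphFrames implementation
    }
}
```

A user who does not care about the technology submits `PageRank` and gets whatever the store registered for it; a user who needs GraphFrames specifically submits `GraphFramesPageRank`.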

p013570 commented 6 years ago

The work for this issue is basically ready to be merged, except for a performance problem that still needs to be investigated.