Closed bmyerz closed 10 years ago
Those benchmarks were obtained with the sigmod-pr and sigmod-cc branches. The main optimization in these branches is an in-memory shuffle for Spark, which provided a 2-3x performance improvement for us on EC2.
We recently reran the PageRank and connected components benchmarks with the vldb branch, which also includes this optimization. I suggest using this branch since it's more up to date.
Thanks. I'm still a few times slower than the performance I expected. Do you know which options gave the best performance? In particular,
Pregel (called from Analytics)
@dcrankshaw can add details and corrections, but as I recall we used a modified version of the Analytics driver (./bin/spark-class org.apache.spark.graphx.lib.Analytics HOST {pagerank,cc} FILENAME) in which we reduced the number of partitions by calling RDD#coalesce on the edge RDD before passing it to the graph constructor.
We experimented with the type and number of partitions and found that fewer partitions were better, and that the optimal partitioning strategy depended on the input. For the number of partitions, we settled on 64 for a cluster of 16 eight-core workers. This is fewer than the number of cores (128), so we wasted some compute capacity in exchange for more efficient shuffles. For the partitioning strategy, if the edge list was already sorted on disk we left the graph in its default partitioning. Otherwise, we used RandomVertexCut.
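For anyone trying to reproduce this, here is a rough sketch of the approach described above, written against the public GraphX API rather than copied from the modified driver, so treat it as an approximation. The object name CoalescedPageRank, the whitespace-separated "src dst" input format, and the hard-coded 64 partitions and 0.01 tolerance are assumptions for illustration. It is meant to be launched through Spark's usual launcher, which supplies the master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

// Hypothetical driver: not the modified Analytics driver from the thread.
object CoalescedPageRank {
  def main(args: Array[String]): Unit = {
    val path = args(0) // edge list file, one "src dst" pair per line (assumed format)
    val sc = new SparkContext(new SparkConf().setAppName("CoalescedPageRank"))

    // 64 partitions is the value quoted above for 16 eight-core workers.
    val numEdgePartitions = 64

    // Parse the raw edge list, then coalesce BEFORE building the graph,
    // so the edge partitions are constructed from the reduced partition count.
    val edges = sc.textFile(path)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val fields = line.trim.split("\\s+")
        Edge(fields(0).toLong, fields(1).toLong, 1)
      }
      .coalesce(numEdgePartitions)

    // Build the graph from the coalesced edge RDD.
    val unpartitioned = Graph.fromEdges(edges, defaultValue = 1)

    // For input that is not already sorted on disk, repartition the edges
    // with RandomVertexCut; for sorted input, skip this step.
    val graph = unpartitioned.partitionBy(PartitionStrategy.RandomVertexCut)

    // Run PageRank until the ranks change by less than the tolerance.
    val ranks = graph.pageRank(tol = 0.01).vertices
    println(s"Ranked ${ranks.count()} vertices")

    sc.stop()
  }
}
```

The point of the ordering is that coalesce runs on the plain edge RDD, so the graph constructor never sees the larger partition count that textFile produced.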
Are you referring to something like this? https://github.com/bmyerz/graphx/tree/coalesce-edges
I had assumed that parsing and using EdgeBuilder would preserve the partition count given by the minEdgePartitions value passed to textFile.
I haven't exhausted the parameter space, but so far, on 64 nodes with 16 cores each, 1 partition per node is as fast as anything else.
Sorry I didn't get back to you. You actually have to coalesce before building the edge partitions, like this: https://github.com/amplab/graphx2/commit/ad4c87467f06f0bb244e0ce7571bf2bba144ac57
But that probably won't give you more than 50% improvement or so.
We are comparing performance on some of the example applications in Analytics.scala. Results show GraphX PageRank performs within a factor of 2-3 of GraphLab. I haven't yet been able to get near this performance.
Configuration:
bin/spark-class -Dspark.executor.memory=52g org.apache.spark.graphx.lib.Analytics <master> pagerank <graph> --tol=0.01 --numEPart=16
Do you have a pointer to which branch and what methodology I should use to replicate these results?