[Question] Which graphx version to compare to?

amplab / graphx

Former GraphX development repository. GraphX has been merged into Apache Spark; please submit pull requests there.

https://github.com/apache/spark

Apache License 2.0


Issue #138 · Closed by bmyerz 10 years ago

bmyerz commented 10 years ago

We are comparing performance on some of the example applications in Analytics.scala. Published results show GraphX PageRank performing within a factor of 2-3 of GraphLab, but I haven't yet been able to get near this performance.

Configuration:

Twitter graph, 1.4B edges
16 nodes
branch: master
command: bin/spark-class -Dspark.executor.memory=52g org.apache.spark.graphx.lib.Analytics <master> pagerank <graph> --tol=0.01 --numEPart=16

Do you have a pointer to which branch and what methodology I should use to replicate these results?

ankurdave commented 10 years ago

Those benchmarks were obtained with the sigmod-pr and sigmod-cc branches. The main optimization in these branches is an in-memory shuffle for Spark, which provided a 2-3x performance improvement for us on EC2.

We recently reran the PageRank and connected components benchmarks with the vldb branch, which also includes this optimization. I suggest using this branch since it's more up to date.

bmyerz commented 10 years ago

Thanks. I'm still short of the performance I expect by a factor of a few. Do you know which options gave the best performance? In particular:

Did you use the provided pagerank/cc implementations that are built on Pregel (as called from Analytics)?
Which type of partitioning did you use?
What values did you use for numEPart and numVPart (as a function of the number of workers)?

ankurdave commented 10 years ago

@dcrankshaw can add details and corrections, but as I recall we used a modified version of the Analytics driver (./bin/spark-class org.apache.spark.graphx.lib.Analytics HOST {pagerank,cc} FILENAME) where we reduced the number of partitions by calling RDD#coalesce on the edge RDD before passing it to the graph constructor.

We experimented with the type and number of partitions and found that fewer partitions were better, and that the optimal partitioning strategy depended on the input. For the number of partitions, we settled on 64 for a cluster of 16 8-core workers. This is fewer than the number of cores (128), so we wasted some compute capacity in exchange for more efficient shuffles. For the partitioning strategy, if the edge list was already sorted on disk, we left the graph in its default partitioning. Otherwise, we used RandomVertexCut.
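For readers trying to reproduce this setup, here is a minimal sketch of the approach described above: read the edge list, coalesce the edge RDD to a small number of partitions, build the graph, and optionally repartition with RandomVertexCut. It is written against the GraphX API as it exists in Apache Spark, not against the sigmod-pr, sigmod-cc, or vldb branches, and it is not the modified Analytics driver used for the benchmarks. The object name `CoalescedPageRank`, the helper `parseEdge`, and the `--random-vertex-cut` flag are hypothetical names chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
import org.apache.spark.rdd.RDD

// Hypothetical driver, not the benchmark code from this thread.
object CoalescedPageRank {
  // Parse one "srcId dstId" line; skip blank lines and '#' comments.
  def parseEdge(line: String): Option[Edge[Int]] = {
    val trimmed = line.trim
    if (trimmed.isEmpty || trimmed.startsWith("#")) None
    else {
      val fields = trimmed.split("\\s+")
      if (fields.length < 2) None
      else Some(Edge(fields(0).toLong, fields(1).toLong, 1))
    }
  }

  def main(args: Array[String]): Unit = {
    require(args.length >= 2,
      "usage: CoalescedPageRank <edgeFile> <numEdgePartitions> [--random-vertex-cut]")
    val edgeFile = args(0)
    val numEdgePartitions = args(1).toInt // e.g. 64 for 16 8-core workers
    val useRandomVertexCut = args.contains("--random-vertex-cut")

    val sc = new SparkContext(new SparkConf().setAppName("CoalescedPageRank"))

    // Coalesce the edge RDD BEFORE it is handed to the graph constructor.
    // Without shuffle = true, coalesce can only reduce the partition count.
    val edges: RDD[Edge[Int]] = sc.textFile(edgeFile)
      .flatMap(line => parseEdge(line).iterator)
      .coalesce(numEdgePartitions)

    val graph = Graph.fromEdges(edges, 1)

    // Keep the default partitioning for edge lists already sorted on disk;
    // otherwise repartition the edges with RandomVertexCut.
    val partitioned =
      if (useRandomVertexCut) graph.partitionBy(PartitionStrategy.RandomVertexCut)
      else graph

    // Run PageRank until convergence with the tolerance used in this thread.
    val ranks = partitioned.pageRank(0.01)
    println(s"Ranked ${ranks.vertices.count()} vertices")

    sc.stop()
  }
}
```

The key ordering is that `coalesce` is applied to the raw edge RDD, so the graph's edge partitions are built from the already-reduced partition set rather than being shrunk afterwards.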

bmyerz commented 10 years ago

Are you referring to something like this? https://github.com/bmyerz/graphx/tree/coalesce-edges I had assumed that parsing the file and using EdgeBuilder would preserve the partitioning set by the minEdgePartitions value passed to textFile.

I haven't exhausted the parameter space, but so far, on 64 nodes, 1 partition per node (on 16-core nodes) is as fast as anything else I've tried.

ankurdave commented 10 years ago

Sorry I didn't get back to you. You actually have to coalesce before building the edge partitions, like this: https://github.com/amplab/graphx2/commit/ad4c87467f06f0bb244e0ce7571bf2bba144ac57

But that probably won't give you more than about a 50% improvement.