
Streaming Graph Partitioning

An experimental framework for analyzing the performance of graph partitioning algorithms on graph computations. This framework was developed as part of an experimental study of streaming algorithms for graph partitioning. For more details, please refer to our SIGMOD'19 paper:

Anil Pacaci and M. Tamer Özsu. 2019. Experimental Analysis of Streaming Algorithms for Graph Partitioning. 
In 2019 International Conference on Management of Data (SIGMOD ’19), 
June 30-July 5, 2019, Amsterdam, Netherlands. 
ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3299869.3300076

Organization

This testbed has two main components, each with its own infrastructure based on the target workload: a PowerLyra cluster for offline graph analytics and a JanusGraph cluster for online graph queries.

PowerLyra is a distributed graph computation engine that employs differentiated computation and partitioning on skewed graphs. We added an edge-cut-based computation engine and the set of partitioning algorithms used in the paper to enable a fair comparison between the edge-cut and vertex-cut partitioning models. The modified version of PowerLyra used in our SIGMOD'19 paper can be accessed here.

JanusGraph is a distributed graph database optimized for storing and querying graphs across a multi-machine cluster. We modified JanusGraph's explicit partitioning option so that the vertex-to-partition mapping can be provided by a third-party algorithm. We use a Cassandra backend with the range-partitioning ByteOrderedPartitioner. The modified version of JanusGraph used in our SIGMOD'19 paper can be accessed here.

All components are dockerized, and bash scripts are provided to automate initialization, running experiments, and related tasks.

Prerequisites

It is recommended to use Ubuntu 16.04 LTS.

Please refer to PowerLyra and JanusGraph documentation for manual setups.

For the automated Docker setup, a cluster of machines running Docker 18.09.5 is required.

Please make sure that Docker can be used as a non-root user on all machines in the cluster; for details, see: Docker
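
Per Docker's post-installation documentation, this typically amounts to adding each user to the docker group (log out and back in for the change to take effect):

$ sudo groupadd docker
$ sudo usermod -aG docker $USER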

In addition, the automated scripts assume that passwordless ssh is set up between all machines in the cluster; for more details on how to set up passwordless ssh, see: passwordless-ssh.
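
One common way to set this up with OpenSSH, where user and worker-1 are placeholders and the ssh-copy-id step is repeated for every machine in the cluster:

$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
$ ssh-copy-id user@worker-1
$ ssh user@worker-1 hostname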

Datasets

We use the following datasets in our experiments in the SIGMOD'19 paper:

Dataset            Edges    Vertices   Max Degree   Type
Twitter            1.46B    41M        2.9M         Heavy-tailed
UK2007-05          3.73B    105M       975K         Power-law
US-Road            58.3M    23M        9            Low-degree
LDBC SNB SF1000    447M     3.6M       3682         Heavy-tailed

Original Download Links:

All datasets are in edge-list format by default. Edge-cut streaming graph partitioning algorithms require an adjacency-list representation, where each line represents a vertex and all of its outgoing edges:

vertex_id neighbour_1_id neighbour_2_id ... neighbour_n_id

We provide Spark scripts to convert from edge-list format to adjacency-list format; they simply use groupByKey to group all edges by their source vertex. See csv-converter. A minimal sketch of the idea is shown below.
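
For illustration, here is a minimal Spark (Scala) sketch of this conversion. The input/output paths are placeholders, the input is assumed to be whitespace-separated source target pairs, and the actual csv-converter script may differ in its details:

import org.apache.spark.sql.SparkSession

object EdgeListToAdjacency {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("EdgeListToAdjacency").getOrCreate()
    val sc = spark.sparkContext

    // Parse whitespace-separated "source target" pairs into (source, target) tuples.
    val edges = sc.textFile("hdfs:///datasets/twitter-edge-list.txt")
      .map(_.split("\\s+"))
      .map(parts => (parts(0), parts(1)))

    // Group all outgoing edges by their source vertex, then emit one
    // "vertex_id neighbour_1_id ... neighbour_n_id" line per vertex.
    val adjacency = edges.groupByKey()
      .map { case (vertex, neighbours) => (vertex +: neighbours.toSeq).mkString(" ") }

    adjacency.saveAsTextFile("hdfs:///datasets/twitter-adjacency-list")
    spark.stop()
  }
}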

Note: We only use the friendship network (the person_knows_person relationship) of the LDBC synthetic dataset. The file with the friendship information can be found at social_network/person_knows_person_0_0.csv in the generated dataset.

Offline Graph Analytics

container/analytics contains the Dockerfile and the files necessary to build the PowerLyra Docker image. The image is based on the modified version of PowerLyra used in our SIGMOD'19 paper. In a nutshell, it is a base Ubuntu 16.04 image with OpenMPI libraries and the PowerLyra binary executables compiled from the modified codebase. In addition, it includes an ssh configuration to enable passwordless communication between PowerLyra Docker instances, and utility scripts to collect and process system logs.

scripts/analytics contains analytics.sh, the bash script that automates experiments, and analytics.conf, configuration parameters necessary to execute offline analytics experiments.

Parameter Setup

The main configuration file is scripts/analytics/analytics.conf, which is read by the analytics.sh script and by containers/analytics/docker-compose.yml.

The following parameters have default values and are therefore not recommended to change. Any change must be propagated to containers/analytics/docker-compose.yml.

The following parameters have to be configured by the user.

containers/analytics/docker-compose.yml contains three parameters that have to be configured by the user. These are the directories on the host filesystem that will be mounted into the PowerLyra containers (see the sketch below):
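
For orientation, here is a hedged sketch of what such volume definitions might look like with Docker's local driver and bind mounts. The volume names follow the ones referenced in this document (parameters, datasets, results); the host paths are placeholders that must be adapted to your setup:

volumes:
  parameters:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /home/user/sgp/parameters   # experiment configuration files
  datasets:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /home/user/sgp/datasets     # graph datasets
  results:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /home/user/sgp/results      # logs and measurements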

Note: For any configuration file or dataset, always use the relative path under the directories specified above.

Experiment Setup

Experiments are configured using a JSON configuration file under the directory specified by the volumes.parameters.driver_opts.device setting. Each configuration file defines an experiment over a single dataset, with multiple values for all other parameters. An example configuration:

{
    "name" : "soc-pokec",
    "runs" : {
        "dataset-name" : "soc-pokec"
        "snap-dataset" : "soc-pokec/soc-pokec-snap-combined.txt",
        "adj-dataset" : "soc-pokec/soc-pokec-adjacency-combined.txt",
        "log-folder" : "soc-pokec-logs",
        "result-file" : "soc-pokec-test-run",
        "nedges" : "30622564",
        "nvertices" : "1432693",
        "worker-per-node" : 1,
        "workers" : [4,8,16],
        "ingress" : [
            {"name" : "hdrf"},
            {"name" : "random"},
            {"name" : "fennel"}
        ],
        "algorithm" : [
            {
                "name" : "pagerank",
                "iterations" : 20
            },
            {
                "name" : "sssp",
                "source": 1
            },
            {"name" : "connected_component"}
        ]
    }
}

This configuration file defines 18 individual runs over the soc-pokec dataset. snap-dataset, adj-dataset, log-folder, and result-file are all relative paths under the directories defined in the docker-compose.yml file. The experiment uses the HDRF, Random, and FENNEL partitioning algorithms with the PageRank and Weakly Connected Components algorithms. For each combination, three different cluster sizes are defined via the workers parameter.

The configuration files used for the experiments in the SIGMOD'19 paper are provided under containers/analytics/parameters:

Run Experiments

First, change the working directory to scripts/analytics and make sure that swarm.conf is defined in the same directory.

$ ./analytics.sh build

This starts a private Docker service registry on the master machine, then builds the PowerLyra Docker image and deploys it to this registry, from which the worker nodes can pull the built container.

$ ./analytics.sh start

Start a PowerLyra instance on each node defined in DOCKER_WORKER_NODES.

$ ./analytics.sh run config_file

Run a set of experiments defined in config_file. Note that config_file must be a relative path under volumes.parameters.driver_opts.device directory.

$ ./analytics.sh plot [config_file1, config_file2, config_file3]

Run gnuplot scripts to generate plots from the logs of the specified runs. The config files are the ones used in the previous step.

$ ./analytics.sh stop

Stop the PowerLyra containers and the PowerLyra service in the Swarm cluster.

$ ./analytics.sh destroy

Removes the overlay network and forces each node in the cluster to leave the swarm.

Manual Runs

If you want to add additional partitioning algorithms and/or offline graph analytics tasks, you should modify the PowerLyra codebase. More instructions on how to compile/build PowerLyra can be found here.

It is possible to use the Docker images and scripts provided in this repository with a modified PowerLyra version. Once the modified PowerLyra codebase is compiled and the new binaries are generated, deploy the release binaries under the containers/analytics/powerlyra folder, then follow the steps described above to build the Docker images and run experiments.

Online Graph Queries

The setup used in our SIGMOD'19 paper consists of a JanusGraph instance (version 0.3.0) and a Cassandra instance (version 2.1.9) on each node in the cluster.

JanusGraph does not support custom graph partitioning algorithms, so we modified it to use external partitioning information (provided by a file, loaded into Memcached, etc.) during the graph loading phase. You can access the modified JanusGraph source code here, and the TinkerPop Gremlin drivers here.

We provide trivial implementations of edge-cut SGP algorithms (Hash, FENNEL, and LDG). First, set the configuration parameters under src/main/resources/sgp.properties. Then run the following to produce a partitioning mapping that is ready to be used by the JanusGraph scripts:

$ mvn clean assembly:assembly
$ mvn exec:java -Dexec.mainClass="ca.uwaterloo.cs.sgp.streaming.EdgeCutSGP" -Dexec.args="sgp.properties"

container/interactive contains the Dockerfile and the files necessary to build the JanusGraph + Cassandra Docker images. The image is based on the modified version of JanusGraph used in our SIGMOD'19 paper. In a nutshell, it is a base Cassandra 2.1 image with the JanusGraph binaries compiled from the modified codebase. In addition, it includes an ssh configuration to enable passwordless communication between JanusGraph Docker instances, and utility scripts to collect and process system logs.

scripts/interactive contains interactive.sh, the bash script that automates experiments, and interactive.conf, the configuration parameters necessary to execute online graph query experiments.

Parameter Setup

Similar to Offline Graph Analytics, the main configuration file is scripts/interactive/interactive.conf, and it is read by interactive.sh.

The following parameters have default values and should NOT be modified. Any change must be propagated to the Docker compose files under containers/interactive.

The following parameters have to be configured by the user:

containers/interactive/docker-compose.yml contains three parameters that have to be configured by the user. These are the directories on the host filesystem that will be mounted into the JanusGraph containers (as in the analytics setup above):

Note: For any configuration file or dataset, always use the relative path under the directories specified above.

Experiment Setup

Experiments are configured using a configuration file under the directory specified by the volumes.parameters.driver_opts.device setting. Unlike Offline Graph Analytics, interactive workloads require the JanusGraph database to be populated before an experiment can be executed. Therefore, each (dataset, partitioning algorithm, partition count) triple requires a separate configuration file and data loading. An example configuration file looks like:

graph.snb = true
graph.name = ldbc
input.base = ldbc/sf10_friendship_adjacency/
memcached.address = localhost:11211
thread.count = 32
batch.size = 50000
reporting.period = 60
partition.ingress = hash
partition.count = 4
partition.lookup = ldbc/sf1000_friendship_adjacency/part-00000.shuffle.hash.p4

Both input.base and partition.lookup are relative paths under volumes.datasets.driver_opts.device.

To populate a JanusGraph database from a given configuration file, execute:

$ ./interactive.sh import config_file

Once the JanusGraph database is populated, execute the following to run the interactive workloads used in the paper:

$ ./interactive.sh run config_file

This command first executes the entire workload to make sure that the caches are warm, then runs both 1-hop and 2-hop traversal workloads under medium and high load, and records all measurements under the result volume.

Run Experiments

First, change the working directory to scripts/interactive and make sure that interactive.conf is defined in the same directory.

$ ./interactive.sh usage

$ ./interactive.sh init

This starts the Swarm leader on the given node, then adds every node in DOCKER_WORKER_NODES as a Swarm worker. In addition, it creates an overlay network that connects the JanusGraph instances.

$ ./interactive.sh build

This starts a private Docker service registry on the master machine, then builds the JanusGraph Docker image and deploys it to this registry, from which the worker nodes can pull the built container.

$ ./interactive.sh start

Start a Cassandra instance on each node defined in DOCKER_WORKER_NODES. Once the Cassandra cluster is up and running, a JanusGraph instance is configured on each node.

$ ./interactive.sh import config_file

Populate a JanusGraph database from the given configuration file. Note that this step might take a long time depending on the graph size, and it needs to be repeated for each data point (partitioning algorithm, partition count, and dataset).

$ ./interactive.sh run config_file

Run all experiments and record measurements on the database defined by the config_file in the previous step. Note that config_file must be a relative path under the volumes.parameters.driver_opts.device directory.

$ ./interactive.sh plot [config_file1, config_file2, config_file3]

Run gnuplot scripts to generate plots from the logs of the specified runs. The config files are the ones used in the previous step.

$ ./interactive.sh stop

Stop the JanusGraph containers and the JanusGraph services in the Swarm cluster.

$ ./interactive.sh destroy

Removes the overlay network and forces each node in the cluster to leave the swarm.

Manual Runs

In each JanusGraph instance, explicit partitioning should be enabled by setting:

ids.placement=org.janusgraph.graphdb.database.idassigner.placement.PartitionAwarePlacementStrategy
ids.placement-history=[memcached|inmemory]
ids.placement-history-hostname=192.168.152.153:11211
ids.loader-mapping=[memcached|inmemory]
ids.loader-mapping-hostname=192.168.152.153:11211

This enables the graph loaders to use an explicit mapping loaded into Memcached. This way, any partitioning scheme produced by an external algorithm can be enforced on JanusGraph's vertex-to-machine allocation.
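
Purely as an illustration of that idea, the following sketch pushes a file of vertex_id partition_id pairs (such as the part-00000.shuffle.hash.p4 lookup file above) into Memcached using the standard memcached text protocol. The key/value layout here is hypothetical; match it to whatever the modified JanusGraph loader actually expects:

# hypothetical layout: key = vertex_id, value = partition_id
while read -r vertex partition; do
    printf 'set %s 0 0 %s\r\n%s\r\n' "$vertex" "${#partition}" "$partition"
done < part-00000.shuffle.hash.p4 | nc 192.168.152.153 11211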

In addition, the JanusGraph client driver uses a partition-aware routing strategy (the modified TinkerPop library with the partition-aware router can be found here) that routes queries based on the location of the query's start vertex. This is necessary because JanusGraph routes queries to a random instance by default.

Cassandra instances are configured to use ByteOrderedPartitioner to enable range partitioning. By default, JanusGraph divides the entire Cassandra key space into equal partitions and computes an id from a particular partition's range to enforce vertex placement on that partition. To set ByteOrderedPartitioner, change conf/cassandra.yaml under the Cassandra installation on each node:

partitioner: org.apache.cassandra.dht.ByteOrderedPartitioner

scripts/interactive contains the following scripts for graph loading:

scripts/interactive/snb.conf contains the parameters for executing the loader scripts: