GraphChi / graphchi-cpp

GraphChi's C++ version. Big Data - small machine.
https://www.usenix.org/system/files/conference/osdi12/osdi12-final-126.pdf
803 stars 312 forks source link

GraphChi - disk-based large-scale graph computation

NOTE: This project has been recently moved from Google Code, and some of the wiki pages might be partly broken.

MIT Technology Review article about GraphChi: "Your laptop can now analyze big data"

NEW: Improved performance.

(October 21, 2013) In-edges are now loaded in parallel, improving performance on multicore machines significantly.

NEW: Graph contraction algorithms

Read about graph contraction technique, which we used to implemented efficient minimum spanning forest computation Graph Contraction Algorithms .

Discussion group

http://groups.google.com/group/graphchi-discuss

License

GraphChi is licensed under the Apache License, Version 2.0. Each source code file has full license information.

Introduction

GraphChi is a spin-off of the GraphLab ( http://www.graphlab.org ) -project from the Carnegie Mellon University. It is based on research by Aapo Kyrola ( http://www.cs.cmu.edu/~akyrola/) and his advisors.

GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in the vertex-centric model, proposed by GraphLab and Google's Pregel. GraphChi runs vertex-centric programs asynchronously (i.e changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and removal of edges from the graph. Section 'Performance' contains some examples of applications implemented for GraphChi and their running times on GraphChi.

The promise of GraphChi is to bring web-scale graph computation, such as analysis of social networks, available to anyone with a modern laptop. It saves you from the hassle and costs of working with a distributed cluster or cloud services. We find it much easier to debug applications on a single computer than trying to understand how a distributed algorithm is executed.

In some cases GraphChi can solve bigger problems in reasonable time than many other available distributed frameworks. GraphChi also runs efficiently on servers with plenty of memory, and can use multiple disks in parallel by striping the data.

Even if you do require the processing power of high-performance clusters, GraphChi can be an excellent tool for developing and debugging your algorithms prior to deploying them to the cluster. For high-performance graph computation in the distributed setting, we direct you to GraphLab's new version (v2.1), which can now handle large graphs in astonishing speed. GraphChi supports also most of the new GraphLab v2.1 API (with some restrictions), making the transition easy.

GraphChi is implemented in plain C++, and available as open-source under the flexible Apache License 2.0.

Java version

Java-version of GraphChi: https://github.com/GraphChi/graphchi-java

Publication

GraphChi is part of the OSDI'12 proceedings. PDF of the paper can be downloaded here: https://www.usenix.org/system/files/conference/osdi12/osdi12-final-126.pdf

Slides (OSDI talk): http://www.cs.cmu.edu/~akyrola/files/osditalk-graphchi.pptx

Collaborative Filtering Toolkit

Danny Bickson has ported several collaborative filtering algorithms from GraphLab to GraphChi: http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html

Features

Getting Started

Best way to get started is to start from the https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps page. Prior to that, you need to download the source code (no configuration or installation is required).

For an introduction on writing your own applications, read Creating-GraphChi-Applications.

Problems compiling on Mac

First, try: xcode-select --install

If compiler complains about missing "omp.h" (OpenMP library), here is a way you can install it:

A) Direct installation of binary (on Yosemite)

See intructions in https://wiki.helsinki.fi/display/HUGG/Installing+the+GNU+compilers+on+Mac+OS+X#InstallingtheGNUcompilersonMacOSX-InstructionsforMacOS10.10(Yosemite)withXcode6

B) Using Homebrew (Contributed by Jose Pablo Gonzalez): 1) Install homebrew ( http://brew.sh ) 2) brew tap homebrew/versions 3) Install a compiler: brew install apple-gcc42 4) Modify the Makefile to use the new compiler: CPP = g++-4.2 5) make

NOTE: you might want to use newer compiler version.

How GraphChi works

GraphChi is based on the Parallel Sliding Windows method which allows efficient asynchronous processing of mutable graphs from disk. See Introduction-To-GraphChi for description.

Performance

NOTE: Following performance figures are somewhat outdated. Current performance should be better.

In the table below, we have picked some recent running time results for large-scale graph problems from the literature, and run the same experiment using GraphChi. For GraphChi, we used a Mac Mini (2012 model), with 8 gigabytes of RAM and 256 gigabyte SSD drive.

While distributed clusters can solve the same problems faster than GraphChi on a single computer, for many purposes GraphChi's performance should be adequate. The numbers below do not include time for transferring the input to cloud or cluster, and usually do not include the graph loading time. GraphChi's running times include loading the graph and saving the results, but not preprocessing time. Preprocessing needs to be done only once per graph (you can run many different algorithms on the same preprocessed graph). The preprocessing times are listed in a separate table below.

Application Input graph Graph size Comparison GraphChi on Mac Mini (SSD) Ref
Pagerank - 3 iterations twitter-2010 1.5B edges Spark, 50 machines, 8.1 min 13 min 1
Pagerank - 100 iterations uk-union 3.8B edges Stanford GPS (Pregel), 30 machines, 144 min 581 min 2
Web-graph Belief Propagation (1 iter.) yahoo-web 6.7B edges Pegasus, 100 machines, 22 min 27 min 3
Matrix factorization (ALS), 10 iters Netflix 99M edges GraphLab, 8-core machine, 4.7 min 9.8 min 4
Triangle counting twitter-2010 1.5B edges Hadoop, 1636 machines, 423 mins 55 min 5

Performance on hard drive: We repeated the same experiments on hard-drive, and the running times are roughly double compared to SSD. The hard-drive performance is thus sufficient for many purposes.

Configuration: membudget_mb 3000 execthreads 2 loadthreads 2 niothreads 2 io.blocksize 1048576

1I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. 2012.
2 S.Salihoglu and J.Widom. GPS: A Graph Processing System. Technical Report, pages 1–32, Apr. 2012.
3 U. Kang, D. H. Chau, and C. Faloutsos. Inference of Beliefs on Billion-Scale Graphs. KDD-LDMTA’10, pages 1–7, June 2010.
4 http://graphlab.org/datasets.html (Retrieved June 30, 2012)
5 S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the 20th international conference on World wide web, pages 607–614. ACM, 2011.

Preprocessing times

GraphPreprocessing time
Netflix1 min
Twitter-201010 min
uk-union33 min
yahoo-web37 min