Graph data structure and algorithm design

ViralBShah commented 11 years ago

There is a discussion on graph data structure and algorithm design that started out at:

https://github.com/JuliaLang/METADATA.jl/pull/91

I am opening this issue to bring that discussion here.

Cc: @daviddelaat @lindahua

daviddelaat commented 11 years ago

@lindahua The Boost Graph library is great, but wouldn't we need multiple inheritance for that?

lindahua commented 11 years ago

I was actually suggesting taking the approach that decouples the graph concepts / types and graph algorithms, such that one can write generic algorithms work on multiple graph types (that implement some concept) or write a specific graph type to fit his need, but have generic algorithms working on it.

In terms of concept/type hierarchy, BGL did use a multiple inheritance hierarchy. But we can use Union as work-around.

The BGL concept hierarchy as shown in Figure 1 in (http://www.boost.org/doc/libs/1_53_0/libs/graph/doc/graph_concepts.html) can be expressed in Julia, as

abstract Graph

abstract IncidenceGraph <: Graph
abstract AdjacencyGraph <: Graph
abstract AdjacencyMatrix <: Graph
abstract BidirectionalGraph <: IncidenceGraph

abstract VertexListOnlyGraph <: Graph   # only provides the vertex list
abstract EdgeListOnlyGraph <: Graph     # only provides the edge list
abstract VertexAndEdgeListGraph <: Graph  # provides both vertex and edge list 
typealias VertexListGraph Union(VertexListOnlyGraph, VertexAndEdgeListGraph) 
typealias EdgeListGraph Union(EdgeListOnlyGraph, VertexAndEdgeListGraph)

I think this nearly reproduces the BGL concept hierarchy. The only issue is the relations between VertexListGraph, EdgeListGraph, and VertexAndEdgeListGraph. In BGL, they use multiple inheritance, but here I proposed to use Julia's Union to express similar relations.

For graph type author, they can choose whether to declare their type as a sub-type of VertexListOnlyGraph (if it makes sense to only provide the vertex list) or a sub-type of VertexAndEdgeListGraph (if it is efficient to provide both). For algorithm author that write algorithms relying on vertex lists, they should write function based on VertextListGraph.

lindahua commented 11 years ago

BGL provides a plethora of graph algorithms. Implementing such algorithms in an efficient way would take a lot of efforts. If we write the graph types using BGL's interfaces. Those algorithms can be ported to Julia much more easily.

daviddelaat commented 11 years ago

The union trick is very nice! This indeed comes very close to the BGL hierarchy. With this notation the graph type I had in the Networks.jl package would look something like this (see below), and the graph type in the Graphs.jl package could inherit EdgeListOnlyGraph. I think it would also be nice to incorporate the PropertyMap thing BGL has.

module Graphs

import 
    Base.start, 
    Base.done, 
    Base.next

export 
    Graph,
    IncidenceGraph, AdjacencyGraph, AdjacencyMatrix, BidirectionalGraph,
    VertexListOnlyGraph, EdgeListOnlyGraph, VertexAndEdgeListGraph, 
    VertexListGraph, EdgeListGraph,

    AdjacencyList,

    adj,
    has_vertex,
    add_vertex!,
    add_edge!

abstract Graph

abstract IncidenceGraph <: Graph
abstract AdjacencyGraph <: Graph
abstract AdjacencyMatrix <: Graph
abstract BidirectionalGraph <: IncidenceGraph

abstract VertexListOnlyGraph <: Graph     # only provides the vertex list
abstract EdgeListOnlyGraph <: Graph       # only provides the edge list
abstract VertexAndEdgeListGraph <: Graph  # provides both vertex and edge list 
typealias VertexListGraph Union(VertexListOnlyGraph, VertexAndEdgeListGraph) 
typealias EdgeListGraph Union(EdgeListOnlyGraph, VertexAndEdgeListGraph)  

type AdjacencyList{V, E} <: VertexListOnlyGraph
    adj::Dict{V, Dict{V, E}}
end

AdjacencyList(V,E) = AdjacencyList{V,E}(Dict{V,Dict{V,E}}())

adj(g::AdjacencyList) = g.adj

start{V,E}(g::AdjacencyList{V, E}) = start(g.adj)[1]
done{V,E}(g::AdjacencyList{V, E}, state) = done(g.adj, state)
function next{V,E}(g::AdjacencyList{V, E}, state)
   n = next(g.adj, state)
   n[1][1], n[2]
end

has_vertex{V,E}(g::AdjacencyList{V,E}, u::V) = has(adj(g), u)

function add_vertex!{V,E}(g::AdjacencyList{V,E}, u::V)
    if !has_vertex(g, u)
        adj(g)[u] = Dict{V, E}()
    end
    u
end

function add_edge!{V,E}(g::AdjacencyList{V,E}, u::V, v::V, e::E)
    add_vertex!(g, u)
    add_vertex!(g, v)
    adj(g)[u][v] = e
    adj(g)[v][u] = e
end

end

ViralBShah commented 11 years ago

Also cc: @jeffbezanson, @stefankarpinski

lindahua commented 11 years ago

@daviddelaat What you outlined above is pretty much what I was thinking about.

More thoughts about the package structure:

put the abstract graph types to a separated file (say abstract_graph.jl) with a clear document that explains what functions should be implemented for each abstract type.
then, we have a set of files that implement graph types, and another set of files that implement generic algorithms.

The property_map thing in Boost is like a C++ work around to provide named arguments. I think we should probably use Julia's approach here (use Options?)

Actually, there is another package that needs to be ported. The heaps -- this is vital for implementing for many important graph algorithms. For example, Dijkstra shortest path, Prim's algorithm on Minimum spanning tree, etc are relying on a heap. Fibonacci heap is generally recommended for these things, as it has lower amortized cost than say binary heap, but it is not as easy to implement.

Perhaps, we may port Boost.Heap to Heap.jl

One of the advantage of Julia is that we can implement such data structures and algorithms quite efficiently. (I can't really imagine someone can vectorize such things in MATLAB). Once we have these ready, Julia can clearly demonstrate its advantage over other technical computing languages (e.g. MATLAB and Python).

ViralBShah commented 11 years ago

See extras/heap.jl, which can serve as a starting point.

Fast operations on user defined data structures is really where julia shines. I did a bunch of parallel graph algorithms using distributed sparse matrices in Matlab + Star-P for my thesis, but the vectorization only takes you so far.

ViralBShah commented 11 years ago

It may also be useful to provide ref() and assign() to pull out edges, pull out vertices, add new ones, etc.

lindahua commented 11 years ago

Most graph algorithms don't actually rely on being able to do random access to pick individual vertices or edges.

What those concepts require is that you can provide an iteration interface (use begin() and end() to get forward Iterator in C++) so that the algorithm can traverse all vertices/edges. In Julia, we may provide something iterable instead.

So, basically we can have something like this

for v in vertices(g)
     # do something on vertex v
end

If it happens that vertices(g) is an array or something similar, we can definitely get individual vertex by writing vertices(g)[i]. But the general requirement should be that vertices(g) is iterable.

ViralBShah commented 11 years ago

The ability to do random indexing is good for interactive exploration of a graph, where you may pick a subgraph and examine it further, for example. In any case, I agree that vertices(g) should be iterable.

daviddelaat commented 11 years ago

Thinking about this a bit more. How do we specify that a function requires the graph to inherit more than one of the above abstract types. The breadth first search algorithm, for instance, requires the graph on which it operates to be both an IncidenceGraph and VertexListGraph. There is no Intersection function, right?

lindahua commented 11 years ago

In development of probabilistic inference stuff, I got to a point that needs the tree & graph data structure.

In particular, I need an efficient graph representation that does not explicitly store a vertex set (instead using 1:n by default to refer to vertices).

I am going to fork this and create a BGL branch to experiment with the BGL concepts along the development of some more graph types. When it is ready, I will do a pull request.

CC: @johnmyleswhite @ViralBShah @daviddelaat

johnmyleswhite commented 11 years ago

Sounds like a great idea to me.

daviddelaat commented 11 years ago

Sounds good! I'm definitely interested in seeing the BGL concepts modeled in Julia.

lindahua commented 11 years ago

I outlined the interface specification here: https://github.com/lindahua/Graphs.jl/blob/bgl/Interfaces.md

I continue to work on several specific graph types, and will wrap John's original graph types under this interface. Let me know if you have any feedback.

lindahua commented 11 years ago

There are some problems that I would like to listen to more suggestions before proceeding:

It is very common for a specific graph type to implement multiple concepts -- for example -- an incidence graph can also provide a list of edges / vertices. Or something can be simultaneously a tree and with an incidence graph.

In a more abstract level. We can decompose things into several orthogonal aspects: A, B, C, D. A particular type can provide some combination of them. I would like to express

abstract A
abstract B
abstract C
abstract D <: C
...

type G1 <: A, B
type G2 <: A, B, C
type G3 <: B, D

Julia does not allow to express this (neither does C++ actually -- C++ concept has not been accepted to standard).

What BGL did is to just document the interface of each concept, and document which concept a specific graph type implements. Instead of using the hierarchical system.

What about we do something similar. Only defines a AbstractGraph that everything should inherit. Then, just standardize the interface requirements in documents, and documents what interfaces a specific type provides, instead of constructing a hierarchy of Abstract types.

Graph algorithms always take arguments of type AbstractGraph and specify in document what combination of concepts it requires.

CC: @johnmyleswhite @ViralBShah @daviddelaat

ViralBShah commented 11 years ago

It seems reasonable to do what BGL has done. The hierarchy allows reuse of code - but in this case, the data structures actually change, requiring much of the interface to be rewritten anyways. I feel that it is best to just have AbstractGraph for now, and a bunch of other graph types that fully implement that interface.

ViralBShah commented 11 years ago

Thinking about sequencing the implementation, I would prefer that we have one general purpose but complete implementation for a specific type of graph, rather than partial implementations of lots of graph types.

lindahua commented 11 years ago

Update:

Following BGL convention, interfaces have been standardized. Instead of using a complex type hierarchy, the current system only uses one abstract type AbstractGraph.

Concrete graph types should be derived from AbstractGraph. In addition, I defined two macros: @graph_implements which declares which concepts a graph type implements, and a macro @graph_requires for checking whether a graph satisfies concept requirement. Here is some example to demonstrate how they are used

type MyGraph <: AbstractGraph
       ...  # some definitions
end

# declares that MyGraph implements all interfaces required by 
# vertex list, edge list, and adjacency graph
@graph_implements MyGraph vertex_list edge_list adjacency_graph

# on algorithm part
function bfs_traverse(g::AbstractGraph, ... )
    # check whether g implements the required concepts
    @graph_requires g vertex_list adjacency_list
     ....
end

In BGL, all requirements are only documented, and it does not actually enforce it through function interface. Therefore, when some requirements are violated, the errors will only be raised deep inside the algorithm implementation, usually resulting in miles-long mysterious error messages. The @graph_requires macro can detect and report such violations much earlier.

In addition, I implemented two generic graph types AdjacencyList and IncidenceList.

AdjacencyList is a generic type as

type AdjacencyList{V, VList, AList}
# V:     vertex type
# VList: the type of vertex list
# AList: the type of adjacency list. 
# Suppose a is an instance of AList, then a[v] is an iterable container or vertices.

This definition is very flexible, and can achieve different goals by using different parameters. For example,

AdjacencyList{Int, Range1{Int}, Vector{Vector{Int}}}

is very efficient. Here you simply use 1:n as a vertex list and does not actually store the vertex set, and retrieval of the neighbor list of a specific point is just an array index. My benchmark shows that with this representation, one can visit over one billion neighbor nodes per second (on my MacBook Pro).

In some cases, one might want to use symbols as vertices, and use dictionary to retrieve neighbor list of a specific vertex, then the type can be written as

AdjacencyList{Symbol, Set{Symbol}, Dict{Symbol, Vector{Symbol}}}

This representation contains more information, but is much less efficient. Therefore, with one generic type, one can achieve different trade-off between various goals.

IncidenceGraph is also a generic type with several type parameters. The difference is that each vertex maintains a list of incident edges (instead of neighboring vertices). This type is suitable for cases where edges are associated with important information.

A set of friendly constructing functions are provided such that users need not write such long type names in practice.

I believe some of the original graph types in this package can be wrapped as special case of these generic types.

I will play with the interfaces by implementing several graph algorithms.

ViralBShah commented 11 years ago

AdjacencyList{Int, Range1{Int}, Vector{Vector{Int}}} is essentially the same as SparseMatrixCSC. It would be great to see if code can be refactored and shared. Perhaps we should implement the benchmarks from the Graph500 benchmark. IIRC, one of them is pure BFS. Also, when I get back to sparse matrix stuff, it would be nice to compare performance of the sparse data structure with the graph data structure.

tautologico commented 11 years ago

It would be nice if Julia had some way of defining mixins, relying on macros to verify if a type implements some interface is definitely a kludge.

dgleich commented 11 years ago

Just wanted to point a few references which might provide a starting/point or point of comparison with sparse matrices.

https://github.com/dgleich/gaimc - graph algorithms in Matlab code basically just applies a few graph algorithms on top of the sparse matrix type.

For the Matlab code, we have to extract the CSR arrays ourselves (sparse_to_csr), but for Julia, you could use the internal ones directly. At one point in Matlab's JIT history, I got 4-times the speed of the BGL library below...

https://github.com/dgleich/matlab-bgl/tree/master/libmbgl - this is just a wrapper around the BGL functions that exports a more "c"-oriented interface on the CSR arrays for a sparse matrix. This has proven really annoying to maintain.

If you are looking for a better library to use, checkout igraph. That is a wonderful set of network analysis functions.

ViralBShah commented 11 years ago

Just want to point out:

http://graphlab.org http://kdt.sourceforge.net/wiki/index.php/Documentation http://gauss.cs.ucsb.edu/~aydin/CombBLAS/html/index.html

lindahua commented 11 years ago

Thanks for all the great references.

Under current design, it is very easy to wrap SparseMatrixCSC into a graph by simply wrapping it into a light weight wrapper type, and providing several basic interface. Then all algorithms that work on other graphs would work on sparse matrices.

I have used the MATLAB-BGL before, and I agreed @dgleich that there is considerable overhead at the boundary between MATLAB and the underlying BGL library.

Here, we are taking an approach that implements the Graph library in Julia rather than interfacing with a C library. This has several benefits:

Julia is able to handle complex data structure (such as graphs) efficiently. With proper implementation, LLVM that underlies Julia should be able to generate codes that are comparable to BGL(C++) itself ( ... well ..., "comparable" means within 2x, LLVM is not as aggressive in inlining some parts of the codes as a C++ compiler with -O3). This is a library with which we can demonstrate a key advantage of Julia to people (no one would be able to implement a graph class as efficient in pure Python/MATLAB)
The current design is generic and extensible (with pure Julia implementation). It is very easy for people to use their own vertex type, edge type, and even customized graph types. People can provide their own visitors in various graph algorithms. The generic graph algorithms apply as long as the interface functions are properly implemented.

By wrapping a C/C++ library, you may only expose a limited number of vertex types, edge types, and graph types. But the extensibility of current design is virtually unlimited.

Having said that, it may took a little bit more time to get this up and working (wrapping a C library is generally easier though). The current plan is to "translate" BGL codes (implement things in Julia, but take BGL and other open source libraries as reference for implementation).

I am now on an urgent deadline on mid April (ICCV 2013). And will resume the implementation after that.

ViralBShah commented 11 years ago

I agree with the general direction. This will also push the development of the language and the compiler. Having a language with first class support for regular and irregular data structures will be really awesome.

GlenHertz commented 11 years ago

Hi,

This looks great. I was wondering if the current plan covers the support of hypergraphs?

Thanks,

Glen

On Tue, Mar 26, 2013 at 9:43 AM, Viral B. Shah notifications@github.comwrote:

I agree with the general direction. This will also push the development of the language and the compiler. Having a language with first class support for regular and irregular data structures will be really awesome.

— Reply to this email directly or view it on GitHubhttps://github.com/johnmyleswhite/Graphs.jl/issues/16#issuecomment-15458671 .

lindahua commented 11 years ago

Latest updates:

The interface has been stabilized.

Classic graph algorithms have been implemented or are being implemented:

Tested & committed:
- BFS (supporting visitor)
- Geodesic distance (BFS using a visitor that keeps track of distances)
- DFS (supporting visitor)
- Detection of loops (based on DFS)
- Topological sort (based on DFS)
Implemented & being tested:
- Dijkstra algoritme (shortest paths from single source (formed by a vertex or a grouped of vertices))
- Connected components (over undirected graphs)
To be implemented today:
- Prim's algorithm for minimum spanning tree
- Kruskal's algorithm for minimum spanning tree/forest
- Floyd-Warshall's algorithm to find shortest distances between all pairs of vertices

Note that all these algorithms are implemented based on generic interfaces, so all of them can be applied to any graph classes that provide required set of interfaces.

I will propose to merge this branch once this basic set of algorithms are implemented and properly tested & a brief document is ready.

johnmyleswhite commented 11 years ago

That's great news. Thanks so much for your work on this, Dahua!

ViralBShah commented 11 years ago

This is pretty awesome. That's quite a decent set of algorithms.

One of the things I want to be able to do is call some of these algorithms on SparseMatrixCSC, and without making copies. Is that going to be possible?

lindahua commented 11 years ago

All algorithms listed above have been implemented and tested.

Currently, there are three types: EdgeList, AdjacencyList, and IncidenceList.

I am going to add a special type to represent the original graph type in this package, the one that allows attributes attached to vertices or edges.

SparseMatrixCSC can enjoy the functions here via a thin wrapper (without copying) that exposes the required interfaces (without copying).

lindahua commented 11 years ago

@johnmyleswhite would you please grant me the commit privilege to this package?

I would like to first move the BGL branch here, which would make things easier. I will request comments & review when I think BGL is ready, and will merge it to the master when people think it is good to merge.

johnmyleswhite commented 11 years ago

Done. Sorry that I hadn't realized you didn't have commit access before.

lindahua commented 11 years ago

Almost done. I am writing up a brief document.

rsofaer commented 11 years ago

Here are some things that could be helpful, from updating the dot serialization stuff right now:

First, document vertex_index(v), I re-implemented that before realizing it already existed. I was thinking of adding it to interface.rst, as "Vertex Interface". What do you think?

Second, I need to be able to check whether vertices, edges, and the whole graph have attributes, so I know if I need to serialize those as I traverse the graph.

Third, do you think it would be good to have slower edges(g) and vertices(g) functions for graphs that don't provide that concept? As in, a vertex iterator that runs across the edges, ignoring repetition, and an edge iterator that runs across the adjacency/incidence lists.

I'm going to move the random graph generation stuff out of original now, then go back to attribute serialization.

rsofaer commented 11 years ago

What do you think about making the interfaces explicit in code, and implementing @graph_requires by using method_exists! to check whether the required functions actually exist, like BOOST_CONCEPT_ASSERT? I think that would be very helpful in writing graph adapters.

lindahua commented 11 years ago

@rsofaer Checking existence of multiple methods is much more heavier than doing @graph_requires, which simply calls a function that returns a bool.

You can easily adapt any class using @graph_implements to declare what interfaces it is implementing.

rsofaer commented 11 years ago

@lindahua I still think that having the requirements of each concept be executable is valuable. As we have more graph interfaces, some of which will never be part of this repo, I think being able to verify that an interface you're building meets the requirements of the concepts you want to implement will create some peace of mind. What if @graph_implements gives an error or warning if the requirements of the concept aren't actually met?

gaborcsardi commented 10 years ago

Hi all!

Not much happening lately in this thread, and that's probably because you are happy with the data structure you have.

Still, I think soon I'll start experimenting with wrapping igraph in Julia. I was excited about doing this a long time ago, but then https://groups.google.com/forum/#!searchin/julia-dev/igraph$20julia$20wrapping/julia-dev/LzeQYQTdewM/QqR3vB0urSwJ cooled me down. It seemed you wanted to have your own graph package, which is understandable.

I think there are some exciting benefits in wrapping igraph, as opposed to another library, or having your own graph library from scratch.

A huge set of already implemented algorithms. It is fair to assume that Graph.jl will never catch up, nor will BGL, but probably it is not even their goal. This is obviously the biggest benefit for users.
igraph has an "interface generator", which can generate the interface code automatically, based on a description of the igraph functions. Much of the R wrapper is auto-generated, both the R and C code. So I don't have to write wrappers manually, at least not always.
igraph handles a lot of operations using callbacks. This is because we built it with having various high level interfaces in mind. In particular
attributes are handled by the embedding language, callbacks call back to R/Python. This means that attributes can be arbitrary objects from the embedding language, R objects in R, Python objects in Python, Julia objects in Julia;
random number generators are designed to use the embedding language's RNGs;
errors and warnings are reported natively according to the embedding language, in Julia you will get exceptions about igraph errors; memory is properly deallocated in case on an error in igraph or user interruption.

While igraph is huge, the API that deals with the actual data type is thin, 20 functions. The rest of the library are all using this API. This means that replacing the data type is very easy, a student replaced it with an SQL database, as an experiment, just by rewriting those 20 functions.

It is possible to replace the data structure with a Julia object, actually. Essentially, one would need to implement the 20 function API in Julia, and include some default callbacks to it from C. Assuming that calling back to Julia from C is not too costly, this can work just fine, although I am happy with the current data structure igraph has.

Of course I can do the wrapping on my own, I just thought I would let you know.

Disclaimer: I am a developer of igraph, so I am naturally biased.

Best, Gabor

dmbates commented 10 years ago

Very welcome news, @gaborcsardi

I am certainly willing to help if only in answering questions of the form, "I know how to do X in R, what is a good way of doing of doing X in Julia?"

JuliaAttic / OldGraphs.jl

Graph data structure and algorithm design #16