dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[DISCUSSION] [RFC] Multi-edge support #69

Closed BarclayII closed 5 years ago

BarclayII commented 5 years ago

Had a discussion with @ylfdq1118 today. Here is the conclusion we have reached:

Storage

In our current codebase, we store the graph as adjacency lists (node to node). We propose to directly change the storage in the C++ backend to incidence lists: more specifically, an incidence list for inbound edges and another for outbound ones. This implies that we would no longer handle simple graphs and multi-graphs separately, as NetworkX does. This will affect the implementation of the following backend methods:

EDIT4: After all, the changes in the C++ interface only involve changing the return type of get_eid to IdArray and that of batch_get_eid to EdgeArray.

User Interface

Right now, the following user interfaces assume that no multi-edges exist:

Since the interfaces expect to take in and return tensors, we think that having multi-edges will complicate the semantics significantly for get_e_repr and set_e_repr in particular: users would often have to keep track of how many edges exist between two given vertices (e.g. given a multi-graph ((0, 1), (0, 1), (1, 2)), set_edge_repr([0, 1], [1, 2]) would actually require a tensor with 3 rows, which is quite unnatural).
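To make the alignment issue concrete, here is a small standalone sketch (plain NumPy; the toy edge list and query pairs are made up for illustration) showing why two source/target pairs can demand three rows of edge data:

import numpy as np

# Multigraph edges with implicit edge IDs 0, 1, 2: (0, 1), (0, 1), (1, 2)
src = np.array([0, 0, 1])
dst = np.array([1, 1, 2])

# A pair-based setter called with pairs (0, 1) and (1, 2) actually touches
# edge IDs [0, 1] and [2]: three rows in total, so the value tensor cannot
# be aligned one row per query pair.
pairs = [(0, 1), (1, 2)]
matched = [np.where((src == s) & (dst == d))[0] for s, d in pairs]
print(matched)                          # [array([0, 1]), array([2])]
print(sum(len(m) for m in matched))     # 3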

EDIT2: Our suggestion is to

  1. For get_e_repr and set_e_repr, simply raise an error if multiple edges exist between any of the given source/target pairs. We can add nicer semantics later.
  2. For other methods, simply send/apply on all edges in between.
  3. Have a counterpart working on edge IDs for apply_edges, send, send_and_recv, and update_edge (e.g. sendon) as well (set_e_repr and get_e_repr already have their edge ID counterparts), to allow sending/applying on individual edges.

We concluded that switching to incidence matrices in the backend should not break compatibility with the current user interfaces.

Side note: Filtering vertices/edges

Since RGCN already involves edges with different "types", represented by integers, we can expect that later on we will have models that need to perform graph operations on one "type" of node/edge at a time. To be even more general, users may want to operate on a subset of nodes/edges satisfying a certain predicate. We think it would be nice to have a filtering function that returns the subset of given nodes/edges satisfying a given predicate. For example:

# returns all edges whose field t equals 2
eids = g.filter_edges(ALL, lambda e: e['t'] == 2)
# returns the nodes within nids whose maximum component of 'h' is greater than 1
# equivalent to something like nids[g.get_n_repr()['h'][nids].max(1)[0] > 1]
nids = g.filter_nodes(nids, lambda n: n['h'].max(1)[0] > 1)

Of course, the signature does not have to be exactly like this.
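For a rough sense of what such a helper could boil down to, here is a hedged standalone sketch (the dict-of-tensors "edge frame" layout, the helper name, and the toy data are hypothetical, not the actual DGL internals):

import torch

def filter_edges(edge_frame, eids, predicate):
    # edge_frame: dict of field name -> tensor with one row per edge
    sub = {k: v[eids] for k, v in edge_frame.items()}   # row-slice the given edges
    mask = predicate(sub)                                # boolean tensor over those edges
    return eids[mask]

edge_frame = {'t': torch.tensor([0, 2, 2, 1])}
eids = torch.arange(4)
print(filter_edges(edge_frame, eids, lambda e: e['t'] == 2))   # tensor([1, 2])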

gSPMV Optimization

One thing we have noticed is that, if we use incidence matrices in our storage, then builtin reduce functions can always be represented by a gSPMV between the inbound incidence matrix and the message tensors.

However, if the message function is also a builtin, we don't want to actually broadcast the messages onto edges, since doing so can take up much more space. In that case we would rather provide a shortcut that performs the whole update as a single gSPMV on the adjacency matrix.

EDIT: gSPMV on the adjacency matrix with multi-edges is more subtle. Since multi-edges are analogous to duplicate entries in a COO sparse matrix, the actual implementation can differ depending on the form of the message and reduce functions:
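To illustrate the duplicate-entry analogy, a small SciPy sketch (not DGL code; the toy graph and features are made up): a copy-source message followed by a sum reduce on a multigraph is exactly an SpMV with an adjacency matrix that keeps its duplicated COO entries.

import numpy as np
import scipy.sparse as sp

# Multigraph with a duplicated edge: (0, 1), (0, 1), (1, 2)
src = np.array([0, 0, 1])
dst = np.array([1, 1, 2])
h = np.random.randn(3, 4)      # one feature row per node

# The COO adjacency keeps the duplicate (dst=1, src=0) entry;
# SciPy sums duplicates when converting to CSR, which is exactly
# what a sum reduce over multi-edges should do.
A = sp.coo_matrix((np.ones(len(src)), (dst, src)), shape=(3, 3)).tocsr()
out = A @ h                    # node 1 receives h[0] twice
assert np.allclose(out[1], 2 * h[0])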

RGCN example

Under the proposed changes, we expect that the RGCN code can be greatly simplified and that our behavior should match the official implementation. Namely, the message function is simply a batched dot product between the source tensor and the weight matrices corresponding to the edge types, while the reduce function is just a sum (@ylfdq1118, correct me if I'm wrong):

def msg(src, edges, W):
    # W is a 3D tensor of shape (n_types, in_features, out_features), captured in a
    # closure or registered as a parameter of a PyTorch module.
    # src[:, None]: (n_edges, 1, in_features); W[edges['type']]: (n_edges, in_features, out_features)
    return (src[:, None] @ W[edges['type']]).squeeze(1)   # (n_edges, out_features)

reducer = builtin.sum
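As a sanity check on the shapes involved in the batched dot product above, a short standalone PyTorch snippet (the sizes are arbitrary and purely illustrative):

import torch

n_edges, n_types, in_features, out_features = 5, 3, 4, 8
src = torch.randn(n_edges, in_features)            # source features gathered per edge
W = torch.randn(n_types, in_features, out_features)
etype = torch.randint(0, n_types, (n_edges,))      # edge type of each edge

# (n_edges, 1, in_features) @ (n_edges, in_features, out_features)
m = (src[:, None] @ W[etype]).squeeze(1)
assert m.shape == (n_edges, out_features)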

Comments and questions are welcome.

lingfanyu commented 5 years ago

For RGCN, I think the weight W should not be in the message signature; the message function should reference it directly (for example, via a closure).

Here is what the code might look like:

import torch
import torch.nn as nn

class MessageFunc(nn.Module):
    def __init__(self, num_relations, in_feat, out_feat):
        super().__init__()
        # one (in_feat, out_feat) weight matrix per relation type
        self.w = nn.Parameter(torch.Tensor(num_relations, in_feat, out_feat))

    def forward(self, src, edge):
        # (N, 1, in_feat) bmm (N, in_feat, out_feat) -> (N, out_feat)
        return torch.bmm(src['h'].unsqueeze(1), self.w[edge['type']]).squeeze(1)

Then the propagation looks like: g.update_all(msg_func, fn.sum(), update_func)

And we need to lower the SpMV executor from the update_all and send_and_recv APIs down to the recv API, so that every builtin reduce can be done as an SpMV from edge space to node space. If the message function is also a builtin, we can provide a shortcut that rewrites the whole update as an SpMV from node space to node space. This is related to the executor redesign @jermainewang and I previously discussed.
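As a rough illustration of "builtin reduce as an SpMV from edge space to node space" (a SciPy sketch with made-up shapes, not the actual executor code): the inbound incidence matrix has one column per edge, and multiplying it by the per-edge message tensor sum-reduces messages into their destination nodes.

import numpy as np
import scipy.sparse as sp

dst = np.array([1, 1, 2])                  # destination node of each edge
num_nodes, num_edges = 3, len(dst)

# Inbound incidence matrix: B_in[v, e] = 1 iff edge e points into node v.
B_in = sp.coo_matrix((np.ones(num_edges), (dst, np.arange(num_edges))),
                     shape=(num_nodes, num_edges)).tocsr()

msgs = np.random.randn(num_edges, 4)       # one message per edge
reduced = B_in @ msgs                      # sum of inbound messages per node
assert np.allclose(reduced[1], msgs[0] + msgs[1])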

zheng-da commented 5 years ago

@BarclayII @ylfdq1118 Thanks for the proposal. I agree that it's better to use incident lists to support multi-graphs.

Each edge has a unique ID. It shouldn't be a big problem for the API of DGLGraph, such as get_e_repr and set_e_repr.

I wonder if it's too expensive to return lists of lists for get_eids, batch_in_edges and batch_out_edges. What is the problem of returning them as a single list?

How do you want to use the filtering function with other APIs? One possible way is to pass it to APIs such as update_all, to remove some nodes and edges on the fly during the computation. The question is whether this is necessary given the subgraph API. One advantage may be that filtering at runtime gives users more information to decide when to sample an edge or a vertex.

BarclayII commented 5 years ago

> @BarclayII @ylfdq1118 Thanks for the proposal. I agree that it's better to use incident lists to support multi-graphs.
>
> Each edge has a unique ID. It shouldn't be a big problem for the API of DGLGraph, such as get_e_repr and set_e_repr.

The current semantics of get_e_repr and set_e_repr take in pairs of source/target vertices. Under these semantics, if some pairs have more than one edge between them, the number of rows of the edge tensor will no longer match the number of source/target pairs. Users would then have to track which rows belong to which pair of nodes, and we believe that would be quite a hassle; they might as well use get/set_e_repr_by_id in that case. So we believe that adapting the current get/set_e_repr to multigraphs would be subtle and/or redundant.

> I wonder if it's too expensive to return lists of lists for get_eids, batch_in_edges and batch_out_edges. What is the problem of returning them as a single list?

My mistake. I checked the implementation of batched inbound/outbound edge queries and there should be no problem, since we are returning a list of (source, target, edge ID) tuples. I removed that proposal.

That being said, do we need to support batched queries of predecessors/successors? I feel that this should be a nice and convenient addition, but I'm unsure whether we need it right now or in the near future.

> How do you want to use the filtering function with other APIs? One possible way is to pass it to APIs such as update_all, to remove some nodes and edges on the fly during the computation. The question is whether this is necessary given the subgraph API. One advantage may be that filtering at runtime gives users more information to decide when to sample an edge or a vertex.

The point of having the filtering functions is to support selecting a subset of nodes/edges using predicates. For example, one might want to "sample all the nodes whose maximum hidden state component is greater than 1" (as in my second example). The resulting node subset may also change dynamically.

That being said, I believe that users can do the same thing manually (again, as in my second example). But personally I would prefer our framework to support querying by predicates conveniently in some way, whether by returning the selected node/edge IDs, by subgraphing, or otherwise.

jermainewang commented 5 years ago

Could someone briefly describe what an "incident list" is? What is the difference from the current adjacency list here?

zheng-da commented 5 years ago

My understanding is that an "incident list" is more or less the same as an "adjacency list". The only difference is that it allows duplicated edges. I simply used their term. @BarclayII @ylfdq1118 is my understanding correct?

BarclayII commented 5 years ago

Technically "adjacency list" is about "which nodes are the direct successors to the given node", while "incidence list" is about "which edges are outbound from the given node". So the C++ storage itself also contains everything we need for incidence lists. This is good news since we don't need to change the storage (sorry I missed that), and we only need to tweak our code to allow duplicate succ's.
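A tiny illustration of the distinction, using the toy multigraph from above with edges 0: (0, 1), 1: (0, 1), 2: (1, 2) (plain Python, purely illustrative):

# Adjacency list: node -> successor nodes; edge identity/multiplicity is easy to lose.
adj_out = {0: [1, 1], 1: [2], 2: []}

# Incidence list: node -> outbound edge IDs; duplicate edges stay distinct for free.
inc_out = {0: [0, 1], 1: [2], 2: []}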

BarclayII commented 5 years ago

If we don't have any further problems then I assume that I'll go ahead and implement these changes?

lingfanyu commented 5 years ago

In a previous discussion with @jermainewang, we thought the query API should run at a relatively high level (i.e. in Python code) because the query API involves frame storage. But since GraphIdx is decoupled from the storage, I am not sure about the exact structure we are going to use in GraphIdx for efficient edge-type querying (maybe just one extra layer of indirection). Also, we have concerns about whether MultiGraphIdx should be a separate C++ class.

I think we need to be crystal clear about the design since GraphIdx affects a lot of stuff. So just hold on for a while until I get time to work on this. Also, we need to discuss more and push a detailed design doc on GraphIdx for others to comment on.

BarclayII commented 5 years ago

Re query API: I'm thinking of an extra layer of indirection too. Basically, right now the frame storage only allows indexing by column. We can additionally have a "row indexer" for row slicing, following the same design as pandas.DataFrame. Then we only need to pass the row-sliced "subframe" into the predicate.
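A minimal sketch of that idea (the class and method names below are hypothetical, not the actual DGL frame code), assuming a column store of equal-length tensors:

import torch

class Frame:
    def __init__(self, columns):
        self._columns = columns                    # dict: field name -> tensor, same length

    def __getitem__(self, name):                   # column indexing (what exists today)
        return self._columns[name]

    def rows(self, idx):                           # the proposed row indexer
        return Frame({k: v[idx] for k, v in self._columns.items()})

f = Frame({'h': torch.randn(5, 3), 't': torch.tensor([0, 2, 2, 1, 2])})
sub = f.rows(torch.tensor([1, 2, 4]))              # row-sliced "subframe" fed to a predicate
mask = sub['t'] == 2                               # tensor([True, True, True])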

Re MultiGraphIdx: I'm not sure what the concern is with having GraphIdx handle both simple and multi-graphs. In my mind GraphIdx can mostly be kept as it is right now (except edge_id and edge_ids). But I do agree that we need to inspect it in more detail.

jermainewang commented 5 years ago

I see. I think the multi-edge design should live at the same level as the C++ GraphIndex. That's why I was really surprised when I heard from Lingfan that your design has something to do with the frame storage. Currently, Lingfan and I are too busy (we will finally be freed up after tomorrow). But I think there are several things worth thinking about:

BarclayII commented 5 years ago

A cleaned up proposal is available at https://docs.google.com/document/d/12gSTTMUNGVt-ZJkj4VCxUUOYVdSh07tUHKcM3_NU7-c/edit?usp=sharing

Closing this issue for now.