dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[DISCUSSION] [RFC] Multi-edge support #69

Closed BarclayII closed 5 years ago

BarclayII commented 5 years ago

Had a discussion with @ylfdq1118 today. Here is the conclusion we have reached:

Storage

In our current codebase, we store the graph as adjacency lists (node to node). We propose to directly change the storage in the C++ backend to incidence lists: more specifically, an incidence list for inbound edges and another for outbound ones. This implies that we would no longer handle simple graphs and multi-graphs separately, as NetworkX does. This will affect the implementation of the following backend methods:

EDIT4: After all, the changes in the C++ interface only involve changing the return type of get_eid to IdArray and that of batch_get_eid to EdgeArray.

User Interface

Right now, the following user interfaces assume that no multi-edges exist:

Since the interfaces expect to take in and return tensors, we think that having multi-edges will complicate the semantics significantly for get_e_repr and set_e_repr in particular: users would often have to keep track of how many edges exist between two given vertices (e.g. given a multi-graph ((0, 1), (0, 1), (1, 2)), set_edge_repr([0, 1], [1, 2]) would actually require a tensor with 3 rows, which is quite unnatural).
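To make the alignment issue concrete, here is a small standalone sketch (plain NumPy; the toy edge list and query pairs are made up for illustration) showing why two source/target pairs can demand three rows of edge data:

import numpy as np

# Multigraph edges with implicit edge IDs 0, 1, 2: (0, 1), (0, 1), (1, 2)
src = np.array([0, 0, 1])
dst = np.array([1, 1, 2])

# A pair-based setter called with pairs (0, 1) and (1, 2) actually touches
# edge IDs [0, 1] and [2]: three rows in total, so the value tensor cannot
# be aligned one row per query pair.
pairs = [(0, 1), (1, 2)]
matched = [np.where((src == s) & (dst == d))[0] for s, d in pairs]
print(matched)                          # [array([0, 1]), array([2])]
print(sum(len(m) for m in matched))     # 3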

EDIT2: Our suggestion is to

  1. For get_e_repr and set_e_repr, simply raise an error if multiple edges exist between any of the given source/target pairs. We can add nicer semantics later.
  2. For other methods, simply send/apply on all edges in between.
  3. Have a counterpart working on edge IDs for apply_edges, send, send_and_recv, and update_edge (e.g. sendon) as well (set_e_repr and get_e_repr already have their edge ID counterparts), to allow sending/applying on individual edges.

We concluded that switching to incidence matrices in the backend should not break compatibility with the current user interfaces.

Side note: Filtering vertices/edges

Since RGCN already involves edges with different "types", represented by integers, we can expect that later on we will have models that need to perform graph operations on one "type" of node/edge at a time. To be even more general, users may want to operate on a subset of nodes/edges satisfying a certain predicate. We think it would be nice to have a filtering function that returns the subset of given nodes/edges satisfying a given predicate. For example:

# returns all edges whose field t equals 2
eids = g.filter_edges(ALL, lambda e: e['t'] == 2)
# returns the nodes within nids whose maximum component of 'h' is greater than 1
# equivalent to something like nids[g.get_n_repr()['h'][nids].max(1)[0] > 1]
nids = g.filter_nodes(nids, lambda n: n['h'].max(1)[0] > 1)

Of course, the signature does not have to be exactly like this.
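For a rough sense of what such a helper could boil down to, here is a hedged standalone sketch (the dict-of-tensors "edge frame" layout, the helper name, and the toy data are hypothetical, not the actual DGL internals):

import torch

def filter_edges(edge_frame, eids, predicate):
    # edge_frame: dict of field name -> tensor with one row per edge
    sub = {k: v[eids] for k, v in edge_frame.items()}   # row-slice the given edges
    mask = predicate(sub)                                # boolean tensor over those edges
    return eids[mask]

edge_frame = {'t': torch.tensor([0, 2, 2, 1])}
eids = torch.arange(4)
print(filter_edges(edge_frame, eids, lambda e: e['t'] == 2))   # tensor([1, 2])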

gSPMV Optimization

One thing we have noticed is that, if we use incidence matrices in our storage, then builtin reduce functions can always be represented by a gSPMV between the inbound incidence matrix and the message tensors.

However, if the message function is also a builtin, we don't want to actually broadcast the messages onto edges, since doing so can take up much more space. In that case we would rather provide a shortcut that performs the whole update as a single gSPMV on the adjacency matrix.

EDIT: gSPMV on the adjacency matrix with multi-edges is more subtle. Since multi-edges are analogous to duplicate entries in a COO sparse matrix, the actual implementation can differ depending on the form of the message and reduce functions:
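To illustrate the duplicate-entry analogy, a small SciPy sketch (not DGL code; the toy graph and features are made up): a copy-source message followed by a sum reduce on a multigraph is exactly an SpMV with an adjacency matrix that keeps its duplicated COO entries.

import numpy as np
import scipy.sparse as sp

# Multigraph with a duplicated edge: (0, 1), (0, 1), (1, 2)
src = np.array([0, 0, 1])
dst = np.array([1, 1, 2])
h = np.random.randn(3, 4)      # one feature row per node

# The COO adjacency keeps the duplicate (dst=1, src=0) entry;
# SciPy sums duplicates when converting to CSR, which is exactly
# what a sum reduce over multi-edges should do.
A = sp.coo_matrix((np.ones(len(src)), (dst, src)), shape=(3, 3)).tocsr()
out = A @ h                    # node 1 receives h[0] twice
assert np.allclose(out[1], 2 * h[0])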

RGCN example

Under the proposed changes, we expect that the RGCN code can be greatly simplified and that our behavior should match the official implementation. Namely, the message function is simply a batched dot product between the source tensor and the weight matrices corresponding to the edge types, while the reduce function is just a sum (@ylfdq1118, correct me if I'm wrong):

def msg(src, edges, W):
    # W is a 3D tensor of shape (n_types, in_features, out_features), captured in a
    # closure or registered as a parameter of a PyTorch module.
    # src[:, None]: (n_edges, 1, in_features); W[edges['type']]: (n_edges, in_features, out_features)
    return (src[:, None] @ W[edges['type']]).squeeze(1)   # (n_edges, out_features)

reducer = builtin.sum
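As a sanity check on the shapes involved in the batched dot product above, a short standalone PyTorch snippet (the sizes are arbitrary and purely illustrative):

import torch

n_edges, n_types, in_features, out_features = 5, 3, 4, 8
src = torch.randn(n_edges, in_features)            # source features gathered per edge
W = torch.randn(n_types, in_features, out_features)
etype = torch.randint(0, n_types, (n_edges,))      # edge type of each edge

# (n_edges, 1, in_features) @ (n_edges, in_features, out_features)
m = (src[:, None] @ W[etype]).squeeze(1)
assert m.shape == (n_edges, out_features)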

Comments and questions are welcome.

lingfanyu commented 5 years ago

For RGCN, I think the weight W should not be in the message signature; the message function should reference it directly (for example, via a closure).

Here is what the code might look like:

import torch
import torch.nn as nn

class MessageFunc(nn.Module):
    def __init__(self, num_relations, in_feat, out_feat):
        super().__init__()
        # one (in_feat, out_feat) weight matrix per relation type
        self.w = nn.Parameter(torch.Tensor(num_relations, in_feat, out_feat))

    def forward(self, src, edge):
        # (N, 1, in_feat) bmm (N, in_feat, out_feat) -> (N, out_feat)
        return torch.bmm(src['h'].unsqueeze(1), self.w[edge['type']]).squeeze(1)

Then the propagation looks like: g.update_all(msg_func, fn.sum(), update_func)

And we need to lower the SpMV executor from the update_all and send_and_recv APIs down to the recv API, so that every builtin reduce can be done as an SpMV from edge space to node space. If the message function is also a builtin, we can provide a shortcut that rewrites the whole update as an SpMV from node space to node space. This is related to the executor redesign @jermainewang and I previously discussed.
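As a rough illustration of "builtin reduce as an SpMV from edge space to node space" (a SciPy sketch with made-up shapes, not the actual executor code): the inbound incidence matrix has one column per edge, and multiplying it by the per-edge message tensor sum-reduces messages into their destination nodes.

import numpy as np
import scipy.sparse as sp

dst = np.array([1, 1, 2])                  # destination node of each edge
num_nodes, num_edges = 3, len(dst)

# Inbound incidence matrix: B_in[v, e] = 1 iff edge e points into node v.
B_in = sp.coo_matrix((np.ones(num_edges), (dst, np.arange(num_edges))),
                     shape=(num_nodes, num_edges)).tocsr()

msgs = np.random.randn(num_edges, 4)       # one message per edge
reduced = B_in @ msgs                      # sum of inbound messages per node
assert np.allclose(reduced[1], msgs[0] + msgs[1])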

zheng-da commented 5 years ago

@BarclayII @ylfdq1118 Thanks for the proposal. I agree that it's better to use incident lists to support multi-graphs.

Each edge has a unique ID. It shouldn't be a big problem for the API of DGLGraph, such as get_e_repr and set_e_repr.

I wonder if it's too expensive to return lists of lists for get_eids, batch_in_edges and batch_out_edges. What is the problem of returning them as a single list?

How do you want to use the filtering function with other APIs? One possible way is to pass it to APIs such as update_all, to remove some nodes and edges on the fly during the computation. The question is whether this is necessary given the subgraph API. One advantage may be that filtering at runtime gives users more information to decide when to sample an edge or a vertex.

BarclayII commented 5 years ago

> @BarclayII @ylfdq1118 Thanks for the proposal. I agree that it's better to use incident lists to support multi-graphs.
>
> Each edge has a unique ID. It shouldn't be a big problem for the API of DGLGraph, such as get_e_repr and set_e_repr.

The current semantics of get_e_repr and set_e_repr take in pairs of source/target vertices. Under these semantics, if some pairs have more than one edge between them, the number of rows of the edge tensor will no longer match the number of source/target pairs. Users would then have to track which rows belong to which pair of nodes, and we believe that would be quite a hassle; they might as well use get/set_e_repr_by_id in that case. So we believe that adapting the current get/set_e_repr to multigraphs would be subtle and/or redundant.

> I wonder if it's too expensive to return lists of lists for get_eids, batch_in_edges and batch_out_edges. What is the problem of returning them as a single list?

My mistake. I checked the implementation of batched inbound/outbound edge queries and there should be no problem, since we are returning a list of (source, target, edge ID) tuples. I removed that proposal.

That being said, do we need to support batched queries of predecessors/successors? I feel that this should be a nice and convenient addition, but I'm unsure whether we need it right now or in the near future.

> How do you want to use the filtering function with other APIs? One possible way is to pass it to APIs such as update_all, to remove some nodes and edges on the fly during the computation. The question is whether this is necessary given the subgraph API. One advantage may be that filtering at runtime gives users more information to decide when to sample an edge or a vertex.

The point of having the filtering functions is to support selecting a subset of nodes/edges using predicates. For example, one might want to "sample all the nodes whose maximum hidden state component is greater than 1" (as in my second example). The resulting node subset may also change dynamically.

That being said, I believe that users can do the same thing manually (again, as in my second example). But personally I would prefer our framework to support querying by predicates conveniently in some way, whether by returning the selected node/edge IDs, by subgraphing, or otherwise.

jermainewang commented 5 years ago

Could someone briefly describe what an "incident list" is? What is the difference from the current adjacency list here?

zheng-da commented 5 years ago

My understanding is that an "incident list" is more or less the same as an "adjacency list". The only difference is that it allows duplicated edges. I simply used their term. @BarclayII @ylfdq1118 is my understanding correct?

BarclayII commented 5 years ago

Technically "adjacency list" is about "which nodes are the direct successors to the given node", while "incidence list" is about "which edges are outbound from the given node". So the C++ storage itself also contains everything we need for incidence lists. This is good news since we don't need to change the storage (sorry I missed that), and we only need to tweak our code to allow duplicate succ's.
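A tiny illustration of the distinction, using the toy multigraph from above with edges 0: (0, 1), 1: (0, 1), 2: (1, 2) (plain Python, purely illustrative):

# Adjacency list: node -> successor nodes; edge identity/multiplicity is easy to lose.
adj_out = {0: [1, 1], 1: [2], 2: []}

# Incidence list: node -> outbound edge IDs; duplicate edges stay distinct for free.
inc_out = {0: [0, 1], 1: [2], 2: []}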

BarclayII commented 5 years ago

If we don't have any further problems then I assume that I'll go ahead and implement these changes?

lingfanyu commented 5 years ago

In a previous discussion with @jermainewang, we thought the query API should run at a relatively high level (i.e. in Python code) because the query API involves frame storage. But since GraphIdx is decoupled from the storage, I am not sure about the exact structure we are going to use in GraphIdx for efficient edge-type querying (maybe just one extra layer of indirection). Also, we have concerns about whether MultiGraphIdx should be a separate C++ class.

I think we need to be crystal clear about the design since GraphIdx affects a lot of stuff. So just hold on for a while until I get time to work on this. Also, we need to discuss more and push a detailed design doc on GraphIdx for others to comment on.

BarclayII commented 5 years ago

Re query API: I'm thinking of an extra layer of indirection too. Basically, right now the frame storage only allows indexing by column. We can additionally have a "row indexer" for row slicing, following the same design as pandas.DataFrame. Then we only need to pass the row-sliced "subframe" into the predicate.
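A minimal sketch of that idea (the class and method names below are hypothetical, not the actual DGL frame code), assuming a column store of equal-length tensors:

import torch

class Frame:
    def __init__(self, columns):
        self._columns = columns                    # dict: field name -> tensor, same length

    def __getitem__(self, name):                   # column indexing (what exists today)
        return self._columns[name]

    def rows(self, idx):                           # the proposed row indexer
        return Frame({k: v[idx] for k, v in self._columns.items()})

f = Frame({'h': torch.randn(5, 3), 't': torch.tensor([0, 2, 2, 1, 2])})
sub = f.rows(torch.tensor([1, 2, 4]))              # row-sliced "subframe" fed to a predicate
mask = sub['t'] == 2                               # tensor([True, True, True])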

Re MultiGraphIdx: I'm not sure what the concern is with having GraphIdx handle both simple and multi-graphs. In my mind GraphIdx can mostly be kept as it is right now (except edge_id and edge_ids). But I do agree that we need to inspect it in more detail.

jermainewang commented 5 years ago

I see. I think the multi-edge design should live at the same level as the C++ GraphIndex. That's why I was really surprised when I heard from Lingfan that your design has something to do with the frame storage. Currently, Lingfan and I are too busy (we will finally be freed up after tomorrow). But I think there are several things worth thinking about:

BarclayII commented 5 years ago

A cleaned up proposal is available at https://docs.google.com/document/d/12gSTTMUNGVt-ZJkj4VCxUUOYVdSh07tUHKcM3_NU7-c/edit?usp=sharing

Closing this issue for now.