BarclayII closed this issue 5 years ago.
For RGCN, I think the weight `W` should not be in the message signature, and the message function should directly reference it (for example, using a closure).
Here is what the code might look like:
```python
class MessageFunc(nn.Module):
    def __init__(self, num_relations, in_feat, out_feat):
        super().__init__()
        # One weight matrix per relation type.
        self.w = nn.Parameter(torch.Tensor(num_relations, in_feat, out_feat))

    def forward(self, src, edge):
        # Batched matmul: (E, 1, in_feat) @ (E, in_feat, out_feat) -> (E, 1, out_feat)
        return torch.bmm(src['h'].unsqueeze(1), self.w[edge['type']]).squeeze(1)
```
Then the propagation looks like `g.update_all(msg_func, fn.sum(), update_func)`.
And we need to lower the SpMV executor from the `update_all` and `send_and_recv` APIs to the `recv` API, so that all builtin reduces can be done as an SpMV from edge space to node space. And if the message function is also a builtin, we can provide a shortcut to rewrite it as an SpMV from node space to node space. This is related to the executor redesign @jermainewang and I previously discussed.
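To make the node-space-to-node-space shortcut concrete, here is a minimal PyTorch sketch (toy graph, names illustrative, not DGL code): with a builtin `copy_src` message and `sum` reduce, the whole propagation collapses to a sparse-dense matmul against the adjacency matrix built from the edge list.

```python
import torch

# Toy multigraph: edges (src -> dst), including a duplicate edge 0 -> 1.
src = torch.tensor([0, 0, 1])
dst = torch.tensor([1, 1, 2])
h = torch.tensor([[1.0], [2.0], [3.0]])  # one feature per node

# copy_src message + sum reduce is equivalent to A @ h, where row i of A
# has a 1 for every edge pointing into node i. Duplicate COO entries are
# coalesced by summation, which matches the sum reduce.
n = 3
adj = torch.sparse_coo_tensor(
    torch.stack([dst, src]),      # rows = destinations, cols = sources
    torch.ones(src.shape[0]),
    (n, n),
).coalesce()
out = torch.sparse.mm(adj, h)
# node 1 receives h[0] twice (duplicate edge), so out[1] == 2 * h[0]
```

Note that this shortcut only works when the reduce coalesces duplicates the same way the sparse format does; that caveat comes up again later in the thread.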
@BarclayII @ylfdq1118 Thanks for the proposal. I agree that it's better to use incident lists to support multi-graphs.

Each edge has a unique ID. It shouldn't be a big problem for the API of `DGLGraph`, such as `get_e_repr` and `set_e_repr`.
I wonder if it's too expensive to return lists of lists for `get_eids`, `batch_in_edges` and `batch_out_edges`. What is the problem of returning them as a single list?
How do you want to use the filtering function with other APIs? One possible way is to pass them to the APIs, such as `update_all`, to remove some nodes and edges on the fly during the computation. The question is whether this is necessary given the subgraph API. One advantage may be that filtering at runtime gives users more information to decide when to sample an edge or a vertex.
> @BarclayII @ylfdq1118 Thanks for the proposal. I agree that it's better to use incident lists to support multi-graphs. Each edge has a unique ID. It shouldn't be a big problem for the API of `DGLGraph`, such as `get_e_repr` and `set_e_repr`.
The current semantics of `get_e_repr` and `set_e_repr` take in pairs of source/target vertices. Under these semantics, if the numbers of edges between those pairs differ, then the number of rows of the edge tensor will differ from the number of source/target pairs. Users would have to figure out which pair of nodes each row belongs to, and we believe that would be quite a hassle; they may as well use `get/set_e_repr_by_id` in this case. So we believe that adapting the current `get/set_e_repr` to multigraphs is subtle and/or redundant.
> I wonder if it's too expensive to return lists of lists for `get_eids`, `batch_in_edges` and `batch_out_edges`. What is the problem of returning them as a single list?
My mistake. I checked the implementation of batched inbound/outbound edge queries and there should be no problem, since we are returning a list of (source, target, edge ID) tuples. I removed that proposal.
That being said, do we need to support batched queries of predecessors/successors? I feel that this should be a nice and convenient addition, but I'm unsure whether we need it right now or in the near future.
> How do you want to use the filtering function with other APIs? One possible way is to pass them to the APIs, such as `update_all`, to remove some nodes and edges on the fly during the computation. The question is whether this is necessary given the subgraph API? One advantage may be that filtering at the runtime may give more information to users to decide when to sample an edge or a vertex.
The point of having the filtering functions is to support selecting a subset of nodes/edges using predicates. For example, one might want to "sample all the nodes with the maximum hidden state component greater than 1" (as in my second example). The resulting node subset may also change dynamically.
That being said, I believe that users can do the same thing manually (again in my second example). But personally I would prefer our framework to somehow support querying by predicates conveniently, by returning the selected node/edge IDs, subgraphing, or any other way.
Could someone briefly describe what an "incident list" is? What is the difference from the current adjacency list here?
My understanding is that "incident list" is more or less the same as "adjacency list". The only difference is that it allows duplicated edges. I simply use their term. @BarclayII @ylfdq1118 is my understanding correct?
Technically, "adjacency list" is about "which nodes are the direct successors of the given node", while "incidence list" is about "which edges are outbound from the given node". So the C++ storage itself already contains everything we need for incidence lists. This is good news since we don't need to change the storage (sorry I missed that); we only need to tweak our code to allow duplicate `succ`s.
If we don't have any further problems then I assume that I'll go ahead and implement these changes?
In previous discussion with @jermainewang , we think the query API is running at relatively high level (meaning in python code) because query API involves frame storage. But since GraphIdx is decoupled from storage, I am not sure about the exact structure we are going to use in GraphIdx for efficient edge type querying (maybe just one extra layer of indirection). Also, we have concerns about whether MultiGraphIdx should be a separate C++ class.
I think we need to be crystal clear about the design since GraphIdx affects a lot of stuff. So just hold on for a while until I get time to work on this. Also, we need to discuss more and push a detailed design doc on GraphIdx for others to comment on.
Re query API: I'm thinking of an extra layer of indirection too. Basically, right now the frame storage only allows indexing by column. We can additionally have a "row indexer" for row slicing, following the same design as `pandas.DataFrame`. Then we only need to pass the row-sliced "subframe" into the predicate.
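A toy sketch of the row-indexer idea (a hypothetical `Frame` class, not DGL's actual frame storage): the column store gains a `rows` method that returns a row-sliced subframe, analogous to `DataFrame.iloc`.

```python
class Frame:
    """Toy column store: maps column name -> list of per-row values."""

    def __init__(self, columns):
        self._columns = columns

    def __getitem__(self, name):
        return self._columns[name]

    def rows(self, idx):
        """Return a row-sliced 'subframe' (like DataFrame.iloc[idx])."""
        return Frame({k: [v[i] for i in idx] for k, v in self._columns.items()})


f = Frame({'h': [1, 2, 3, 4], 'type': [0, 1, 0, 1]})
sub = f.rows([1, 3])
# sub['h'] == [2, 4]; a predicate now only ever sees the sliced rows
```

The point is that predicates never need to know about the full frame; they just receive the subframe for the rows being queried.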
Re `MultiGraphIdx`: I'm not sure what the concern is with having `GraphIdx` handle both simple- and multi-graphs. In my mind, `GraphIdx` can mostly be kept as it is right now (except `edge_id` and `edge_ids`). But I do agree that we need to inspect it in more detail.
I see. I think the multi-edge design should be at the same level as the C++ GraphIndex. That's why I was really surprised when I heard from Lingfan that your design has something to do with the frame storage. Currently, Lingfan and I are too busy (we will finally be freed up after tomorrow). But I think there are several things worth thinking about:
A cleaned up proposal is available at https://docs.google.com/document/d/12gSTTMUNGVt-ZJkj4VCxUUOYVdSh07tUHKcM3_NU7-c/edit?usp=sharing
Closing this issue for now.
Had a discussion with @ylfdq1118 today. Here is the conclusion we have reached:
Storage
In our current codebase, we store the graph as adjacency lists (node to node). We propose to directly change the storage in C++ backend to incidence lists, more specifically, an incidence list for inbound edges and another for outbound ones. This implies that we are no longer handling simple graphs and multi-graphs separately like in NetworkX. This will affect the implementation of the following backend methods:
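A toy illustration of the proposed storage (plain Python for exposition, not the actual C++ backend): each edge gets a unique ID, and the two incidence lists map each node to the IDs of its inbound/outbound edges, so duplicate edges pose no problem.

```python
# Edges of a multigraph, indexed by edge ID: (src, dst).
edges = [(0, 1), (0, 1), (1, 2)]  # duplicate edge 0 -> 1 is allowed

# Incidence lists: node -> list of incident edge IDs.
out_incidence = {}
in_incidence = {}
for eid, (u, v) in enumerate(edges):
    out_incidence.setdefault(u, []).append(eid)
    in_incidence.setdefault(v, []).append(eid)

# out_incidence == {0: [0, 1], 1: [2]}
# in_incidence  == {1: [0, 1], 2: [2]}
```

Contrast with an adjacency list `{0: [1], 1: [2]}`, which cannot distinguish the two parallel edges 0 -> 1.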
* EDIT4: this should be fine. `pred` and `succ`: Essentially, we will have to find the inbound/outbound edges of the given vertices first, then look up the source/target vertex of those edges (we could also keep a predecessor/successor list in parallel to the inbound/outbound edge lists, but this essentially duplicates the edge list and I'm not sure whether we would like it). These methods should support a boolean argument `unique` to dictate whether we expect a unique list of vertices (useful for graph traversal) or not (useful for deciding source/target nodes for `message` and `apply_edge` functions).
* EDIT4: this should also be fine. `vertex_subgraph`: Similar to `pred`/`succ`, we need to do two-step queries.
* `get_eid`: Now it would return a list of edge IDs instead of one.
* `in_edges`, `out_edges`: EDIT3: this should be fine since we are returning `(source, target, edge ID)` tuples.
* `get_eids`, `batch_in_edges`, `batch_out_edges`: We haven't gone through the semantics of these methods yet. I would suggest returning a list of lists if possible, following the same practice as `igraph.Graph.neighborhood()`. Returning a padded array (along with the lengths) can also do the job if we wish to enable padding (which, I assume, is a separate functionality we wish to consider).

EDIT4: after all, the changes in the C++ interface only involve changing the return type of `get_eid` to `IdArray` and of `batch_get_eid` to `EdgeArray`.
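The `unique` flag on `pred`/`succ` can be sketched like this (illustrative Python, not the backend implementation):

```python
edges = [(0, 1), (0, 1), (1, 2)]  # multigraph; duplicate edge 0 -> 1

def succ(node, unique=True):
    """Successors of `node`, found via its outbound edges.

    unique=True  -> deduplicated vertex list (useful for graph traversal)
    unique=False -> one entry per outbound edge (useful for message passing)
    """
    targets = [v for (u, v) in edges if u == node]
    return sorted(set(targets)) if unique else targets

# succ(0, unique=True)  == [1]
# succ(0, unique=False) == [1, 1]
```

The non-unique variant is exactly what a message function needs: one source row per edge, duplicates included.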
User Interface
Right now, the following user interfaces assume that no multi-edges exist:

* `set_e_repr`
* `get_e_repr`
* `apply_edges`
* `send` (and consequently `send_and_recv`, `pull`, `push`, and `update_all`; `pull`, `push` and `update_all` only assume this in the implementation due to their usage of `send`, so the interface itself is fine)
* `update_edge`
Since the interfaces expect to take in and return tensors, we think that having multi-edges will complicate the semantics significantly for `get_e_repr` and `set_e_repr` in particular: the users often have to keep track of how many edges there are between two given vertices (e.g. given a multi-graph ((0, 1), (0, 1), (1, 2)), `set_e_repr([0, 1], [1, 2])` will actually require a tensor with 3 rows, which is quite unnatural).

EDIT2: Our suggestion is to have `get_e_repr` and `set_e_repr` simply raise an error if a multi-edge exists between any of the given source/target pairs; we can have nicer semantics later. We would also like edge-ID-based variants for `apply_edges`, `send`, `send_and_recv`, and `update_edge` (e.g. `send_on`) as well (`set_e_repr` and `get_e_repr` already have their edge-ID counterparts), to allow sending/applying on individual edges.

We concluded that switching to incidence lists in the backend should not break compatibility with the current user interfaces.
Side note: Filtering vertices/edges
Since RGCN already involves edges with different "types", which are represented by integers, we can expect that later on we will have models requiring to perform graph operations on a "type" of node/edge at a time. To be even more general, the users can do stuff on a subset of nodes/edges satisfying a certain predicate. We think it will be nice to have a filtering function which returns a subset of given nodes/edges that satisfies a given predicate. For example:
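The example code block did not survive extraction; here is a sketch of what such a filtering function might look like (the name `filter_nodes` and its signature are assumptions, not the proposed API):

```python
import torch

def filter_nodes(node_repr, predicate):
    """Return the IDs of nodes whose representation satisfies `predicate`.

    `node_repr` is a dict of tensors (one row per node); `predicate`
    maps it to a boolean mask over the nodes.
    """
    mask = predicate(node_repr)
    return torch.nonzero(mask, as_tuple=False).squeeze(1)

# "sample all the nodes with the maximum hidden state component greater than 1"
h = torch.tensor([[0.5, 2.0], [0.1, 0.2], [3.0, 0.0]])
ids = filter_nodes({'h': h}, lambda frame: frame['h'].max(dim=1).values > 1)
# ids == tensor([0, 2])
```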
Of course, the signature does not have to be exactly like this.
gSPMV Optimization
One thing we have noticed is that, if we use incidence matrices in our storage, then builtin reduce functions can always be represented by a gSPMV between the inbound incidence matrix and the message tensors.
However, if the message function is also a builtin, we don't want to actually broadcast the messages onto edges, since that would potentially take up much more space; in that case we can rewrite the whole operation as a single gSPMV from node space to node space.
EDIT: gSPMV on the adjacency matrix with multi-edges is more subtle. Since multi-edges are analogous to duplicate entries in a COO sparse matrix, the actual implementation can differ depending on the message and reduce functions:

* For message functions `copy_src`/`src_mul_edge` and reduce function `sum`, on PyTorch we can still stick with existing SPMM multiplications by directly constructing the (weighted) adjacency matrix from edge lists; PyTorch will automatically coalesce duplicated entries by summing them up. I don't know if this is the case for MXNet and TensorFlow.
* For reduce function `max`, the coalescing operation is no longer a sum but something like max pooling, so direct sparse tensor construction in PyTorch no longer works.

RGCN example
Under the proposed changes, we expect that the RGCN code can be greatly simplified, and our behavior could match the official implementations. Namely, the message function is simply a batched dot product between the source tensor and the weight matrices corresponding to the edge types, while the reduce function is just a sum (@ylfdq1118 correct me if I'm wrong):
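The code block from the original post did not survive extraction; here is a sketch of what the simplified message/reduce pair might look like (`num_relations`, `in_feat`, `out_feat` and the exact callback signatures are assumptions):

```python
import torch

num_relations, in_feat, out_feat = 2, 4, 3
w = torch.randn(num_relations, in_feat, out_feat)

def message(src, edge):
    # Batched dot product: pick the weight matrix for each edge's type,
    # then multiply the (E, 1, in_feat) source features against it.
    return torch.bmm(src['h'].unsqueeze(1), w[edge['type']]).squeeze(1)

def reduce(node, msgs):
    # Plain sum over the incoming messages (here msgs is (N, deg, out_feat)).
    return {'h': msgs.sum(dim=1)}

# Toy check: 3 edges, each carrying a 4-dim source feature.
src = {'h': torch.randn(3, in_feat)}
edge = {'type': torch.tensor([0, 1, 0])}
m = message(src, edge)  # shape (3, out_feat)
```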
Comments and questions are welcome.