dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[RFC] Graph level data #737

Closed jermainewang closed 5 years ago

jermainewang commented 5 years ago

Moved from #714. Currently, we have node-level and edge-level data (e.g. ndata and edata). The RFC proposes to add a dictionary for graph-level data (using torch syntax as an example):

g.gdata['embed'] = torch.randn((100,))  # could be a multi-dimensional array
g.gdata['label'] = torch.tensor([0])    # scalar data stored as a length-one tensor

I could foresee the following considerations:

  1. Restriction on the graph data. Our node-level and edge-level data have to be ndarrays. Shall we restrict the graph data to be the same? One subtlety is that when setting scalar-type graph data, the user needs to use a length-one array (the 'label' data in the example above).
  2. Batching multiple graphs. dgl.batch needs to handle graph-level data too. The behavior should mirror how node/edge-level data are batched. For example, we need to throw errors if graph-level data under the same key cannot be stacked (or the key is missing for some graphs); see the sketch after this list. A similar change needs to be made to dgl.unbatch.
  3. Graph storage format. A new section to store graph-level data needs to be added. This means if the RFC is accepted, it should wait for PR #728 to be finished.
  4. Message passing UDFs. We might need to add a graph-level data frame to the arguments of the node/edge UDFs:
    def my_edge_udf(edges, gdata):
      pass
    def my_node_udf(nodes, gdata):
      pass

    Compared with our previous suggestion to use a function closure:

    some_graph_data = ...
    def my_edge_udf(edges):
      foo(some_graph_data)  # could use graph data here

    Also, this change would break backward compatibility. Any good ideas?
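A minimal sketch of the batching behavior described in point 2, assuming the hypothetical `gdata` dictionary from this RFC (not an actual DGL API):

    import torch

    def batch_gdata(graphs):
        # Hypothetical: stack graph-level data along a new batch dimension,
        # mirroring how ndata/edata are concatenated by dgl.batch.
        keys = set(graphs[0].gdata.keys())
        for g in graphs:
            if set(g.gdata.keys()) != keys:
                raise ValueError('graph-level data keys differ across graphs')
        batched = {}
        for key in keys:
            try:
                # fails if tensors under the same key cannot be stacked
                batched[key] = torch.stack([g.gdata[key] for g in graphs], dim=0)
            except RuntimeError as err:
                raise ValueError(f'cannot stack gdata[{key!r}]') from err
        return batched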

Alternatives

Note that graph-level data could be maintained outside of DGLGraph, because it consists of dense tensors and the most common operations are stacking/unstacking them. UDFs can also access them via function closure, which is quite neat TBH. On the other hand, adding graph-level data requires changing the UDF signatures and potentially breaks user code.
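For comparison, a minimal sketch of the closure-based alternative: keep the graph-level tensors outside DGLGraph, stack them next to `dgl.batch`, and let UDFs capture them via closure (this uses the current `dgl.graph` constructor for brevity; all names are illustrative):

    import dgl
    import torch

    # Two small graphs; per-graph data is kept outside the DGLGraph objects.
    g1 = dgl.graph(([0, 1], [1, 2]))        # 3 nodes, 2 edges
    g2 = dgl.graph(([0, 1, 2], [1, 2, 0]))  # 3 nodes, 3 edges
    g1.ndata['h'] = torch.randn(3, 4)
    g2.ndata['h'] = torch.randn(3, 4)
    graph_embeds = [torch.randn(8), torch.randn(8)]  # the would-be gdata

    bg = dgl.batch([g1, g2])                    # node/edge data batched by DGL
    stacked_embeds = torch.stack(graph_embeds)  # graph-level data stacked by the user

    def my_edge_udf(edges):
        # the stacked graph-level tensor is captured via closure,
        # so the UDF signature stays unchanged
        scale = stacked_embeds.norm()
        return {'m': edges.src['h'] * scale}

    bg.apply_edges(my_edge_udf)  # bg.edata['m'] now holds the results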

VoVAllen commented 5 years ago

Can we use edges.g.data / nodes.g.data in the UDF instead of an extra graph_data argument? edges.g.data should have the same shape as edges.data[...], with the corresponding gdata in each row.
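A sketch of the row alignment edges.g.data would provide, built by hand with `torch.repeat_interleave` (assuming a DGL version where `batch_num_edges()` returns per-graph edge counts; edges.g.data itself is hypothetical):

    import dgl
    import torch

    g1 = dgl.graph(([0, 1], [1, 2]))        # 2 edges
    g2 = dgl.graph(([0, 1, 2], [1, 2, 0]))  # 3 edges
    bg = dgl.batch([g1, g2])

    gdata = torch.randn(2, 8)  # one row of graph-level data per graph
    per_edge = torch.repeat_interleave(gdata, bg.batch_num_edges(), dim=0)
    # per_edge has shape (bg.num_edges(), 8): row i holds the gdata of the
    # graph that edge i belongs to -- the alignment edges.g.data would expose.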

mufeili commented 5 years ago
  1. What are some other possibilities if we do not restrict graph data to be ndarrays?
  2. We could also allow users to specify initializers for when a key does not exist for some graphs.
  3. If there are no really serious performance issues, I think I'd prefer the suggested alternative and keep the handle lightweight. This would also allow more flexible operations for model developers.

VoVAllen commented 5 years ago

The reason we want to add gdata is mainly the BatchedGraph case; otherwise the graph data can be handled by the user. Therefore, non-ndarray data makes no sense when we want to batch it, and I'm for restricting gdata to ndarrays only. We can have a suggested namespace for users to store non-ndarray information, such as the corresponding SMILES string for a molecule graph; however, DGL won't do anything with that field.

The initializer is a bit weird. The reason we need initializers is that we might do partial updates on a node/edge frame; that does not seem to be the case for graph-level data.

One case where gdata could be helpful is something like Deep Graph Infomax, where the user wants interaction between graph-level data and node-level data while multiple graphs are batched together.
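A rough sketch of that interaction on a batched graph using DGL's existing readout/broadcast helpers (dgl.mean_nodes / dgl.broadcast_nodes) rather than the proposed gdata:

    import dgl
    import torch

    g1 = dgl.graph(([0, 1], [1, 2]))
    g2 = dgl.graph(([0, 1, 2], [1, 2, 0]))
    g1.ndata['h'] = torch.randn(3, 16)
    g2.ndata['h'] = torch.randn(3, 16)
    bg = dgl.batch([g1, g2])

    summary = dgl.mean_nodes(bg, 'h')              # per-graph summary, shape (2, 16)
    per_node = dgl.broadcast_nodes(bg, summary)    # summary repeated for each node
    score = (bg.ndata['h'] * per_node).sum(dim=1)  # DGI-style node-vs-summary scores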

yzh119 commented 5 years ago

My two cents: I love the idea of graph-level data and agree with all four points made by @jermainewang . If we use gdata in message-passing UDFs, I recommend we also design a set of built-in operations (g_op_v, g_op_u, u_op_g and v_op_g) and dedicated kernels for them.

As for compatibility, for now we can add two sets of UDFs, gather and broadcast, for graph-level data, and leave the interface of the other UDFs unchanged.

For message/reduce functions, there are some cases where gdata could be useful [equations shown as images in the original comment]. We need to survey whether such operations in existing papers can all be decomposed into the forms shown in those images; if so, we do not need to let gdata appear in message/reduce functions.

gdata is not only useful for DGI; considering node pooling operations, we can write most of them as a combination of g_op_v/v_op_g.

jermainewang commented 5 years ago

Summary of discussion:

Overall, there is no advantage to a dedicated graph-level data API beyond syntactic convenience at the moment. Thus we reject this RFC.

ZhaofengWu commented 4 years ago

Following up on this: if we do something like [setattr(g, 'g_data_field', data) for g, data in zip(graphs, all_data)], where each data is a torch Tensor, could you elaborate on how batching works? In other words, if we do batched = dgl.batch(graphs), what happens to the g_data_field of each graph? Are they automatically concatenated? If not, how can we access the individual g_data_field of each graph from batched?

EDIT: re-raised in #1449

yobibyte commented 3 years ago

> Summary of discussion:
>
> * Graph-level data can be implemented without introducing new APIs.
>
> * The implementation effort, depending on the specific scenario, does not pose a significant difficulty.
>
>   * To emulate the idea of `gdata`, the user can directly `setattr` on a graph object.
>   * Batching graph-level data is a simple concatenation.
>   * When returning a batched graph and its accompanying graph-level data from a data loader, we suggest directly returning a tuple (e.g., `(bg, labels)`). This practice is standard.
>
> Overall, there is no advantage to a dedicated graph-level data API beyond syntactic convenience at the moment. Thus we reject this RFC.

graph.local_scope() does not seem to work with the approach above. What is the expected workaround for clearing the global data without such an API?
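For reference, a minimal sketch of the `(bg, labels)` data-loader practice recommended in the summary above, with a hypothetical `collate` function for a dataset that yields `(graph, label)` pairs:

    import dgl
    import torch

    def collate(samples):
        # samples: list of (DGLGraph, label-tensor) pairs from the dataset
        graphs, labels = map(list, zip(*samples))
        bg = dgl.batch(graphs)          # node/edge features are concatenated
        return bg, torch.stack(labels)  # graph-level labels stacked by the user

    # loader = torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=collate)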