dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

Add support for mask in reduce UDFs? #736

Closed lukovnikov closed 4 years ago

lukovnikov commented 5 years ago

❓ Questions and Help

It seems that during .recv(), DGL makes multiple calls to the provided reduce function, where each call processes a group of nodes whose mailboxes have the same size. Is there an option to process all of the mailboxes in one call of the reduce UDF (to maximize GPU utilization), with an additional mask argument stored in the mailbox? Or do you plan to support this (perhaps as an optional "reduce_with_mask" argument to .recv())?
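
For concreteness, here is a sketch of a reduce UDF as it is called today, and of the proposed masked variant (the mask field and single-call behaviour are hypothetical, not an existing DGL API):

# Today: the reduce UDF is called once per mailbox size (degree bucket),
# with a dense mailbox of shape (num_nodes_in_bucket, degree, dim).
def reduce_udf(nodes):
    return {'h': nodes.mailbox['m'].sum(dim=1)}

# Proposed (hypothetical): a single call over all receiving nodes, with a
# zero-padded mailbox of shape (num_nodes, max_degree, dim) plus a mask.
def masked_reduce_udf(nodes):
    m = nodes.mailbox['m']                       # (num_nodes, max_degree, dim)
    mask = nodes.mailbox['mask'].unsqueeze(-1)   # (num_nodes, max_degree, 1), hypothetical field
    return {'h': (m * mask).sum(dim=1)}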

jermainewang commented 5 years ago

Hi, I wonder what kind of reduce function you wish to implement. Would you please also check out our built-in reduce functions? They are mapped to dedicated GPU kernels and are quite fast. We actually encourage using built-ins first; UDFs are meant for quick prototyping or debugging. Let us know if the built-ins cannot cover your case.

lukovnikov commented 5 years ago

Hi, thanks for the link. I am implementing a simple attention layer: I generate the attention weights and the summary in reduce during .recv() on the query nodes (I send the key nodes' vectors along the selected attention edges). Is there a better way?
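
A sketch of that pattern as a reduce UDF (the feature names 'k', 'q', 'h' are just placeholders for illustration):

import torch.nn.functional as F

def attention_reduce(nodes):
    # nodes.mailbox['k']: (batch, deg, dim) key vectors sent along the selected edges
    # nodes.data['q']:    (batch, dim) query vector stored on each receiving node
    scores = (nodes.mailbox['k'] * nodes.data['q'].unsqueeze(1)).sum(dim=-1)   # (batch, deg)
    alpha = F.softmax(scores, dim=1)                                           # attention weights
    summary = (alpha.unsqueeze(-1) * nodes.mailbox['k']).sum(dim=1)            # (batch, dim)
    return {'h': summary}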

jermainewang commented 5 years ago

Does our Graph Attention Network example help your case?

lukovnikov commented 5 years ago

Great! Thanks for the tip, it helps! I'm still wondering, though, whether it wouldn't be faster (at the cost of some wasted compute) for custom reduce funcs to optionally support masking. In the GAT tutorial, the softmax and the summary happen in reduce, which would probably take several calls with small batches. What do you think?

yzh119 commented 5 years ago

Hi @lukovnikov, our default parallel strategy for custom reduce functions is degree bucketing: we batch nodes with the same in-degree and execute the reduce function on each bucket together. This is not satisfactory if the degree variance in the graph is large.
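
As a small illustration (using today's dgl.graph constructor rather than the 0.3-era DGLGraph API; feature names are placeholders), the reduce UDF below is invoked once per distinct in-degree:

import torch
import dgl
import dgl.function as fn

# Node 1 has in-degree 1 and node 2 has in-degree 2, so the reduce UDF runs
# twice: once with a mailbox of shape (1, 1, 4), once with shape (1, 2, 4).
g = dgl.graph(([0, 1, 3], [1, 2, 2]), num_nodes=4)
g.ndata['x'] = torch.randn(4, 4)

def reduce_udf(nodes):
    print(nodes.mailbox['m'].shape)
    return {'h': nodes.mailbox['m'].sum(dim=1)}

g.update_all(fn.copy_u('x', 'm'), reduce_udf)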

Using DGL built-in functions (introduced in DGL 0.3) would be the best solution, because they fuse kernels and select the best parallel strategy accordingly. For attention, the building blocks are basically the following:

  1. E = mask ⊙ (K Qᵀ) (unnormalized attention scores on the existing edges)
  2. A = softmax(E) (normalized over the incoming edges of each node)
  3. H = A V (weighted aggregation of the value vectors)

where mask, E, and A are sparse matrices. If you use built-in functions, for formula 1 you just need to call:

import dgl.function as fn
# elementwise product of source 'k' and destination 'q' on every existing edge
g.apply_edges(fn.u_mul_v('k', 'q', 'e'))
# reduce the product to a scalar score per edge (the dot product k_u . q_v)
g.edata['e'] = g.edata['e'].sum(dim=-1)

This only computes values where adj is nonzero. The parallel strategy is edge-parallel: the apply_edges computations on all edges run in parallel, and no computation is wasted because we only touch the positions where adj is nonzero. If you are interested in the magic behind it, please check https://github.com/dmlc/dgl/blob/master/src/kernel/cuda/binary_reduce_impl.cuh.

For formula 2, we also have a built-in edge softmax; it decomposes the softmax into several phases:

  1. First we compute the maximum value over all edges connected to each node; we call it smax.
  2. Then we subtract from each edge value the smax stored at its destination node. (This trick avoids extremely large numbers: https://jamesmccaffrey.wordpress.com/2016/03/04/the-max-trick-when-computing-softmax/)
  3. Compute the exp of each edge value.
  4. Compute the sum of the edge values connected to each node; we call it out_sum and use it for normalization.
  5. Divide each edge value by the normalization term out_sum stored at its destination node.

We have already implemented this routine; you may refer to https://github.com/dmlc/dgl/blob/master/python/dgl/nn/pytorch/softmax.py#L28. Each phase is achieved by simple message passing with built-in functions, which makes it highly parallel.
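
Purely for illustration (the linked edge softmax is the supported implementation), the five phases map onto built-in message passing roughly like this, assuming the binary built-ins fn.e_sub_v and fn.e_div_v are available in your DGL version:

import torch
import dgl.function as fn

def edge_softmax_sketch(g, logits):
    # logits: unnormalized scores per edge, shape (E, 1)
    g = g.local_var()                                          # keep mutations local to this call
    g.edata['e'] = logits
    g.update_all(fn.copy_e('e', 'm'), fn.max('m', 'smax'))     # phase 1: per-node max
    g.apply_edges(fn.e_sub_v('e', 'smax', 'e'))                # phase 2: subtract the max
    g.edata['e'] = torch.exp(g.edata['e'])                     # phase 3: exponentiate
    g.update_all(fn.copy_e('e', 'm'), fn.sum('m', 'out_sum'))  # phase 4: per-node normalizer
    g.apply_edges(fn.e_div_v('e', 'out_sum', 'e'))             # phase 5: normalize
    return g.edata['e']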

Formula 3 is just a sparse matrix multiplication, and we use cuSPARSE for the best performance.
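
In message-passing form, formula 3 can also be written with built-ins (assuming the normalized scores sit in g.edata['a'] with shape (E, 1) and the value vectors in g.ndata['v']; the names are placeholders):

import dgl.function as fn

# Weight each source node's value vector 'v' by the attention score 'a' on the
# edge, then sum the weighted messages into the destination node's 'h'.
g.update_all(fn.u_mul_e('v', 'a', 'm'), fn.sum('m', 'h'))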

In a nutshell, built-in functions help find the best parallel strategy so that users do not need to worry about performance, and we are still working on improving the backend to make full use of GPU shared memory and so on (in other words, it can be made even faster in the future). We hope the built-in functions are general enough to cover most use cases. If you find operations that are extremely hard to express with them, please let us know and we will be glad to help.