Closed: lukovnikov closed this issue 4 years ago
Hi, I wonder what kind of reduce function you wish to implement? Would you please also check out our built-in reduce functions? They are mapped to dedicated GPU kernels and are pretty fast. We encourage using the built-ins first, while UDFs are for quick prototyping or debugging. Let us know if the built-ins cannot cover your case.
Hi, thanks for the link. I am implementing a simple attention layer: I generate the attention weights and the summary in the reduce function during .recv() on the query nodes (I send the vectors of the key nodes along selected attention edges). Is there a better way?
Does our Graph Attention Network example help your case?
Great! Thanks for the tip, it helps! Although I'm still wondering whether it wouldn't be faster (at the cost of some wasted compute) for custom reduce functions to optionally support masking. In the GAT tutorial, the softmax and the summary happen in the reduce function, which probably takes several calls with small batches. What do you think?
Hi @lukovnikov , our default parallel strategy for custom reduce functions is degree bucketing, which means we batch nodes with the same in-degree and execute the reduce function once per bucket. This is not satisfactory when the degree variance in the graph is large.
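To make the bucketing concrete, here is a minimal sketch (assuming the current `dgl.graph`/`fn.copy_u` API and a hypothetical node feature `'h'`); since all mailboxes within one UDF call must share the same shape, DGL has to invoke the UDF once per in-degree bucket:

```python
import torch
import dgl
import dgl.function as fn

def my_reduce(nodes):
    # nodes.mailbox['m'] has shape (nodes_in_bucket, in_degree, feat_dim),
    # so this UDF runs once for every group of nodes sharing the same in-degree.
    return {'h_new': nodes.mailbox['m'].sum(dim=1)}

# toy graph: node 0 has in-degree 1 and node 2 has in-degree 2,
# so my_reduce is invoked twice (once per degree bucket)
g = dgl.graph(([1, 0, 1], [0, 2, 2]))
g.ndata['h'] = torch.randn(3, 4)
g.update_all(fn.copy_u('h', 'm'), my_reduce)
```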
Using DGL builtin functions (introduced in DGL 0.3) would be the best solution because they fuse kernels and select the best parallel strategy accordingly. For attention, the building blocks are basically the following:

1. `E = mask ⊙ (Q K^T)` (un-normalized attention scores on the edges)
2. `A = softmax(E)` (normalization per destination node)
3. `O = A V` (aggregation of the values with the attention weights)

where mask, A, E are sparse matrices. If you use builtin functions, for formula 1 you just need to call:
```python
import dgl.function as fn

# multiply the source feature 'k' with the destination feature 'q' elementwise, store on edges as 'e'
g.apply_edges(fn.u_mul_v('k', 'q', 'e'))
g.edata['e'] = g.edata['e'].sum(dim=-1)  # reduce over the feature dim to get one score per edge
```
It only computes values where the adjacency matrix is non-zero. The parallel strategy is edge-parallel: the maximum number of apply_edges computations are executed in parallel and no computation is wasted (we only compute at positions where the adjacency matrix is non-zero). If you are interested in the magic behind it, please check https://github.com/dmlc/dgl/blob/master/src/kernel/cuda/binary_reduce_impl.cuh.
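As a side note, in DGL versions that provide `fn.u_dot_v` (an assumption about the available builtins, not something stated in this thread), the same score can be computed in a single builtin call:

```python
import dgl.function as fn

# dot product of source 'k' and destination 'q', giving a score of shape (num_edges, 1)
g.apply_edges(fn.u_dot_v('k', 'q', 'e'))
```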
For formula 2, we also have builtin softmax functions. The edge softmax is decomposed into several phases:

1. For each destination node, take the max of its incoming edge scores, `smax`, and store it on that node. (The aim of this trick is to avoid extremely large numbers: https://jamesmccaffrey.wordpress.com/2016/03/04/the-max-trick-when-computing-softmax/)
2. For each edge, subtract the `smax` of its destination node and exponentiate; then sum the exponentiated scores per destination node into `out_sum`, used for normalization and stored on the destination node.
3. For each edge, divide its exponentiated score by the `out_sum` of its destination node.

We have implemented this routine, and you may refer to https://github.com/dmlc/dgl/blob/master/python/dgl/nn/pytorch/softmax.py#L28. Each phase can be achieved by simple message passing with builtin functions, which is made highly parallel.
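For reference, a minimal sketch of those phases with builtins (assuming the edge scores already sit in `g.edata['e']` and a `dgl.function` version that provides `copy_e`, `e_sub_v`, and `e_div_v`):

```python
import torch
import dgl.function as fn

# phase 1: per-destination max of the incoming edge scores (the max trick)
g.update_all(fn.copy_e('e', 'm'), fn.max('m', 'smax'))
# phase 2: shift by the max, exponentiate, and sum per destination node
g.apply_edges(fn.e_sub_v('e', 'smax', 'shifted'))
g.edata['exp'] = torch.exp(g.edata.pop('shifted'))
g.update_all(fn.copy_e('exp', 'm'), fn.sum('m', 'out_sum'))
# phase 3: normalize each edge score by the sum stored on its destination node
g.apply_edges(fn.e_div_v('exp', 'out_sum', 'a'))
```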
Formula 3 is just a sparse matrix multiplication operation, and we use cusparse for best performance.
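In message-passing form this is just a weighted sum of neighbor values; a hedged sketch, assuming the attention weights `'a'` from above have shape `(num_edges, 1)` and the value features are stored as `'v'` on the nodes:

```python
import dgl.function as fn

# weight each source node's value 'v' by the attention coefficient 'a' on its edge,
# then sum the weighted messages at the destination; DGL lowers this pattern to a
# sparse matrix multiplication (cuSPARSE on GPU)
g.update_all(fn.u_mul_e('v', 'a', 'm'), fn.sum('m', 'h'))
```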
In a nutshell, builtin functions help you find the best parallel strategy so that users do not need to worry about performance, and we are still working on improving the backend to make full use of GPU shared memory and so on (in other words, it could be made even faster in the future). We hope the builtin functions are general enough to cover most use cases. If you find operations that are extremely hard to express, please inform us and we will be glad to help.
❓ Questions and Help
It seems DGL makes multiple calls to the provided reduce function during .recv(), where each call processes a group of nodes whose mailboxes have the same size. Is there an option to process all of the mailboxes in a single call of the reduce UDF (to maximize GPU utilization), with an additional mask argument stored on the mailboxes; or do you plan to support this (perhaps as an optional "reduce_with_mask" argument to .recv())?
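To illustrate what is being asked for (purely hypothetical, not an existing DGL API): mailboxes of all nodes would be zero-padded to the maximum in-degree and handed to the UDF together with a boolean mask, roughly:

```python
import torch

def reduce_with_mask(mailbox, mask):
    # mailbox: (num_nodes, max_in_degree, feat_dim), zero-padded
    # mask:    (num_nodes, max_in_degree), True where a message is real
    mailbox = mailbox.masked_fill(~mask.unsqueeze(-1), 0.0)
    return mailbox.sum(dim=1)  # one batched reduction over all nodes at once
```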