dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0
13.55k stars 3.01k forks source link

[Performance][Sampling] Parallelize CSRSliceRows() on the CPU with multiple threads. #3400

Open nv-dlasalle opened 3 years ago

nv-dlasalle commented 3 years ago

🚀 Feature

The function CSRSliceRows() on the CPU currently is not parallelized (https://github.com/dmlc/dgl/blob/master/src/array/cpu/spmat_op_impl_csr.cc#L361), and as a result makes MultiLayerFullNeighborSampler quite slow.

Motivation

It's currently faster to use a MultiLayerNeighborSampler with fanouts equal to the maximum degree in the graph (when memory is sufficient), than to use MultiLayerFullNeighborSampler, despite the fact that no selection or random number generation needs to be performed. As full neighbor sampling is quite slow to begin with, this is problematic.

When sampling on the CPU and performing to_block() on the GPU, no sampling workers can be used, and this lack of parallelism hurts performance quite a bit.

Pitch

It could be parallelized similar to uniform sampling https://github.com/dmlc/dgl/blob/master/src/array/cpu/rowwise_pick.h#L72, with the caveat that we would need to wait until the global_prefix is calculated (https://github.com/dmlc/dgl/blob/master/src/array/cpu/rowwise_pick.h#L147), before allocating the output arrays in order to know the total number of edges in the subgraph.

davidmin7 commented 3 years ago

Hi @nv-dlasalle, I have a parallel version of this function. I used for some very large scale in_edges(), so please try out my implemention #3409.