🚀 Feature
The function CSRSliceRows() on the CPU is currently not parallelized (https://github.com/dmlc/dgl/blob/master/src/array/cpu/spmat_op_impl_csr.cc#L361), and as a result it makes MultiLayerFullNeighborSampler quite slow.
Motivation
It's currently faster to use a MultiLayerNeighborSampler with fanouts equal to the maximum degree in the graph (when memory is sufficient) than to use MultiLayerFullNeighborSampler, even though full neighbor sampling needs no selection or random number generation. As full neighbor sampling is quite slow to begin with, this is problematic.
When sampling on the CPU and performing to_block() on the GPU, no sampling workers can be used, and this lack of parallelism hurts performance quite a bit.
Pitch
It could be parallelized similarly to uniform sampling (https://github.com/dmlc/dgl/blob/master/src/array/cpu/rowwise_pick.h#L72), with the caveat that we would need to wait until the global_prefix is calculated (https://github.com/dmlc/dgl/blob/master/src/array/cpu/rowwise_pick.h#L147) before allocating the output arrays, in order to know the total number of edges in the subgraph.
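A rough sketch of the two-pass idea, to make the ordering constraint concrete. This is not DGL code: SimpleCSR and SliceRowsParallel are hypothetical names, plain std::vector stands in for DGL's array types, and the prefix sum is done serially for brevity rather than with the per-thread scheme used in rowwise_pick.h. It assumes OpenMP is enabled at compile time; without it the pragmas are ignored and the loops simply run serially.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal stand-in for a CSR matrix; NOT DGL's CSRMatrix type.
struct SimpleCSR {
  int64_t num_rows = 0;
  std::vector<int64_t> indptr;   // row offsets, size num_rows + 1
  std::vector<int64_t> indices;  // column ids, size nnz
  std::vector<int64_t> data;     // edge ids, size nnz
};

// Hypothetical helper: slice the given rows out of `csr` in two passes.
SimpleCSR SliceRowsParallel(const SimpleCSR& csr,
                            const std::vector<int64_t>& rows) {
  const int64_t n = static_cast<int64_t>(rows.size());
  SimpleCSR out;
  out.num_rows = n;
  out.indptr.assign(n + 1, 0);

  // Pass 1 (parallel): record the degree of each selected row.
#pragma omp parallel for
  for (int64_t i = 0; i < n; ++i) {
    out.indptr[i + 1] = csr.indptr[rows[i] + 1] - csr.indptr[rows[i]];
  }

  // Prefix sum (serial in this sketch) turns degrees into output offsets.
  for (int64_t i = 0; i < n; ++i) {
    out.indptr[i + 1] += out.indptr[i];
  }

  // Only now is the total edge count known, so allocate the outputs here.
  const int64_t nnz = out.indptr[n];
  out.indices.resize(nnz);
  out.data.resize(nnz);

  // Pass 2 (parallel): each row copies into its own disjoint output range.
#pragma omp parallel for
  for (int64_t i = 0; i < n; ++i) {
    const int64_t src = csr.indptr[rows[i]];
    const int64_t dst = out.indptr[i];
    const int64_t len = out.indptr[i + 1] - dst;
    std::copy(csr.indices.begin() + src, csr.indices.begin() + src + len,
              out.indices.begin() + dst);
    std::copy(csr.data.begin() + src, csr.data.begin() + src + len,
              out.data.begin() + dst);
  }
  return out;
}
```

Because every row writes to a range fixed by the prefix sum, the second loop needs no locking, and no selection or random number generation is involved at any point.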