[Performance][Sampling] Parallelize CSRSliceRows() on the CPU with multiple threads.

🚀 Feature

The function CSRSliceRows() on the CPU currently is not parallelized (https://github.com/dmlc/dgl/blob/master/src/array/cpu/spmat_op_impl_csr.cc#L361), and as a result makes MultiLayerFullNeighborSampler quite slow.

Motivation

It's currently faster to use a MultiLayerNeighborSampler with fanouts equal to the maximum degree in the graph (when memory is sufficient), than to use MultiLayerFullNeighborSampler, despite the fact that no selection or random number generation needs to be performed. As full neighbor sampling is quite slow to begin with, this is problematic.

When sampling on the CPU and performing to_block() on the GPU, no sampling workers can be used, and this lack of parallelism hurts performance quite a bit.

Pitch

It could be parallelized similar to uniform sampling https://github.com/dmlc/dgl/blob/master/src/array/cpu/rowwise_pick.h#L72, with the caveat that we would need to wait until the global_prefix is calculated (https://github.com/dmlc/dgl/blob/master/src/array/cpu/rowwise_pick.h#L147), before allocating the output arrays in order to know the total number of edges in the subgraph.

dmlc / dgl

[Performance][Sampling] Parallelize CSRSliceRows() on the CPU with multiple threads. #3400

🚀 Feature

Motivation

Pitch