Fused Neigbhor Sampling reproducibility issue

Setting the seed and repeating the fused neighborhood sampling for a source code does not reproduce the same subgraph, have identified a fix that will be slower but allow reproducible subgraphs

To Reproduce

Steps to reproduce the behavior:

Set OMP_NUM_THREADS to >= 2
Define NeighborSampler with fused=True
Set dgl.seed, dgl.random.seed before calling dgl.dataloading.NeighborSampler.sample_blocks
Repeat step 3. and compare blocks[0].srcdata['feat']

The following change allowed for the above to have the same output

diff --git a/src/graph/sampling/neighbor/neighbor.cc b/src/graph/sampling/neighbor/neighbor.cc
index 5393a200..471600dd 100644
--- a/src/graph/sampling/neighbor/neighbor.cc
+++ b/src/graph/sampling/neighbor/neighbor.cc
@@ -14,6 +14,7 @@

 #include <tuple>
 #include <utility>
+#include <execution>

 #include "../../../array/cpu/concurrent_id_hash_map.h"
 #include "../../../c_api_common.h"
@@ -488,6 +489,7 @@ SampleNeighborsFused(
         }
       }
       IdType offset = new_nodes_vec[lhs_node_type].size();
+      auto length = offset;
       new_nodes_vec[lhs_node_type].resize(global_prefix_col.back());
       for (int thread_id = 0; thread_id < num_threads_col; ++thread_id) {
         memcpy(
@@ -496,6 +498,7 @@ SampleNeighborsFused(
             src_nodes_local[thread_id].size() * sizeof(IdType));
         offset += src_nodes_local[thread_id].size();
       }
+    std::sort(std::execution::par_unseq, new_nodes_vec[lhs_node_type].begin() + length, new_nodes_vec[lhs_node_type].end());
     }
   }

This results is the sampling being slower but reproducible, would there be any alternative for multithreaded fused sampling being reproducible?

Environment

DGL Version (e.g., 1.0): 2.4
Backend Library & Version : Torch 2.3.1
OS (e.g., Linux): Linux
How you installed DGL (conda, pip, source): source
Build command you used (if compiling from source): cmake -DBUILD_TYPE='release' -DUSE_LIBXSMM=ON -DUSE_OPENMP=ON -DBUILD_SPARSE=OFF -DUSE_EPOLL=OFF ..
Python version: 3.9
CPU: Intel Xeon

dmlc / dgl

Fused Neigbhor Sampling reproducibility issue #7831

To Reproduce

Environment