dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

DGL DataLoader does not maintain example order with shuffle=False when using multiple workers. #7695

Open mr-mateusz opened 2 months ago

mr-mateusz commented 2 months ago

🐛 Bug

The DGL DataLoader does not maintain the order of examples with shuffle=False when num_workers > 1 and batch_size * num_workers < dataset_size (Example 1 below, where batch_size * num_workers equals the dataset size, happens to come out in order). Elements within a single batch are in order, but the batches themselves are not emitted in the order of the input indices. This behavior is inconsistent with the expected operation of a DataLoader when shuffle=False.
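
The output below is consistent with the seed indices being split into one contiguous chunk per worker and the workers' batches then being interleaved. That is only an inference from the printed node IDs, not a confirmed reading of DGL's internals, but the following minimal sketch reproduces the "4 workers, batch size 512" batch order exactly:

import torch

# Assumed mechanism (inferred from the output, not from DGL source): split the
# seed IDs into one contiguous chunk per worker, then interleave the workers'
# batches round-robin.
nids = torch.arange(1000, 1000 + 4096)        # the test node IDs from the repro
num_workers, batch_size = 4, 512

per_worker = nids.chunk(num_workers)                            # 4 chunks of 1024 IDs
per_worker_batches = [c.split(batch_size) for c in per_worker]  # 2 batches per worker
interleaved = [b for step in zip(*per_worker_batches) for b in step]

for b in interleaved:
    print(b[:5], b[-5:])   # matches the Example 2 output below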

To Reproduce

Code:

import dgl
import torch

num_layers = 4

# Example dataset
random_embeddings = torch.randn(10000, 128)
target_values = torch.rand(10000)

# Indices of test nodes
test_start_index = 1000
test_size = 4096
test_mask = torch.zeros(10000, dtype=torch.bool)
test_mask[test_start_index:test_start_index + test_size] = True

# Create graph
dgl_graph = dgl.knn_graph(random_embeddings, 10, exclude_self=True)

dgl_graph.ndata['features'] = random_embeddings
dgl_graph.ndata['target'] = target_values

# Indices of 'test' elements
_nids = torch.where(test_mask)[0]

print("Number of rows:", dgl_graph.ndata['target'][test_mask].shape)

# Example 1
_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(num_layers)
_loader = dgl.dataloading.DataLoader(dgl_graph, _nids, _sampler, batch_size=1024, shuffle=False, drop_last=False,
                                     num_workers=4)

print('Example 1. 4 workers, batch size 1024. -> 4 * 1024 = 4096 (equal to test_size)')

_targets_iterated = []
for in_nodes, out_nodes, blocks in _loader:
    print(out_nodes[:5], out_nodes[-5:])
    _targets_iterated.append(blocks[-1].dstdata['target'])

_targets_iterated = torch.cat(_targets_iterated)

print(torch.equal(dgl_graph.ndata['target'][test_mask], _targets_iterated))

print('---')

# Example 2

print('Example 2. 4 workers, batch size 512. -> 4 * 512 = 2048 (less than test_size)')

_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(num_layers)
_loader = dgl.dataloading.DataLoader(dgl_graph, _nids, _sampler, batch_size=512, shuffle=False, drop_last=False,
                                     num_workers=4)

_targets_iterated = []
for in_nodes, out_nodes, blocks in _loader:
    print(out_nodes[:5], out_nodes[-5:])
    _targets_iterated.append(blocks[-1].dstdata['target'])

_targets_iterated = torch.cat(_targets_iterated)

print(torch.equal(dgl_graph.ndata['target'][test_mask], _targets_iterated))

print('---')

# Example 3

print('Example 3. 1 worker, batch size 512')

_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(num_layers)
_loader = dgl.dataloading.DataLoader(dgl_graph, _nids, _sampler, batch_size=512, shuffle=False, drop_last=False,
                                     num_workers=1)

_targets_iterated = []
for in_nodes, out_nodes, blocks in _loader:
    print(out_nodes[:5], out_nodes[-5:])
    _targets_iterated.append(blocks[-1].dstdata['target'])

_targets_iterated = torch.cat(_targets_iterated)

print(torch.equal(dgl_graph.ndata['target'][test_mask], _targets_iterated))

# Torch Dataloader

print('---')
print('pytorch')

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

dataset = MyDataset(torch.where(test_mask)[0])

dataloader = torch.utils.data.DataLoader(dataset, batch_size=512, shuffle=False, num_workers=4)

for batch in dataloader:
    print(batch[:5], batch[-5:])

Output:

Number of rows: torch.Size([4096])
Example 1. 4 workers, batch size 1024. -> 4 * 1024 = 4096 (equal to test_size)
tensor([1000, 1001, 1002, 1003, 1004]) tensor([2019, 2020, 2021, 2022, 2023])
tensor([2024, 2025, 2026, 2027, 2028]) tensor([3043, 3044, 3045, 3046, 3047])
tensor([3048, 3049, 3050, 3051, 3052]) tensor([4067, 4068, 4069, 4070, 4071])
tensor([4072, 4073, 4074, 4075, 4076]) tensor([5091, 5092, 5093, 5094, 5095])
True
---
Example 2. 4 workers, batch size 512. -> 4 * 512 = 2048 (less than test_size)    <= here elements are not in order
tensor([1000, 1001, 1002, 1003, 1004]) tensor([1507, 1508, 1509, 1510, 1511])
tensor([2024, 2025, 2026, 2027, 2028]) tensor([2531, 2532, 2533, 2534, 2535])
tensor([3048, 3049, 3050, 3051, 3052]) tensor([3555, 3556, 3557, 3558, 3559])
tensor([4072, 4073, 4074, 4075, 4076]) tensor([4579, 4580, 4581, 4582, 4583])
tensor([1512, 1513, 1514, 1515, 1516]) tensor([2019, 2020, 2021, 2022, 2023])
tensor([2536, 2537, 2538, 2539, 2540]) tensor([3043, 3044, 3045, 3046, 3047])
tensor([3560, 3561, 3562, 3563, 3564]) tensor([4067, 4068, 4069, 4070, 4071])
tensor([4584, 4585, 4586, 4587, 4588]) tensor([5091, 5092, 5093, 5094, 5095])
False
---
Example 3. 1 worker, batch size 512
tensor([1000, 1001, 1002, 1003, 1004]) tensor([1507, 1508, 1509, 1510, 1511])
tensor([1512, 1513, 1514, 1515, 1516]) tensor([2019, 2020, 2021, 2022, 2023])
tensor([2024, 2025, 2026, 2027, 2028]) tensor([2531, 2532, 2533, 2534, 2535])
tensor([2536, 2537, 2538, 2539, 2540]) tensor([3043, 3044, 3045, 3046, 3047])
tensor([3048, 3049, 3050, 3051, 3052]) tensor([3555, 3556, 3557, 3558, 3559])
tensor([3560, 3561, 3562, 3563, 3564]) tensor([4067, 4068, 4069, 4070, 4071])
tensor([4072, 4073, 4074, 4075, 4076]) tensor([4579, 4580, 4581, 4582, 4583])
tensor([4584, 4585, 4586, 4587, 4588]) tensor([5091, 5092, 5093, 5094, 5095])
True
---
pytorch
tensor([1000, 1001, 1002, 1003, 1004]) tensor([1507, 1508, 1509, 1510, 1511])
tensor([1512, 1513, 1514, 1515, 1516]) tensor([2019, 2020, 2021, 2022, 2023])
tensor([2024, 2025, 2026, 2027, 2028]) tensor([2531, 2532, 2533, 2534, 2535])
tensor([2536, 2537, 2538, 2539, 2540]) tensor([3043, 3044, 3045, 3046, 3047])
tensor([3048, 3049, 3050, 3051, 3052]) tensor([3555, 3556, 3557, 3558, 3559])
tensor([3560, 3561, 3562, 3563, 3564]) tensor([4067, 4068, 4069, 4070, 4071])
tensor([4072, 4073, 4074, 4075, 4076]) tensor([4579, 4580, 4581, 4582, 4583])
tensor([4584, 4585, 4586, 4587, 4588]) tensor([5091, 5092, 5093, 5094, 5095])

Expected behavior

The DataLoader should produce data in the same order as the input indices when shuffle=False, regardless of the number of workers or batch size.

Environment

Additional context
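
A possible workaround, assuming the seed indices are sorted ascending as they are in the repro above, is to collect out_nodes alongside the results and restore the order afterwards (untested sketch reusing _loader, dgl_graph and test_mask from the code above):

import torch

_targets_iterated = []
_out_nodes_iterated = []
for in_nodes, out_nodes, blocks in _loader:
    _out_nodes_iterated.append(out_nodes)
    _targets_iterated.append(blocks[-1].dstdata['target'])

_targets_iterated = torch.cat(_targets_iterated)
_out_nodes_iterated = torch.cat(_out_nodes_iterated)

# Sorting the collected node IDs recovers the original (ascending) seed order.
order = torch.argsort(_out_nodes_iterated)
print(torch.equal(dgl_graph.ndata['target'][test_mask], _targets_iterated[order]))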

frozenbugs commented 2 months ago

Hi @mr-mateusz, can you try GraphBolt (https://docs.dgl.ai/stochastic_training/index.html)? It is our latest SOTA GNN dataloader. The DGL dataloader is unmaintained now and will be deprecated in the future.

If you observe the same issue, please reach out.
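
For reference, a rough GraphBolt counterpart of the repro's loader could look like the sketch below. The names follow the linked stochastic-training docs, but the exact API (e.g. whether ItemSet takes names="seeds" or names="seed_nodes", and the fanout format) has shifted across DGL 2.x releases, so every identifier here should be treated as an assumption rather than tested code:

import dgl.graphbolt as gb

# All names below are assumptions based on the GraphBolt docs, not verified code.
gb_graph = gb.from_dglgraph(dgl_graph)                  # convert the DGLGraph
item_set = gb.ItemSet(_nids, names="seeds")             # "seed_nodes" in older releases
datapipe = gb.ItemSampler(item_set, batch_size=512, shuffle=False)
datapipe = datapipe.sample_neighbor(gb_graph, [-1] * num_layers)  # -1 = all neighbors
dataloader = gb.DataLoader(datapipe, num_workers=4)

for minibatch in dataloader:
    print(minibatch.seeds[:5])                          # check ordering across batches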