PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle (『飞桨』) core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

[Targeting 2024 Q2] Dataloader crashes after enabling `persistent_workers=True` #48964

Closed: Wong4j closed this issue 3 months ago

Wong4j commented 1 year ago

Describe the Bug

For benchmarking, I only need to train a few steps per epoch, so I add a break in the loop. For example:

train_dataloader = paddle.io.DataLoader(dataset, batch_size=16, num_workers=4, persistent_workers=True)
bench_epochs = 3
bench_steps = 10
for epoch in range(bench_epochs):
    for i, batch in enumerate(train_dataloader):
        if i > bench_steps:
            break
        do_training_process()

It works fine if I set persistent_workers=False, but after setting persistent_workers=True I get this error:

$ python test_dataloader.py 
W1209 00:24:39.929682 3970167 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.8, Runtime API Version: 11.7
W1209 00:24:39.943722 3970167 gpu_resources.cc:91] device: 0, cuDNN Version: 8.7.
Epoch 0 batch 0: loss = 2.582632303237915
Epoch 0 batch 1: loss = 2.553558588027954
Epoch 0 batch 2: loss = 2.5804834365844727
Epoch 0 batch 3: loss = 2.531757354736328
Epoch 0 batch 4: loss = 2.3217196464538574
Epoch 0 batch 5: loss = 2.3962247371673584
Epoch 0 batch 6: loss = 2.3609089851379395
Epoch 0 batch 7: loss = 2.398348808288574
Epoch 0 batch 8: loss = 2.594115734100342
Epoch 0 batch 9: loss = 2.648672342300415
Epoch 0 batch 10: loss = 2.4073853492736816
Traceback (most recent call last):
  File "test_dataloader.py", line 51, in <module>
    for i, (image, label) in enumerate(loader()):
  File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 746, in __next__
    data = _restore_batch(data, self._structure_infos.pop(0))
IndexError: pop from empty list

Here is the complete code to reproduce:

# cat test_dataloader.py 
import numpy as np

import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.io import Dataset, BatchSampler, DataLoader

BATCH_NUM = 20
BATCH_SIZE = 16
EPOCH_NUM = 100
STEPS_PER_EPOCH = 10

IMAGE_SIZE = 784
CLASS_NUM = 10

USE_GPU = False # whether use GPU to run model

# define a random dataset
class RandomDataset(Dataset):
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __getitem__(self, idx):
        image = np.random.random([IMAGE_SIZE]).astype('float32')
        label = np.random.randint(0, CLASS_NUM - 1, (1, )).astype('int64')
        return image, label

    def __len__(self):
        return self.num_samples

dataset = RandomDataset(BATCH_NUM * BATCH_SIZE)

class SimpleNet(nn.Layer):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(IMAGE_SIZE, CLASS_NUM)

    def forward(self, image, label=None):
        return self.fc(image)

simple_net = SimpleNet()
opt = paddle.optimizer.SGD(learning_rate=1e-3,
                           parameters=simple_net.parameters())

loader = DataLoader(dataset,
                    batch_size=16,
                    num_workers=4,
                    persistent_workers=True)

for e in range(EPOCH_NUM):
    for i, (image, label) in enumerate(loader()):
        # break out of the epoch early, as in the benchmark loop described above
        if i > STEPS_PER_EPOCH:
            break
        out = simple_net(image)
        loss = F.cross_entropy(out, label)
        avg_loss = paddle.mean(loss)
        avg_loss.backward()
        opt.minimize(avg_loss)
        simple_net.clear_gradients()
        print("Epoch {} batch {}: loss = {}".format(e, i, np.mean(loss.numpy())))

Additional Supplementary Information

No response

paddle-bot[bot] commented 1 year ago

Hi! We have received your issue and will arrange for technicians to answer it as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment & version details, and the error message. You can also look for an answer in the official API documentation, the FAQ, the issue history, and the AI community. Have a nice day!

Wong4j commented 1 year ago

There is a related issue that was opened last year and remains unresolved: https://github.com/PaddlePaddle/Paddle/issues/32927. That bug can be reproduced with my code by setting EPOCH_NUM = 100 and persistent_workers=False.

heavengate commented 1 year ago

persistent_workers=True is not yet very stable, and we are still improving the logic around it. Given your use case, if you only need to train a fixed number of steps, you can try setting the dataset's __len__ to steps * batch_size so that each epoch terminates on its own; breaking out of a DataLoader loop mid-epoch can currently leave resources unreleased. A sketch of that workaround is shown below.
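
A minimal sketch of that workaround, reusing RandomDataset, SimpleNet, the constants, and the optimizer from the repro above (opt.step()/opt.clear_grad() are the newer dygraph optimizer calls, as in the side note earlier):

# Size the benchmark dataset so one epoch yields exactly STEPS_PER_EPOCH
# batches; the epoch then ends on its own and the code never breaks out of a
# partially consumed, persistent-worker iterator.
bench_dataset = RandomDataset(STEPS_PER_EPOCH * BATCH_SIZE)
bench_loader = DataLoader(bench_dataset,
                          batch_size=BATCH_SIZE,
                          num_workers=4,
                          persistent_workers=True)

for e in range(EPOCH_NUM):
    for i, (image, label) in enumerate(bench_loader()):
        # no early break needed: the loader stops after STEPS_PER_EPOCH batches
        out = simple_net(image)
        avg_loss = paddle.mean(F.cross_entropy(out, label))
        avg_loss.backward()
        opt.step()
        opt.clear_grad()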

heavengate commented 1 year ago

This has been scheduled; a fix is planned within Q2.

Wong4j commented 1 year ago

This has been scheduled; a fix is planned within Q2.

Is there any update on this?

onecatcn commented 1 year ago

@heavengate said this task is now targeted for 2023 Q3

tiandou-tangdou commented 1 year ago

@heavengate said this task is now targeted for 2023 Q3

@onecatcn done?

onecatcn commented 1 year ago

@heavengate said this task is now targeted for 2023 Q3

@onecatcn done?

not yet

onecatcn commented 1 year ago

@xysheng-baidu will investigate the issue

HydrogenSulfate commented 10 months ago

Same here

xysheng-baidu commented 9 months ago

Same here

The problem has not been solved yet; we will deal with it as soon as possible.

Wong4j commented 3 months ago

Thanks for the fix; I tested the develop branch locally and it works fine.