PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

Bug appears after training finishes #6152

Closed Dandelion111 closed 2 years ago

Dandelion111 commented 2 years ago

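Search before asking

Describe the Bug

Setting worker_num to 6 produces the error below (image: 41b70bf3dd79dd731b5b8fd2db1a93d). Setting worker_num to 0 does not produce this error. However, whether worker_num is 0 or 6, CPU memory usage keeps growing. Could anyone help explain this?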

Environment

PaddlePaddle: 2.2.2
PaddleDetection: 2.4
Python: 3.7
CUDA: 11.4
cuDNN: 8.3

Are you willing to submit a PR?

yghstill commented 2 years ago

@Dandelion111 Thanks for the feedback. Is this a Windows machine? Does the error at the end occur every time?

Dandelion111 commented 2 years ago

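> @Dandelion111 Thanks for the feedback. Is this a Windows machine? Does the error at the end occur every time?

It is a Linux machine. With worker_num set to 6 the error occurs every time. I have only tried the values 0 and 6; with 0 there is no error.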

Dandelion111 commented 2 years ago

> @Dandelion111 Thanks for the feedback. Is this a Windows machine? Does the error at the end occur every time?

I run the training from PyCharm.

yghstill commented 2 years ago

@Dandelion111 Could you check whether /dev/shm is full?

df -h
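
For anyone who prefers to run the same check from inside the training environment, here is a minimal sketch using only the Python standard library. The helper name `usage_percent` is made up for illustration and is not part of PaddlePaddle or PaddleDetection.

```python
import shutil


def usage_percent(path="/dev/shm"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Avoid dividing by zero on pseudo-filesystems that report no size.
        return 0.0
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    print(f"/dev/shm is {usage_percent():.1f}% full")
```

Running it a few times during training shows whether shared memory fills up over time, which a single `df -h` call cannot show.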
Dandelion111 commented 2 years ago

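> @Dandelion111 Could you check whether /dev/shm is full?
>
> df -h

Hi, it is not full; only 1% is used. I tried again: training the picodet_xs model for 200 epochs increased CPU memory usage by about 10 GB. Training the picodet_s model with worker_num set to 0 is somewhat better, and I did not see this problem when training ppyoloe earlier. The odd part is that my colleague uses the same server with the same training environment, and everything works fine for him.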
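
To put numbers on the memory growth described above, one option is to log the resident set size of the training process once per epoch. The following is a rough sketch, assuming Linux with `/proc` available; `parse_vmrss_kb` and `current_rss_kb` are hypothetical helper names, not PaddleDetection functions.

```python
def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB".
            return int(line.split()[1])
    return None  # kernel threads and some sandboxes have no VmRSS line


def current_rss_kb(pid="self"):
    """Read the resident set size of a process from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmrss_kb(status_file.read())


if __name__ == "__main__":
    print("RSS (kB):", current_rss_kb())
```

Logging this value for the main process and for each DataLoader worker pid, with worker_num set to 0 and then to 6, would show whether the growth happens in the trainer or in the workers.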

Dandelion111 commented 2 years ago

> @Dandelion111 Could you check whether /dev/shm is full?
>
> df -h

The full error:

Traceback (most recent call last):
  File "/home/zhaohaibin/paddle/PaddleDetection-release-2.4/tools/train.py", line 177, in <module>
    main()
  File "/home/zhaohaibin/paddle/PaddleDetection-release-2.4/tools/train.py", line 173, in main
    run(FLAGS, cfg)
  File "/home/zhaohaibin/paddle/PaddleDetection-release-2.4/tools/train.py", line 127, in run
    trainer.train(FLAGS.eval)
  File "/home/zhaohaibin/paddle/PaddleDetection-release-2.4/ppdet/engine/trainer.py", line 485, in train
    self._eval_with_loader(self._eval_loader)
  File "/home/zhaohaibin/paddle/PaddleDetection-release-2.4/ppdet/engine/trainer.py", line 502, in _eval_with_loader
    self.dataset, self.cfg.worker_num, self._eval_batch_sampler)
  File "/home/zhaohaibin/paddle/PaddleDetection-release-2.4/ppdet/data/reader.py", line 197, in __call__
    self.loader = iter(self.dataloader)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/reader.py", line 434, in __iter__
    return _DataLoaderIterMultiProcess(self)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 355, in __init__
    self._init_workers()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 378, in _init_workers
    indices_queue = multiprocessing.Queue()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/context.py", line 102, in Queue
    return Queue(maxsize, ctx=self.get_context())
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 41, in __init__
    self._reader, self._writer = connection.Pipe(duplex=False)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/connection.py", line 517, in Pipe
    fd1, fd2 = os.pipe()
OSError: [Errno 24] Too many open files

Exception in thread Thread-554:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 12470

Exception in thread Thread-230:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 16942

Exception in thread Thread-218:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 15471

Exception in thread Thread-290:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 21273

Exception in thread Thread-122:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 2 workers exit unexpectedly, pids: 6043, 6045

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 6 workers exit unexpectedly, pids: 23414, 23415, 23416, 23417, 23418, 23419

Exception in thread Thread-182:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 3 workers exit unexpectedly, pids: 10672, 10673, 10675

Exception in thread Thread-482:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 4 workers exit unexpectedly, pids: 5695, 5696, 5708, 5710

Exception in thread Thread-266:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 2 workers exit unexpectedly, pids: 19512, 19515

Exception in thread Thread-446:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 2183

Exception in thread Thread-494:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 3 workers exit unexpectedly, pids: 7178, 7179, 7181

Exception in thread Thread-158:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 4 workers exit unexpectedly, pids: 8557, 8558, 8559, 8560

Exception in thread Thread-398:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 6 workers exit unexpectedly, pids: 29836, 29837, 29838, 29839, 29840, 29841

Exception in thread Thread-555:
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 5 workers exit unexpectedly, pids: 12648, 12649, 12650, 12651, 12652

Exception ignored in: <function _DataLoaderIterMultiProcess.__del__ at 0x7f59a5f60710>
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 675, in __del__
    self._try_shutdown_all()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 474, in _try_shutdown_all
    if not self._shutdown:
AttributeError: '_DataLoaderIterMultiProcess' object has no attribute '_shutdown'
INFO 2022-06-08 20:40:44,363 launch_utils.py:341] terminate all the procs
ERROR 2022-06-08 20:40:44,364 launch_utils.py:604] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-06-08 20:40:48,368 launch_utils.py:341] terminate all the procs
INFO 2022-06-08 20:40:48,369 launch.py:311] Local processes completed.
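
The final AttributeError about `_shutdown` is probably a follow-on symptom rather than a separate problem. In Python, the finalizer still runs on an object whose constructor raised, so any attribute the constructor had not yet assigned is missing when the finalizer reads it. The sketch below reproduces that pattern with made-up classes (`FragileIter` and `SafeIter` are illustrative only and are not the Paddle implementation). It uses `sys.unraisablehook`, which needs Python 3.8 or later, newer than the 3.7 in this report.

```python
import gc
import sys


class FragileIter:
    """Stand-in for an iterator whose constructor fails before it is fully built."""

    def __init__(self):
        # Simulate resource setup failing before self._shutdown is assigned.
        raise OSError(24, "Too many open files")

    def __del__(self):
        # Still called for the half-built object, so _shutdown does not exist.
        if not self._shutdown:
            pass


class SafeIter(FragileIter):
    def __del__(self):
        # Treat a half-built object as already shut down.
        if not getattr(self, "_shutdown", True):
            pass


def unraisable_errors(cls):
    """Construct cls, swallow the OSError, and report what the finalizer raised."""
    seen = []
    previous_hook = sys.unraisablehook
    sys.unraisablehook = lambda info: seen.append(type(info.exc_value).__name__)
    try:
        try:
            cls()
        except OSError:
            pass
        gc.collect()  # make sure the failed object has been finalized
    finally:
        sys.unraisablehook = previous_hook
    return seen


if __name__ == "__main__":
    print(unraisable_errors(FragileIter))
    print(unraisable_errors(SafeIter))
```

If this reading is right, the error worth chasing is the OSError at the top of the log, and the AttributeError should disappear once the constructor stops failing.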

yghstill commented 2 years ago

@Dandelion111 From your log, the error is OSError: [Errno 24] Too many open files. This is most likely an environment issue.
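
A rough way to check the environment for this error is to compare the number of file descriptors the process holds against its `RLIMIT_NOFILE` limit, using only the Python standard library. The helper names `count_open_fds` and `raise_nofile_soft_limit` are hypothetical and the target of 4096 is an arbitrary assumption. The sketch assumes Linux, since it reads `/proc/self/fd`.

```python
import os
import resource


def count_open_fds():
    """Number of file descriptors currently open in this process (Linux only)."""
    return len(os.listdir("/proc/self/fd"))


def raise_nofile_soft_limit(target=4096):
    """Raise the soft RLIMIT_NOFILE towards `target`, never above the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if wanted > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("open fds:", count_open_fds())
    print("limits (soft, hard):", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("soft limit now:", raise_nofile_soft_limit())
```

The shell equivalent of reading the soft limit is `ulimit -n`. Two users on the same server can end up with different limits depending on how their session or IDE launches the process, which might explain why a colleague does not see the error. If `count_open_fds()` keeps climbing from one epoch to the next, a higher limit only postpones the failure, and the descriptors left open by each new DataLoader iterator are the thing to investigate.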