Closed: Dandelion111 closed this issue 2 years ago
@Dandelion111 Thanks for the feedback. Is this a Windows machine? Does the error at the end of training occur every time?
It is a Linux machine. Whenever worker_num is set to 6, the error occurs every time. I have only tried the values 0 and 6; with 0 there is no error.
I am running the training from PyCharm.
@Dandelion111 Could you check whether /dev/shm is full?
df -h
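For reference, the same check can be scripted. The sketch below is not part of PaddleDetection: `usage_ratio` is a hypothetical helper built on the standard library's `shutil.disk_usage`, and it assumes a Linux host where /dev/shm is mounted.

```python
import os
import shutil


def usage_ratio(path="/dev/shm"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


if __name__ == "__main__":
    # /dev/shm backs the shared memory used to pass batches between processes.
    if os.path.exists("/dev/shm"):
        print("/dev/shm used: {:.1%}".format(usage_ratio("/dev/shm")))
    else:
        print("/dev/shm is not available on this host")
```

The figure may differ slightly from the Use% column of `df -h`, which accounts for reserved blocks differently.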
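Hi, it is not full; only 1% is used. I tried again: training the picodet_xs model for 200 epochs increased CPU memory usage by about 10 GB. Training the picodet_s model with worker_num set to 0 is somewhat better. I did not have this problem when I trained ppyoloe earlier. The puzzling part is that a colleague and I use the same server with identical training environments, and everything works normally on his side.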
Full error output:
Traceback (most recent call last):
File "/home/zhaohaibin/paddle/PaddleDetection-release-2.4/tools/train.py", line 177, in
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 12470
Exception in thread Thread-230: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 16942
Exception in thread Thread-218: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 15471 Exception in thread Thread-290: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 21273
Exception in thread Thread-122: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 2 workers exit unexpectedly, pids: 6043, 6045
Exception in thread Thread-2: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 6 workers exit unexpectedly, pids: 23414, 23415, 23416, 23417, 23418, 23419
Exception in thread Thread-182: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 3 workers exit unexpectedly, pids: 10672, 10673, 10675
Exception in thread Thread-482: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 4 workers exit unexpectedly, pids: 5695, 5696, 5708, 5710
Exception in thread Thread-266: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 2 workers exit unexpectedly, pids: 19512, 19515
Exception in thread Thread-446: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 2183
Exception in thread Thread-494: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 3 workers exit unexpectedly, pids: 7178, 7179, 7181
Exception in thread Thread-158: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 4 workers exit unexpectedly, pids: 8557, 8558, 8559, 8560
Exception in thread Thread-398: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 6 workers exit unexpectedly, pids: 29836, 29837, 29838, 29839, 29840, 29841
Exception in thread Thread-555: Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data data = self._data_queue.get(timeout=self._timeout) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/multiprocessing/queues.py", line 105, in get raise Empty _queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 926, in _bootstrap_inner self.run() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop batch = self._get_data() File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data "pids: {}".format(len(failed_workers), pids)) RuntimeError: DataLoader 5 workers exit unexpectedly, pids: 12648, 12649, 12650, 12651, 12652
Exception ignored in: <function _DataLoaderIterMultiProcess.__del__ at 0x7f59a5f60710>
Traceback (most recent call last):
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 675, in __del__
    self._try_shutdown_all()
  File "/home/zhaohaibin/miniconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 474, in _try_shutdown_all
    if not self._shutdown:
AttributeError: '_DataLoaderIterMultiProcess' object has no attribute '_shutdown'
INFO 2022-06-08 20:40:44,363 launch_utils.py:341] terminate all the procs
ERROR 2022-06-08 20:40:44,364 launch_utils.py:604] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-06-08 20:40:48,368 launch_utils.py:341] terminate all the procs
INFO 2022-06-08 20:40:48,369 launch.py:311] Local processes completed.
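The recurring pattern in this log (a `_queue.Empty` from a timed-out `get`, followed by a `RuntimeError` naming dead worker PIDs) can be reproduced with the standard library alone. The following is a minimal sketch of that general pattern, not Paddle's actual implementation; `get_batch`, `_crashing_worker`, and `demo` are hypothetical names, and the `fork` start method assumes a Unix-like host.

```python
import multiprocessing as mp
import os
import queue


def _crashing_worker(data_queue):
    # Simulate a worker that dies before it produces any batch.
    os._exit(1)


def get_batch(data_queue, workers, timeout=1.0):
    """Fetch one item; if none arrives in time, report any dead workers."""
    try:
        return data_queue.get(timeout=timeout)
    except queue.Empty:
        failed = [w.pid for w in workers if not w.is_alive()]
        if failed:
            pids = ", ".join(str(pid) for pid in failed)
            # Raising inside `except` chains the two exceptions, which is why
            # the log shows "During handling of the above exception ...".
            raise RuntimeError(
                "DataLoader {} workers exit unexpectedly, pids: {}".format(
                    len(failed), pids))
        raise  # all workers are alive, so this is a plain timeout


def demo():
    ctx = mp.get_context("fork")
    data_queue = ctx.Queue()
    worker = ctx.Process(target=_crashing_worker, args=(data_queue,))
    worker.start()
    worker.join()  # wait until the worker has really exited
    try:
        get_batch(data_queue, [worker], timeout=0.2)
    except RuntimeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(demo())
```

The sketch only shows why the main process sees these two exceptions together. It says nothing about why the workers died; that has to come from the workers' own error output or the system log.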
@Dandelion111 Your error appears to be OSError: [Errno 24] Too many open files. This is most likely a problem with your environment.
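If the cause is indeed Errno 24, the per-process file descriptor limit is the first thing to inspect, and it can differ between two users on the same server. Below is a minimal sketch using Python's standard `resource` module; `open_fd_count` and `raise_nofile_soft_limit` are hypothetical helper names, and the /proc path assumes Linux. The shell command `ulimit -n` reports the same soft limit.

```python
import os
import resource


def open_fd_count():
    """Count file descriptors currently open in this process."""
    # /proc/self/fd is Linux-specific; /dev/fd is a common Unix fallback.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    return len(os.listdir(fd_dir))


def raise_nofile_soft_limit():
    """Try to raise the soft RLIMIT_NOFILE to the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY or soft == hard:
        return soft, hard  # nothing sensible to raise to
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    except (ValueError, OSError):
        pass  # not permitted in this environment; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("open fds:", open_fd_count())
    print("limits before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("limits after: ", raise_nofile_soft_limit())
```

Child processes inherit the limit, so calling something like this before the data loader starts its workers would apply to them as well. Raising the limit only helps if the descriptors are legitimately needed; a steadily growing `open_fd_count()` would instead point to a leak.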
Search before asking
Describe the Bug
Setting worker_num to 6 produces the error reported in this issue. Setting worker_num to 0 avoids the error, but CPU memory keeps growing whether worker_num is 0 or 6. Any help would be appreciated.
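To put numbers on the memory growth, the resident set size of the training process can be sampled at intervals, for example once per epoch. This is a minimal sketch under the assumption of a Linux host; `rss_mib` and `growth_mib` are hypothetical helpers, not PaddleDetection functions. With worker_num greater than 0 the loader workers are separate processes, so their memory is not included in the figure for the main process.

```python
import resource
import sys


def rss_mib():
    """Resident memory of the current process in MiB.

    Reads VmRSS from /proc/self/status (Linux). If that file is unavailable,
    falls back to the peak RSS from getrusage, which never decreases.
    """
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024.0  # value is in kB
    except OSError:
        pass
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in bytes on macOS and in kilobytes on Linux.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


def growth_mib(samples):
    """Difference between the last and the first sample, in MiB."""
    if len(samples) < 2:
        return 0.0
    return samples[-1] - samples[0]


if __name__ == "__main__":
    samples = [rss_mib()]
    kept = []
    for _ in range(5):
        kept.append(bytearray(8 * 1024 * 1024))  # hold on to 8 MiB per step
        samples.append(rss_mib())
    print("RSS change over 5 steps: {:.1f} MiB".format(growth_mib(samples)))
```

A curve that rises every epoch and never flattens suggests a leak, whereas growth that levels off after the first few epochs is more typical of caches filling up.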
Environment
PaddlePaddle: 2.2.2
PaddleDetection: 2.4
Python: 3.7
CUDA: 11.4
cuDNN: 8.3
Are you willing to submit a PR?