PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of PaddlePaddle: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Error when running the training code #52213

Closed 0721fang closed 6 months ago

0721fang commented 1 year ago

Describe the Bug

Running "python3 -m paddle.distributed.launch --log_dir=log_dir/ppyolo --gpus 0,1 tools/train.py -c configs/ppyolo/ppyolov2_r50vd_dcn_365e_coco.yml --eval --amp" fails with the following error:

LAUNCH INFO 2023-03-28 10:03:02,633 ------------------------- ERROR LOG DETAIL -------------------------
[2023-03-28 10:03:02,633] [ INFO] controller.py:111 - ------------------------- ERROR LOG DETAIL -------------------------
nda3/envs/paddle/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 358, in _worker_loop
    tensor_list = [
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 359, in <listcomp>
    core._array_to_share_memory_tensor(b)
ValueError: (InvalidArgument) Input object type error or incompatible array data type. tensor.set() supports array with bool, float16, float32, float64, int8, int16, int32, int64, uint8 or uint16, please check your input or input array data type. (at /paddle/paddle/fluid/pybind/tensor_py.h:549)

Traceback (most recent call last):
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/tools/train.py", line 172, in <module>
    main()
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/tools/train.py", line 129, in run
    trainer.load_weights(cfg.pretrain_weights)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/engine/trainer.py", line 373, in load_weights
    load_pretrain_weight(self.model, weights)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/utils/checkpoint.py", line 211, in load_pretrain_weight
    model_dict = model.state_dict()
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 1580, in state_dict
    return self._state_dict_impl(
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 1496, in _state_dict_impl
    layer_item._state_dict_impl(
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 1496, in _state_dict_impl
    layer_item._state_dict_impl(
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 1496, in _state_dict_impl
    layer_item._state_dict_impl(
  [Previous line repeated 1 more time]
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 1494, in _state_dict_impl
    destination_temp = destination.copy()
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/multiprocess_utils.py", line 135, in handler
    core._throw_error_if_process_failed()
SystemError: (Fatal) DataLoader process (pid 111075) exited unexpectedly with code 1. Error detailed are lost due to multiprocessing. Rerunning with:

  1. If run DataLoader by DataLoader.from_generator(...), run with DataLoader.from_generator(..., use_multiprocess=False) may give better error trace.
  2. If run DataLoader by DataLoader(dataset, ...), run with DataLoader(dataset, ..., num_workers=0) may give better error trace (at /paddle/paddle/fluid/imperative/data_loader.cc:161)

LAUNCH INFO 2023-03-28 10:03:02,834 Exit code 1
[2023-03-28 10:03:02,834] [ INFO] controller.py:141 - Exit code 1

Additional Supplementary Information

No response

zhangbo9674 commented 1 year ago

Hi, have you tried to resolve this following the hints in the log? "If run DataLoader by DataLoader.from_generator(...), run with DataLoader.from_generator(..., use_multiprocess=False) may give better error trace. If run DataLoader by DataLoader(dataset, ...), run with DataLoader(dataset, ..., num_workers=0) may give better error trace (at /paddle/paddle/fluid/imperative/data_loader.cc:161)"

0721fang commented 1 year ago

Which .py file should I look at and modify to do this?

0721fang commented 1 year ago

I don't understand this error message. I just ran it following the official website and only changed the batch_size and the training-set path.

zhangbo9674 commented 1 year ago

Hi, since the bug description only contains a launch command, I cannot tell which Python file is involved. Could you provide more details about the project, for example which example from the official site you are running? Also, following the error hint, you can search for paddle.fluid.io.DataLoader.from_generator or paddle.io.DataLoader, set use_multiprocess=False or num_workers=0 respectively, and then run again.
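
For reference, here is a minimal, self-contained sketch of the switch that hint refers to. It is not the PaddleDetection reader itself; ToyDataset below is made up for illustration. With num_workers=0 the samples and batch transforms run in the main process, so the real exception is raised with its original traceback instead of the generic "DataLoader process exited unexpectedly" message.

```python
import numpy as np
from paddle.io import DataLoader, Dataset


class ToyDataset(Dataset):
    """Stand-in dataset; replace with the dataset that actually fails."""

    def __getitem__(self, idx):
        image = np.random.rand(3, 32, 32).astype('float32')
        label = np.array([idx], dtype='int64')
        return image, label

    def __len__(self):
        return 8


# num_workers=0 disables the worker subprocesses, so any error raised while
# building a batch surfaces directly in the main process.
loader = DataLoader(ToyDataset(), batch_size=4, num_workers=0)
for images, labels in loader:
    print(images.shape, labels.shape)
```

In PaddleDetection itself the worker count is typically taken from a worker_num field in the reader YAML configs rather than from a hand-written DataLoader call, so setting that value to 0 should have the same effect; the exact key name and file may differ between versions.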

0721fang commented 1 year ago

(paddle) [root@localhost PaddleDetection-release-2.5]# python3 -m paddle.distributed.launch --log_dir=log_dir/ppyolo --gpus 0,1,2,3 tools/train.py -c configs/ppyolo/ppyolov2_r50vd_dcn_365e_coco.yml --eval --amp
LAUNCH INFO 2023-03-28 10:58:27,076 ----------- Configuration ----------------------
[2023-03-28 10:58:27,076] [ INFO] init.py:45 - ----------- Configuration ----------------------
LAUNCH INFO 2023-03-28 10:58:27,076 devices: 0,1,2,3
[2023-03-28 10:58:27,076] [ INFO] init.py:47 - devices: 0,1,2,3
LAUNCH INFO 2023-03-28 10:58:27,076 elastic_level: -1
[2023-03-28 10:58:27,076] [ INFO] init.py:47 - elastic_level: -1
LAUNCH INFO 2023-03-28 10:58:27,077 elastic_timeout: 30
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - elastic_timeout: 30
LAUNCH INFO 2023-03-28 10:58:27,077 gloo_port: 6767
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - gloo_port: 6767
LAUNCH INFO 2023-03-28 10:58:27,077 host: None
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - host: None
LAUNCH INFO 2023-03-28 10:58:27,077 ips: None
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - ips: None
LAUNCH INFO 2023-03-28 10:58:27,077 job_id: default
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - job_id: default
LAUNCH INFO 2023-03-28 10:58:27,077 legacy: False
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - legacy: False
LAUNCH INFO 2023-03-28 10:58:27,077 log_dir: log_dir/ppyolo
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - log_dir: log_dir/ppyolo
LAUNCH INFO 2023-03-28 10:58:27,077 log_level: INFO
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - log_level: INFO
LAUNCH INFO 2023-03-28 10:58:27,077 master: None
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - master: None
LAUNCH INFO 2023-03-28 10:58:27,077 max_restart: 3
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - max_restart: 3
LAUNCH INFO 2023-03-28 10:58:27,077 nnodes: 1
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - nnodes: 1
LAUNCH INFO 2023-03-28 10:58:27,077 nproc_per_node: None
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - nproc_per_node: None
LAUNCH INFO 2023-03-28 10:58:27,077 rank: -1
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - rank: -1
LAUNCH INFO 2023-03-28 10:58:27,077 run_mode: collective
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - run_mode: collective
LAUNCH INFO 2023-03-28 10:58:27,077 server_num: None
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - server_num: None
LAUNCH INFO 2023-03-28 10:58:27,077 servers:
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - servers:
LAUNCH INFO 2023-03-28 10:58:27,077 start_port: 6070
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - start_port: 6070
LAUNCH INFO 2023-03-28 10:58:27,077 trainer_num: None
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - trainer_num: None
LAUNCH INFO 2023-03-28 10:58:27,077 trainers:
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - trainers:
LAUNCH INFO 2023-03-28 10:58:27,077 training_script: tools/train.py
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - training_script: tools/train.py
LAUNCH INFO 2023-03-28 10:58:27,077 training_script_args: ['-c', 'configs/ppyolo/ppyolov2_r50vd_dcn_365e_coco.yml', '--eval', '--amp']
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - training_script_args: ['-c', 'configs/ppyolo/ppyolov2_r50vd_dcn_365e_coco.yml', '--eval', '--amp']
LAUNCH INFO 2023-03-28 10:58:27,077 with_gloo: 1
[2023-03-28 10:58:27,077] [ INFO] init.py:47 - with_gloo: 1
LAUNCH INFO 2023-03-28 10:58:27,077 --------------------------------------------------
[2023-03-28 10:58:27,077] [ INFO] init.py:48 - --------------------------------------------------
LAUNCH INFO 2023-03-28 10:58:27,078 Job: default, mode collective, replicas 1[1:1], elastic False
[2023-03-28 10:58:27,078] [ INFO] controller.py:168 - Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2023-03-28 10:58:27,080 Run Pod: owhuje, replicas 4, status ready
[2023-03-28 10:58:27,080] [ INFO] controller.py:60 - Run Pod: owhuje, replicas 4, status ready
LAUNCH INFO 2023-03-28 10:58:27,146 Watching Pod: owhuje, replicas 4, status running
[2023-03-28 10:58:27,146] [ INFO] controller.py:80 - Watching Pod: owhuje, replicas 4, status running
I0328 10:58:29.364696 9850 tcp_utils.cc:181] The server starts to listen on IP_ANY:39766
I0328 10:58:29.364926 9850 tcp_utils.cc:130] Successfully connected to 127.0.0.1:39766
W0328 10:58:31.355732 9850 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.7
W0328 10:58:31.356563 9850 gpu_resources.cc:91] device: 0, cuDNN Version: 8.8.
loading annotations into memory...
Done (t=0.04s)
creating index...
index created!
[03/28 10:58:34] reader WARNING: fail to map batch transform [Gt2YoloTarget_99f518] with error: index 8 is out of bounds for axis 1 with size 8 and stack:
Traceback (most recent call last):
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/data/reader.py", line 73, in __call__
    data = f(data)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/data/transform/batch_operators.py", line 250, in __call__
    target[best_n, 6 + cls, gj, gi] = 1.
IndexError: index 8 is out of bounds for axis 1 with size 8

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 536, in _thread_loop
    batch = self._get_data()
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 674, in _get_data
    batch.reraise()
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 172, in reraise
    raise self.exc_type(msg)
IndexError: DataLoader worker(2) caught IndexError with message:
Traceback (most recent call last):
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 339, in _worker_loop
    batch = fetcher.fetch(indices)
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/fetcher.py", line 138, in fetch
    data = self.collate_fn(data)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/data/reader.py", line 79, in __call__
    raise e
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/data/reader.py", line 73, in __call__
    data = f(data)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/data/transform/batch_operators.py", line 250, in __call__
    target[best_n, 6 + cls, gj, gi] = 1.
IndexError: index 8 is out of bounds for axis 1 with size 8

Process Process-2:
Traceback (most recent call last):
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 371, in _worker_loop
    six.reraise(*sys.exc_info())
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 358, in _worker_loop
    tensor_list = [
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 359, in <listcomp>
    core._array_to_share_memory_tensor(b)
ValueError: (InvalidArgument) Input object type error or incompatible array data type. tensor.set() supports array with bool, float16, float32, float64, int8, int16, int32, int64, uint8 or uint16, please check your input or input array data type. (at /paddle/paddle/fluid/pybind/tensor_py.h:549)

Process Process-7:
Traceback (most recent call last):
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 371, in _worker_loop
    six.reraise(*sys.exc_info())
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 358, in _worker_loop
    tensor_list = [
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 359, in <listcomp>
    core._array_to_share_memory_tensor(b)
ValueError: (InvalidArgument) Input object type error or incompatible array data type. tensor.set() supports array with bool, float16, float32, float64, int8, int16, int32, int64, uint8 or uint16, please check your input or input array data type. (at /paddle/paddle/fluid/pybind/tensor_py.h:549)

Process Process-1:
Traceback (most recent call last):
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 371, in _worker_loop
    six.reraise(*sys.exc_info())
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 358, in _worker_loop
    tensor_list = [
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 359, in <listcomp>
    core._array_to_share_memory_tensor(b)
ValueError: (InvalidArgument) Input object type error or incompatible array data type. tensor.set() supports array with bool, float16, float32, float64, int8, int16, int32, int64, uint8 or uint16, please check your input or input array data type. (at /paddle/paddle/fluid/pybind/tensor_py.h:549)

[03/28 10:58:34] reader WARNING: fail to map batch transform [Gt2YoloTarget_99f518] with error: index 8 is out of bounds for axis 1 with size 8 and stack:
Traceback (most recent call last):
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/data/reader.py", line 73, in __call__
    data = f(data)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/data/transform/batch_operators.py", line 250, in __call__
    target[best_n, 6 + cls, gj, gi] = 1.
IndexError: index 8 is out of bounds for axis 1 with size 8

[03/28 10:58:34] reader WARNING: fail to map batch transform [Gt2YoloTarget_99f518] with error: index 8 is out of bounds for axis 1 with size 8 and stack:
Traceback (most recent call last):
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/data/reader.py", line 73, in __call__
    data = f(data)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/data/transform/batch_operators.py", line 250, in __call__
    target[best_n, 6 + cls, gj, gi] = 1.
IndexError: index 8 is out of bounds for axis 1 with size 8

Process Process-4:
Traceback (most recent call last):
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 371, in _worker_loop
    six.reraise(*sys.exc_info())
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 358, in _worker_loop
    tensor_list = [
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 359, in <listcomp>
    core._array_to_share_memory_tensor(b)
ValueError: (InvalidArgument) Input object type error or incompatible array data type. tensor.set() supports array with bool, float16, float32, float64, int8, int16, int32, int64, uint8 or uint16, please check your input or input array data type. (at /paddle/paddle/fluid/pybind/tensor_py.h:549)

Process Process-8:
Traceback (most recent call last):
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 371, in _worker_loop
    six.reraise(*sys.exc_info())
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 358, in _worker_loop
    tensor_list = [
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 359, in <listcomp>
    core._array_to_share_memory_tensor(b)
ValueError: (InvalidArgument) Input object type error or incompatible array data type. tensor.set() supports array with bool, float16, float32, float64, int8, int16, int32, int64, uint8 or uint16, please check your input or input array data type. (at /paddle/paddle/fluid/pybind/tensor_py.h:549)

Traceback (most recent call last):
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/tools/train.py", line 172, in <module>
    main()
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/tools/train.py", line 129, in run
    trainer.load_weights(cfg.pretrain_weights)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/engine/trainer.py", line 373, in load_weights
    load_pretrain_weight(self.model, weights)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/utils/checkpoint.py", line 214, in load_pretrain_weight
    param_state_dict = paddle.load(weights_path)
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/framework/io.py", line 1044, in load
    load_result = pickle.load(f, encoding='latin1')
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/multiprocess_utils.py", line 135, in handler
    core._throw_error_if_process_failed()
SystemError: (Fatal) DataLoader process (pid 10892) exited unexpectedly with code 1. Error detailed are lost due to multiprocessing. Rerunning with:

  1. If run DataLoader by DataLoader.from_generator(...), run with DataLoader.from_generator(..., use_multiprocess=False) may give better error trace.
  2. If run DataLoader by DataLoader(dataset, ...), run with DataLoader(dataset, ..., num_workers=0) may give better error trace (at /paddle/paddle/fluid/imperative/data_loader.cc:161)

I0328 10:58:35.619158 10389 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop LAUNCH INFO 2023-03-28 10:58:36,159 Pod failed [2023-03-28 10:58:36,159] [ INFO] controller.py:109 - Pod failed LAUNCH ERROR 2023-03-28 10:58:36,160 Container failed !!! Container rank 0 status failed cmd ['/home/xap/soft/miniconda3/envs/paddle/bin/python3', '-u', 'tools/train.py', '-c', 'configs/ppyolo/ppyolov2_r50vd_dcn_365e_coco.yml', '--eval', '--amp'] code 1 log log_dir/ppyolo/workerlog.0 env {'XDG_SESSION_ID': '52', 'HOSTNAME': 'localhost.localdomain', 'SELINUX_ROLE_REQUESTED': '', 'HARDWARE_PLATFORM': 'x86_64', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'HISTSIZE': '1000', 'QT_QPA_PLATFORM_PLUGIN_PATH': '/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/cv2/qt/plugins', 'SSH_CLIENT': '192.168.12.66 56124 22', 'CONDA_SHLVL': '1', 'CONDA_PROMPT_MODIFIER': '(paddle) ', 'SELINUX_USE_CURRENT_RANGE': '', 'QTDIR': '/usr/lib64/qt-3.3', 'OLDPWD': '/home/guoyanfang/paddle', 'ASCEND_TOOLKIT_HOME': '/usr/local/Ascend/ascend-toolkit/latest', 'QTINC': '/usr/lib64/qt-3.3/include', 'ASCEND_OPP_PATH': '/usr/local/Ascend/ascend-toolkit/latest/opp', 'SSH_TTY': '/dev/pts/9', 'QT_GRAPHICSSYSTEM_CHECKED': '1', 'CUDA_HOME': '/usr/local/cuda-11.7', 'USER': 'root', 'LD_LIBRARY_PATH': '/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/cv2/../../lib64:/usr/local/cuda-11.7/lib64/:/usr/local/TensorRT-8.6.0.12/lib', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.jpg=01;35:.jpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.axv=01;35:.anx=01;35:.ogv=01;35:.ogx=01;35:.aac=01;36:.au=01;36:.flac=01;36:.mid=01;36:.midi=01;36:.mka=01;36:.mp3=01;36:.mpc=01;36:.ogg=01;36:.ra=01;36:.wav=01;36:.axa=01;36:.oga=01;36:.spx=01;36:.xspf=01;36:', 'ASCEND_AICPU_PATH': '/usr/local/Ascend/ascend-toolkit/latest', 'CONDA_EXE': '/home/xap/soft/miniconda3/bin/conda', 'ASCEND_HOME_PATH': '/usr/local/Ascend/ascend-toolkit/latest', '_CE_CONDA': '', 'MAIL': '/var/spool/mail/root', 'PATH': '/home/xap/soft/miniconda3/envs/paddle/bin:/usr/local/cuda-11.7/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/bin/python3/bin:/home/xap/soft/miniconda3/condabin:/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/TensorRT-8.6.0.12/bin:/root/bin', 'CONDA_PREFIX': '/home/xap/soft/miniconda3/envs/paddle', 'PWD': '/home/guoyanfang/paddle/PaddleDetection-release-2.5', 'LANG': 'zh_CN.UTF-8', 'TOOLCHAIN_HOME': 
'/usr/local/Ascend/ascend-toolkit/latest/toolkit', 'SELINUX_LEVEL_REQUESTED': '', '_CE_M': '', 'HISTCONTROL': 'ignoredups', 'SHLVL': '1', 'HOME': '/root', 'PYTHONPATH': '/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe:', 'CONDA_PYTHON_EXE': '/home/xap/soft/miniconda3/bin/python', 'LOGNAME': 'root', 'QTLIB': '/usr/lib64/qt-3.3/lib', 'SSH_CONNECTION': '192.168.12.66 56124 192.168.21.240 22', 'DOCKER_REGISTRY': 'registry.cn-hangzhou.aliyuncs.com/mtanzl/tan', 'CONDA_DEFAULT_ENV': 'paddle', 'LESSOPEN': '||/usr/bin/lesspipe.sh %s', 'XDG_RUNTIMEDIR': '/run/user/0', '': '/home/xap/soft/miniconda3/envs/paddle/bin/python3', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_FONTDIR': '/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/cv2/qt/fonts', 'POD_NAME': 'owhuje', 'PADDLE_MASTER': '127.0.0.1:39766', 'PADDLE_GLOBAL_SIZE': '4', 'PADDLE_LOCAL_SIZE': '4', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_TRAINER_ENDPOINTS': '127.0.0.1:39767,127.0.0.1:39768,127.0.0.1:39769,127.0.0.1:39770', 'PADDLE_CURRENT_ENDPOINT': '127.0.0.1:39767', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '4', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '0'} [2023-03-28 10:58:36,160] [ ERROR] controller.py:110 - Container failed !!! Container rank 0 status failed cmd ['/home/xap/soft/miniconda3/envs/paddle/bin/python3', '-u', 'tools/train.py', '-c', 'configs/ppyolo/ppyolov2_r50vd_dcn_365e_coco.yml', '--eval', '--amp'] code 1 log log_dir/ppyolo/workerlog.0 env {'XDG_SESSION_ID': '52', 'HOSTNAME': 'localhost.localdomain', 'SELINUX_ROLE_REQUESTED': '', 'HARDWARE_PLATFORM': 'x86_64', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'HISTSIZE': '1000', 'QT_QPA_PLATFORM_PLUGIN_PATH': '/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/cv2/qt/plugins', 'SSH_CLIENT': '192.168.12.66 56124 22', 'CONDA_SHLVL': '1', 'CONDA_PROMPT_MODIFIER': '(paddle) ', 'SELINUX_USE_CURRENT_RANGE': '', 'QTDIR': '/usr/lib64/qt-3.3', 'OLDPWD': '/home/guoyanfang/paddle', 'ASCEND_TOOLKIT_HOME': '/usr/local/Ascend/ascend-toolkit/latest', 'QTINC': '/usr/lib64/qt-3.3/include', 'ASCEND_OPP_PATH': '/usr/local/Ascend/ascend-toolkit/latest/opp', 'SSH_TTY': '/dev/pts/9', 'QT_GRAPHICSSYSTEM_CHECKED': '1', 'CUDA_HOME': '/usr/local/cuda-11.7', 'USER': 'root', 'LD_LIBRARY_PATH': '/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/cv2/../../lib64:/usr/local/cuda-11.7/lib64/:/usr/local/TensorRT-8.6.0.12/lib', 'LS_COLORS': 
'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.jpg=01;35:.jpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.axv=01;35:.anx=01;35:.ogv=01;35:.ogx=01;35:.aac=01;36:.au=01;36:.flac=01;36:.mid=01;36:.midi=01;36:.mka=01;36:.mp3=01;36:.mpc=01;36:.ogg=01;36:.ra=01;36:.wav=01;36:.axa=01;36:.oga=01;36:.spx=01;36:.xspf=01;36:', 'ASCEND_AICPU_PATH': '/usr/local/Ascend/ascend-toolkit/latest', 'CONDA_EXE': '/home/xap/soft/miniconda3/bin/conda', 'ASCEND_HOME_PATH': '/usr/local/Ascend/ascend-toolkit/latest', '_CE_CONDA': '', 'MAIL': '/var/spool/mail/root', 'PATH': '/home/xap/soft/miniconda3/envs/paddle/bin:/usr/local/cuda-11.7/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/bin/python3/bin:/home/xap/soft/miniconda3/condabin:/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/TensorRT-8.6.0.12/bin:/root/bin', 'CONDA_PREFIX': '/home/xap/soft/miniconda3/envs/paddle', 'PWD': '/home/guoyanfang/paddle/PaddleDetection-release-2.5', 'LANG': 'zh_CN.UTF-8', 'TOOLCHAIN_HOME': '/usr/local/Ascend/ascend-toolkit/latest/toolkit', 'SELINUX_LEVEL_REQUESTED': '', '_CE_M': '', 'HISTCONTROL': 'ignoredups', 'SHLVL': '1', 'HOME': '/root', 'PYTHONPATH': '/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe:', 'CONDA_PYTHON_EXE': '/home/xap/soft/miniconda3/bin/python', 'LOGNAME': 'root', 'QTLIB': '/usr/lib64/qt-3.3/lib', 'SSH_CONNECTION': '192.168.12.66 56124 192.168.21.240 22', 'DOCKER_REGISTRY': 'registry.cn-hangzhou.aliyuncs.com/mtanzl/tan', 'CONDA_DEFAULT_ENV': 'paddle', 'LESSOPEN': '||/usr/bin/lesspipe.sh %s', 'XDG_RUNTIMEDIR': '/run/user/0', '': '/home/xap/soft/miniconda3/envs/paddle/bin/python3', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_FONTDIR': '/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/cv2/qt/fonts', 'POD_NAME': 'owhuje', 'PADDLE_MASTER': '127.0.0.1:39766', 'PADDLE_GLOBAL_SIZE': '4', 'PADDLE_LOCAL_SIZE': '4', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_TRAINER_ENDPOINTS': '127.0.0.1:39767,127.0.0.1:39768,127.0.0.1:39769,127.0.0.1:39770', 'PADDLE_CURRENT_ENDPOINT': '127.0.0.1:39767', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '4', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '0'} LAUNCH INFO 2023-03-28 10:58:36,160 ------------------------- ERROR LOG DETAIL ------------------------- [2023-03-28 10:58:36,160] [ INFO] controller.py:111 - 
------------------------- ERROR LOG DETAIL ------------------------- or input array data type. (at /paddle/paddle/fluid/pybind/tensor_py.h:549)

Process Process-8:
Traceback (most recent call last):
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 371, in _worker_loop
    six.reraise(*sys.exc_info())
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 358, in _worker_loop
    tensor_list = [
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/dataloader/worker.py", line 359, in <listcomp>
    core._array_to_share_memory_tensor(b)
ValueError: (InvalidArgument) Input object type error or incompatible array data type. tensor.set() supports array with bool, float16, float32, float64, int8, int16, int32, int64, uint8 or uint16, please check your input or input array data type. (at /paddle/paddle/fluid/pybind/tensor_py.h:549)

Traceback (most recent call last):
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/tools/train.py", line 172, in <module>
    main()
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/tools/train.py", line 168, in main
    run(FLAGS, cfg)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/tools/train.py", line 129, in run
    trainer.load_weights(cfg.pretrain_weights)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/engine/trainer.py", line 373, in load_weights
    load_pretrain_weight(self.model, weights)
  File "/home/guoyanfang/paddle/PaddleDetection-release-2.5/ppdet/utils/checkpoint.py", line 214, in load_pretrain_weight
    param_state_dict = paddle.load(weights_path)
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/framework/io.py", line 1044, in load
    load_result = pickle.load(f, encoding='latin1')
  File "/home/xap/soft/miniconda3/envs/paddle/lib/python3.9/site-packages/paddle/fluid/multiprocess_utils.py", line 135, in handler
    core._throw_error_if_process_failed()
SystemError: (Fatal) DataLoader process (pid 10892) exited unexpectedly with code 1. Error detailed are lost due to multiprocessing. Rerunning with:

  1. If run DataLoader by DataLoader.from_generator(...), run with DataLoader.from_generator(..., use_multiprocess=False) may give better error trace.
  2. If run DataLoader by DataLoader(dataset, ...), run with DataLoader(dataset, ..., num_workers=0) may give better error trace (at /paddle/paddle/fluid/imperative/data_loader.cc:161)

I0328 10:58:35.619158 10389 tcp_store.cc:257] receive shutdown event and so quit from MasterDaemon run loop
LAUNCH INFO 2023-03-28 10:58:36,362 Exit code 1
[2023-03-28 10:58:36,362] [ INFO] controller.py:141 - Exit code 1

0721fang commented 1 year ago

The above is the complete output.

zhangbo9674 commented 1 year ago

Hi, judging from the error stack you provided, the failure occurs in PaddleDetection's Gt2YoloTarget module and is caused by an out-of-bounds index. My initial suspicion is that it is related to the format of your dataset. Please open an issue in the PaddleDetection repo to ask about the specific cause: https://github.com/PaddlePaddle/PaddleDetection/issues
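
To make the IndexError concrete: the second axis of the target array built by Gt2YoloTarget has size 6 + num_classes (six box/objectness channels plus one channel per class), so a ground-truth category id that is not smaller than num_classes writes past the end of that axis. The numpy sketch below is only an illustration (all shapes other than axis 1, and the index values, are made up), but it reproduces exactly the message from the log, which is why a mismatch between the dataset's category ids and num_classes in the config is the usual suspect.

```python
import numpy as np

num_classes = 2        # "axis 1 with size 8" in the log implies 6 + num_classes == 8
num_anchors, grid_h, grid_w = 3, 19, 19   # illustrative values, not taken from the config

target = np.zeros((num_anchors, 6 + num_classes, grid_h, grid_w), dtype=np.float32)

best_n, gj, gi = 0, 5, 7
cls = 2                # a category id outside the valid range [0, num_classes)
target[best_n, 6 + cls, gj, gi] = 1.0
# -> IndexError: index 8 is out of bounds for axis 1 with size 8
```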

jerrywgz commented 1 year ago

You could print the image id at the failing line in the batch operators to check whether it is the same image every time.
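
One way to do that is sketched below. This is not the actual ppdet code: the loop only mirrors the __call__ frames shown in the traceback, and the 'im_id' key is an assumption about how each sample dict is labelled in this version of PaddleDetection.

```python
def apply_batch_transforms(batch_transforms, batch):
    """Apply batch transforms one at a time, reporting which images break."""
    for transform in batch_transforms:
        try:
            batch = transform(batch)
        except Exception as exc:
            # Collect the image ids of the samples in the failing batch so the
            # offending annotation can be traced back to its source image.
            im_ids = [sample.get('im_id') for sample in batch
                      if isinstance(sample, dict)]
            print('batch transform {} failed for im_id={}: {}'.format(
                type(transform).__name__, im_ids, exc))
            raise
    return batch
```

If the same im_id shows up on every failure, the corresponding annotation (for example an out-of-range category id) is very likely the culprit.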

paddle-bot[bot] commented 6 months ago

Since you haven't replied for more than a year, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.