PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0
12.62k stars 2.87k forks

OSError: (External) CUDA error(700), an illegal memory access was encountered. #8711

Open dabensongbing opened 10 months ago

dabensongbing commented 10 months ago

Search before asking

Please ask your question

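After cloning the project with git and configuring it according to the official tutorial, I cannot start training. The CUDA version is 11.6, the cuDNN version is 8.4, the GPU is a 3060 Laptop, and the paddlepaddle-gpu version is 2.3.2.post116. The getting-started run completes without problems, but training fails with the following error: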

(temppddet) E:\tempppdet>python tools/train.py -c configs/picodet/picodet_xs_320_coco_lcnet.yml --eval
D:\conda\envs\temppddet\lib\site-packages\setuptools\sandbox.py:13: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
D:\conda\envs\temppddet\lib\site-packages\pkg_resources\__init__.py:2871: DeprecationWarning: Deprecated call to pkg_resources.declare_namespace('google').
Implementing implicit namespace packages (as specified in PEP 420) is preferred to pkg_resources.declare_namespace. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
  declare_namespace(pkg)
Warning: Unable to use numba in PP-Tracking, please install numba, for example(python3.7):pip install numba==0.56.4
Warning: Unable to use numba in PP-Tracking, please install numba, for example(python3.7):pip install numba==0.56.4
W1110 15:53:01.155753 13448 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.0, Runtime API Version: 11.6
W1110 15:53:01.159790 13448 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
[11/10 15:53:03] ppdet.utils.checkpoint INFO: ['last_conv.weight'] in pretrained weight is not used in the model, and its will not be loaded
[11/10 15:53:03] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [96] in model head.conv_feat.se.0.fc.bias. And the weight fc.bias will not be loaded
[11/10 15:53:03] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [96, 96, 1, 1] in model head.conv_feat.se.0.fc.weight. And the weight fc.weight will not be loaded
[11/10 15:53:03] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [96] in model head.conv_feat.se.1.fc.bias. And the weight fc.bias will not be loaded
[11/10 15:53:03] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [96, 96, 1, 1] in model head.conv_feat.se.1.fc.weight. And the weight fc.weight will not be loaded
[11/10 15:53:03] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [96] in model head.conv_feat.se.2.fc.bias. And the weight fc.bias will not be loaded
[11/10 15:53:03] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [96, 96, 1, 1] in model head.conv_feat.se.2.fc.weight. And the weight fc.weight will not be loaded
[11/10 15:53:03] ppdet.utils.checkpoint INFO: The shape [1000] in pretrained weight fc.bias is unmatched with the shape [96] in model head.conv_feat.se.3.fc.bias. And the weight fc.bias will not be loaded
[11/10 15:53:03] ppdet.utils.checkpoint INFO: The shape [1280, 1000] in pretrained weight fc.weight is unmatched with the shape [96, 96, 1, 1] in model head.conv_feat.se.3.fc.weight. And the weight fc.weight will not be loaded
[11/10 15:53:03] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\dhbenson/.cache/paddle/weights\LCNet_x0_35_pretrained.pdparams
[False]
Traceback (most recent call last):
  File "E:\tempppdet\tools\train.py", line 202, in <module>
    main()
  File "E:\tempppdet\tools\train.py", line 198, in main
    run(FLAGS, cfg)
  File "E:\tempppdet\tools\train.py", line 151, in run
    trainer.train(FLAGS.eval)
  File "E:\tempppdet\ppdet\engine\trainer.py", line 537, in train
    outputs = model(data)
  File "D:\conda\envs\temppddet\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "D:\conda\envs\temppddet\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "E:\tempppdet\ppdet\modeling\architectures\meta_arch.py", line 60, in forward
    out = self.get_loss()
  File "E:\tempppdet\ppdet\modeling\architectures\picodet.py", line 79, in get_loss
    loss_gfl = self.head.get_loss(head_outs, self.inputs)
  File "E:\tempppdet\ppdet\modeling\heads\pico_head.py", line 713, in get_loss
    target_corners = bbox2distance(pos_centers, pos_bbox_targets,
  File "E:\tempppdet\ppdet\modeling\bbox_utils.py", line 525, in bbox2distance
    return paddle.stack([left, top, right, bottom], -1)
  File "D:\conda\envs\temppddet\lib\site-packages\paddle\tensor\manipulation.py", line 903, in stack
    return layers.stack(x, axis, name)
  File "D:\conda\envs\temppddet\lib\site-packages\paddle\fluid\layers\nn.py", line 10397, in stack
    return _C_ops.stack(x, 'axis', axis)
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:251)
  [operator < stack > error]
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:251)
  [operator < stack > error]
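
CUDA kernels are launched asynchronously, so the operator named in the traceback (`stack`) is not necessarily the one that performed the illegal access. A possible next diagnostic step, sketched below, is to re-run the same command with the CUDA runtime's `CUDA_LAUNCH_BLOCKING=1` environment variable, which makes kernel launches synchronous so the error surfaces closer to the call that caused it. The names `TRAIN_CMD`, `blocking_env` and `run_training` are hypothetical helpers, not part of PaddleDetection; the command itself is copied from the report.

```python
import os
import subprocess
import sys

# Training command copied from the report above; adjust the config path as needed.
TRAIN_CMD = [
    sys.executable, "tools/train.py",
    "-c", "configs/picodet/picodet_xs_320_coco_lcnet.yml",
    "--eval",
]


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # CUDA_LAUNCH_BLOCKING is read by the CUDA runtime, not by Paddle itself.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_training():
    """Run the training command from the repository root (hypothetical helper)."""
    return subprocess.run(TRAIN_CMD, env=blocking_env()).returncode
```

Calling `run_training()` from the repository root (here `E:\tempppdet`) should reproduce the failure, possibly with a different and more informative operator at the bottom of the traceback. Training will be slower while the variable is set.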

LokeZhou commented 6 months ago

This error is usually caused by insufficient GPU memory.
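
One way to check the suggestion above is to look at GPU memory use just before or while training starts. The following is a minimal sketch, assuming `nvidia-smi` is installed and on `PATH`; `parse_memory` and `query_memory` are hypothetical helper names, and the parser assumes every field is a plain integer in MiB.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' lines into a list of (used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_memory():
    """Run nvidia-smi and return per-GPU memory use (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)
```

If `query_memory()` shows the card close to full before training begins, closing other GPU applications or lowering the training batch size in the config would be the usual things to try.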