PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

OSError: (External) CUDA error(719), unspecified launch failure. #8492

Open john09282922 opened 1 year ago

john09282922 commented 1 year ago

Search before asking

Bug Component

Training

Describe the Bug

Warning: Unable to use JDE/FairMOT/ByteTrack, please install lap, for example: `pip install lap`, see https://github.com/gatagat/lap
Warning: Unable to use numba in PP-Tracking, please install numba, for example(python3.7): `pip install numba==0.56.4`
Warning: Unable to use numba in PP-Tracking, please install numba, for example(python3.7): `pip install numba==0.56.4`
Warning: Unable to use MOT metric, please install motmetrics, for example: `pip install motmetrics`, see https://github.com/longcw/py-motmetrics
Warning: Unable to use MCMOT metric, please install motmetrics, for example: `pip install motmetrics`, see https://github.com/longcw/py-motmetrics
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
[07/29 10:23:03] ppdet.data.source.coco INFO: Load [3271 samples valid, 11 samples invalid] in file dataset/mydata/train/annotations/_annotations.coco.json.
W0729 10:23:03.598515 797620 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.0, Runtime API Version: 11.8
W0729 10:23:03.599011 797620 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8.
[07/29 10:23:05] ppdet.utils.checkpoint INFO: ['fc.bias', 'fc.weight', 'last_conv.weight'] in pretrained weight is not used in the model, and its will not be loaded
[07/29 10:23:05] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/user/.cache/paddle/weights/PPHGNetV2_X_ssld_pretrained.pdparams
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion `index_val >= 0 && index_val < input_index_dim_size` failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
(the assertion message above is printed repeatedly, identical each time)
Traceback (most recent call last):
  File "/home/user/test1/PaddleDetection/tools/train.py", line 209, in <module>
    main()
  File "/home/user/test1/PaddleDetection/tools/train.py", line 205, in main
    run(FLAGS, cfg)
  File "/home/user/test1/PaddleDetection/tools/train.py", line 158, in run
    trainer.train(FLAGS.eval)
  File "/home/user/test1/PaddleDetection/ppdet/engine/trainer.py", line 577, in train
    outputs = model(data)
  File "/home/user/anaconda3/envs/paddle/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/user/test1/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 60, in forward
    out = self.get_loss()
  File "/home/user/test1/PaddleDetection/ppdet/modeling/architectures/detr.py", line 115, in get_loss
    return self._forward()
  File "/home/user/test1/PaddleDetection/ppdet/modeling/architectures/detr.py", line 93, in _forward
    detr_losses = self.detr_head(out_transformer, body_feats,
  File "/home/user/anaconda3/envs/paddle/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/user/test1/PaddleDetection/ppdet/modeling/heads/detr_head.py", line 453, in forward
    return self.loss(
  File "/home/user/anaconda3/envs/paddle/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/user/test1/PaddleDetection/ppdet/modeling/losses/detr_loss.py", line 434, in forward
    total_loss = super(DINOLoss, self).forward(
  File "/home/user/test1/PaddleDetection/ppdet/modeling/losses/detr_loss.py", line 388, in forward
    total_loss = self._get_prediction_loss(
  File "/home/user/test1/PaddleDetection/ppdet/modeling/losses/detr_loss.py", line 322, in _get_prediction_loss
    match_indices = self.matcher(
  File "/home/user/anaconda3/envs/paddle/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/user/test1/PaddleDetection/ppdet/modeling/transformers/matchers.py", line 178, in forward
    indices = [
  File "/home/user/test1/PaddleDetection/ppdet/modeling/transformers/matchers.py", line 179, in <listcomp>
    linear_sum_assignment(c.split(sizes, -1)[i].numpy())
OSError: (External) CUDA error(719), unspecified launch failure.
  [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:267)

training error
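
A note on reading the traceback: CUDA kernel launches are asynchronous, so a device-side failure is usually reported at the next point where the host waits on the GPU. Here that is the `.numpy()` copy inside the matcher, which may not be where the failing `gather` was issued. A minimal sketch of rerunning with the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1`, which makes launches synchronous so the Python stack is closer to the failing op. The helper names `blocking_launch_env` and `train_command` are hypothetical, and the config path is whatever was used for the original run (it is not given in this issue).

```python
import os
import sys


def blocking_launch_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING is read by the CUDA runtime when the process starts,
    # so it has to be set before the training process is launched.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def train_command(config_path):
    """Build the training command; config_path is supplied by the caller."""
    return [sys.executable, "tools/train.py", "-c", config_path]


# Usage (run from the PaddleDetection checkout, with your own config path):
#   import subprocess
#   subprocess.run(train_command(my_config), env=blocking_launch_env())
```

Synchronous launches slow training down considerably, so this is only meant for locating the failing call, not for regular runs.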

Environment

OS: Linux
Ver: Paddle-gpu 2.5.0
cuda 11.2 ~ 12.0

Bug description confirmation

Are you willing to submit a PR?

indulgence1 commented 11 months ago

Has this been resolved? I'm running into the same problem.

lyuwenyu commented 8 months ago

Is this your own dataset? Which model are you using, and which versions?
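
To help answer the dataset question, here is a minimal sketch that summarizes the category ids in a COCO-format annotation file such as the `dataset/mydata/train/annotations/_annotations.coco.json` mentioned in the log. It uses only the standard library and assumes the usual COCO layout with top-level `categories` and `annotations` lists. The function names are hypothetical, and whether the category ids are related to the `gather` assertion has not been established in this thread.

```python
import json
from collections import Counter


def summarize_coco_categories(coco):
    """Summarize declared vs. used category ids in a loaded COCO dict."""
    declared = sorted(c["id"] for c in coco.get("categories", []))
    used = Counter(a["category_id"] for a in coco.get("annotations", []))
    # Ids referenced by annotations but missing from the categories list.
    undeclared = sorted(set(used) - set(declared))
    return {
        "declared_ids": declared,
        "used_counts": dict(used),
        "undeclared_ids": undeclared,
    }


def load_and_summarize(path):
    """Read a COCO JSON file from disk and summarize its categories."""
    with open(path, "r", encoding="utf-8") as f:
        return summarize_coco_categories(json.load(f))


# Usage:
#   print(load_and_summarize("dataset/mydata/train/annotations/_annotations.coco.json"))
```

Posting this summary together with the model config name and the `num_classes` value set in it would give maintainers the details asked for above.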