OSError: (External) CUDA error(700), an illegal memory access was encountered.

xxPete commented 1 year ago

Checklist:

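Searched past related issues for an answer
Read the FAQ and the Q&A summary
Confirmed that the bug is still unfixed in the latest version
Read the PaddleX usage documentation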

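Problem description

I reproduced this from the project at https://aistudio.baidu.com/aistudio/projectdetail/4398052?channelType=0&channel=0. Training runs normally on AI Studio, but fails on my local machine. GPU memory is sufficient.

Reproduction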

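Have you successfully run the tutorial we provide?
- Yes, it runs normally
Did you modify the code on top of the tutorial? If so, please provide the code you ran
- No
Which dataset are you using?
- The Xiaoduxiong (小度熊) instance segmentation dataset

Please provide the error message and the related log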

2022-10-09 09:05:30,360-WARNING: type object 'QuantizationTransformPass' has no attribute '_supported_quantizable_op_type'
2022-10-09 09:05:30,360-WARNING: If you want to use training-aware and post-training quantization, please use Paddle >= 1.8.4 or develop version
D:\Project\PaddleX\PaddleX-develop\paddlex\ppcls\data\preprocess\ops\timm_autoaugment.py:38: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
_RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
D:\Project\PaddleX\PaddleX-develop\paddlex\ppcls\data\preprocess\ops\timm_autoaugment.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
_RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2022-10-09 09:05:31 [INFO]      Starting to read file list from dataset...
2022-10-09 09:05:31 [INFO]      14 samples in file ./dataset/xiaoduxiong_ins_det/train.json, including 14 positive samples and 0 negative samples.
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2022-10-09 09:05:31 [INFO]      Starting to read file list from dataset...
2022-10-09 09:05:31 [INFO]      4 samples in file ./dataset/xiaoduxiong_ins_det/val.json, including 4 positive samples and 0 negative samples.
W1009 09:05:31.109730 19380 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.6
W1009 09:05:31.112730 19380 gpu_resources.cc:91] device: 0, cuDNN Version: 8.6.
2022-10-09 09:05:31 [INFO]      Loading pretrained model from output/mask_rcnn_r50_fpn\pretrain\mask_rcnn_r50_fpn_2x_coco.pdparams
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_score.weight doesn't match.(Pretrained: [1024, 81], Actual: [1024, 2])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_score.bias doesn't match.(Pretrained: [81], Actual: [2])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_delta.weight doesn't match.(Pretrained: [1024, 320], Actual: [1024, 4])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_delta.bias doesn't match.(Pretrained: [320], Actual: [4])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params mask_head.mask_fcn_logits.weight doesn't match.(Pretrained: [80, 256, 1, 1], Actual: [1, 256, 1, 1])
2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params mask_head.mask_fcn_logits.bias doesn't match.(Pretrained: [80], Actual: [1])
2022-10-09 09:05:32 [INFO]      There are 301/307 variables loaded into MaskRCNN.
Traceback (most recent call last):
File ".\train_xiaodu.py", line 40, in <module>
use_vdl=False)
File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 2188, in train
early_stop_patience, use_vdl, resume_checkpoint)
File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 334, in train
use_vdl=use_vdl)
File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\base.py", line 339, in train_loop
outputs = self.run(self.net, data, mode='train')
File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 105, in run
net_out = net(inputs)
File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
out = self.get_loss()
File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\mask_rcnn.py", line 123, in get_loss
bbox_loss, mask_loss, rpn_loss = self._forward()
File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\mask_rcnn.py", line 93, in _forward
rois, rois_num, rpn_loss = self.rpn_head(body_feats, self.inputs)
File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\proposal_generator\rpn_head.py", line 140, in forward
loss = self.get_loss(scores, deltas, anchors, inputs)
File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\proposal_generator\rpn_head.py", line 278, in get_loss
pos_ind = paddle.nonzero(pos_mask)
File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\tensor\search.py", line 402, in nonzero
outs = _C_ops.where_index(x)
OSError: (External) CUDA error(700), an illegal memory access was encountered.
[Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:251)
[operator < where_index > error]
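
CUDA kernels are launched asynchronously, so the operator named at the end of the traceback (`where_index`) is not necessarily the one that performed the illegal access. A minimal sketch, assuming the script name `train_xiaodu.py` from the traceback: setting the standard CUDA environment variable `CUDA_LAUNCH_BLOCKING=1` forces synchronous launches, so the error tends to surface closer to the faulty kernel. The helper names `blocking_env` and `rerun_blocking` are hypothetical.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Copy an environment mapping and enable synchronous CUDA kernel launches."""
    env = dict(os.environ if base is None else base)
    # Standard CUDA runtime switch: each kernel launch blocks until it finishes.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_blocking(script="train_xiaodu.py"):
    """Rerun the training script (name taken from the traceback) in a child process."""
    completed = subprocess.run([sys.executable, script], env=blocking_env())
    return completed.returncode
```

Calling `rerun_blocking()` from the directory that holds the training script runs it with the current interpreter; training will be slower, so this is only meant for debugging.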

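Environment

Please provide the versions of PaddlePaddle and PaddleX you are using
- paddlepaddle-gpu 2.3.2.post116
- paddlex 2.1.0
Please provide your operating system, e.g. Linux/Windows/MacOS
- Windows
Which Python version are you using?
- 3.7
Which CUDA/cuDNN versions are you using?
- 11.6/8.6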
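
The version details requested above can be collected programmatically. This is a rough sketch, not an official PaddleX tool: `collect_env` is a hypothetical helper, and it assumes that `paddle.__version__`, `paddle.version.cuda()`, `paddle.version.cudnn()` and `paddlex.__version__` are available in the installed packages, as they are in the Paddle 2.x releases I am aware of. It degrades gracefully when the packages are missing.

```python
import platform


def collect_env():
    """Gather the OS, Python, Paddle, PaddleX and CUDA/cuDNN versions."""
    info = {
        "os": platform.system(),
        "python": platform.python_version(),
    }
    try:
        import paddle  # provided by the paddlepaddle-gpu wheel
    except ImportError:
        info["paddle"] = "not installed"
        return info
    info["paddle"] = paddle.__version__
    # CUDA/cuDNN versions the wheel was built against (assumed Paddle 2.x API).
    info["cuda"] = paddle.version.cuda()
    info["cudnn"] = paddle.version.cudnn()
    try:
        import paddlex
        info["paddlex"] = paddlex.__version__
    except ImportError:
        info["paddlex"] = "not installed"
    return info
```

Printing the items of `collect_env()` gives a block that can be pasted into a report. Paddle also ships `paddle.utils.run_check()`, which verifies that the GPU build can actually run on the local device.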

xxPete commented 1 year ago

Adding the message that appears when I debug:

Error: ../paddle/phi/kernels/funcs/scatter.cu.h:66 Assertion `scatter_i >= 0` failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be greater than or equal to 0, but received [-1118890112]

There are several hundred lines of this same message.
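
The assertion reports that a negative index reached a scatter kernel. As an illustration only, the same condition can be checked on the CPU before index data is sent to the GPU. `find_out_of_bounds` is a hypothetical pure-Python helper, not part of Paddle, and the upper-bound check is an addition of this sketch (the quoted assertion only tests `>= 0`).

```python
def find_out_of_bounds(indices, size):
    """Return (position, value) pairs for indices outside the range [0, size)."""
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not 0 <= idx < size
    ]


# Toy data: the negative value is the one quoted in the assertion message.
bad = find_out_of_bounds([0, 3, -1118890112, 7], size=5)
print(bad)  # [(2, -1118890112), (3, 7)]
```

An empty result means every index is valid for a destination of length `size`.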

SUNbrightness commented 1 year ago

Switching to paddlepaddle-gpu 2.1.3.post112 resolves the problem.
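
To compare the failing and working wheels mentioned in this thread, the version strings can be split into a release tuple and a CUDA build. This sketch assumes the `.postNNN` suffix encodes the CUDA version with a two-digit major number (so `post116` reads as CUDA 11.6, matching the environment reported above); `parse_paddle_version` is a hypothetical helper.

```python
import re


def parse_paddle_version(version):
    """Split e.g. '2.3.2.post116' into ((2, 3, 2), '11.6')."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(part) for part in match.group(1, 2, 3))
    post = match.group(4)
    cuda = None
    if post is not None and len(post) >= 3:
        # Assumption: first two digits are the CUDA major version.
        cuda = f"{post[:2]}.{post[2:]}"
    return release, cuda


failing = parse_paddle_version("2.3.2.post116")
working = parse_paddle_version("2.1.3.post112")
print(failing, working)  # ((2, 3, 2), '11.6') ((2, 1, 3), '11.2')
```

Under that assumption, the suggested wheel differs from the failing one in both the Paddle release and the CUDA build, so the thread does not show which of the two changes matters.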

keepgoing365 commented 1 year ago

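I hit the same error on a segmentation task with the DeepLabV3P model. After setting use_mixed_loss = false the error disappeared; it looks like DeepLabV3P cannot be used with the mixed loss function. My environment: win10, paddle-gpu 2.3.2 post112, paddlex 2.1.0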
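
A minimal sketch of the workaround described above, assuming the PaddleX 2.x constructor `paddlex.seg.DeepLabV3P` accepts `num_classes`, `backbone` and `use_mixed_loss` keyword arguments. The helper names and the `ResNet50_vd` backbone choice are illustrative assumptions, not taken from the comment.

```python
def deeplab_kwargs(num_classes, backbone="ResNet50_vd"):
    """Constructor arguments with the mixed loss switched off."""
    return {
        "num_classes": num_classes,
        "backbone": backbone,  # assumed backbone name, adjust as needed
        "use_mixed_loss": False,  # the setting reported to avoid the error
    }


def build_deeplab(num_classes):
    """Create the model; paddlex is imported lazily so the helper above works without it."""
    import paddlex as pdx
    return pdx.seg.DeepLabV3P(**deeplab_kwargs(num_classes))
```

`build_deeplab(num_classes=2)` would then be trained as usual. Whether the mixed loss is the actual cause is not established in this thread.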

PaddlePaddle / PaddleX

OSError: (External) CUDA error(700), an illegal memory access was encountered. #1614
