PaddlePaddle / PaddleX

PaddlePaddle End-to-End Development Toolkit (PaddlePaddle low-code development tool)
Apache License 2.0

OSError: (External) CUDA error(700), an illegal memory access was encountered. #1614

Open xxPete opened 1 year ago

xxPete commented 1 year ago

Checklist:

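  1. Searched related past issues for an answer
  2. Read the FAQ summary and Q&A
  3. Confirmed that the bug is still not fixed in the latest version
  4. Read the PaddleX documentation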

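Problem description

  1. Have you been able to run the tutorials we provide?

    • Yes, they run normally
  2. Did you modify the code on top of the tutorial? If so, please provide the code you ran

    • No
  3. Which dataset are you using?

    • The Xiaoduxiong (小度熊) instance segmentation dataset
  4. Please provide the error message and the related log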

    2022-10-09 09:05:30,360-WARNING: type object 'QuantizationTransformPass' has no attribute '_supported_quantizable_op_type'
    2022-10-09 09:05:30,360-WARNING: If you want to use training-aware and post-training quantization, please use Paddle >= 1.8.4 or develop version
    D:\Project\PaddleX\PaddleX-develop\paddlex\ppcls\data\preprocess\ops\timm_autoaugment.py:38: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
    _RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
    D:\Project\PaddleX\PaddleX-develop\paddlex\ppcls\data\preprocess\ops\timm_autoaugment.py:38: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
    _RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
    Warning: import ppdet from source directory without installing, run 'python setup.py install' to install ppdet firstly
    loading annotations into memory...
    Done (t=0.00s)
    creating index...
    index created!
    2022-10-09 09:05:31 [INFO]      Starting to read file list from dataset...
    2022-10-09 09:05:31 [INFO]      14 samples in file ./dataset/xiaoduxiong_ins_det/train.json, including 14 positive samples and 0 negative samples.
    loading annotations into memory...
    Done (t=0.00s)
    creating index...
    index created!
    2022-10-09 09:05:31 [INFO]      Starting to read file list from dataset...
    2022-10-09 09:05:31 [INFO]      4 samples in file ./dataset/xiaoduxiong_ins_det/val.json, including 4 positive samples and 0 negative samples.
    W1009 09:05:31.109730 19380 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.6
    W1009 09:05:31.112730 19380 gpu_resources.cc:91] device: 0, cuDNN Version: 8.6.
    2022-10-09 09:05:31 [INFO]      Loading pretrained model from output/mask_rcnn_r50_fpn\pretrain\mask_rcnn_r50_fpn_2x_coco.pdparams
    2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_score.weight doesn't match.(Pretrained: [1024, 81], Actual: [1024, 2])
    2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_score.bias doesn't match.(Pretrained: [81], Actual: [2])
    2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_delta.weight doesn't match.(Pretrained: [1024, 320], Actual: [1024, 4])
    2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params bbox_head.bbox_delta.bias doesn't match.(Pretrained: [320], Actual: [4])
    2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params mask_head.mask_fcn_logits.weight doesn't match.(Pretrained: [80, 256, 1, 1], Actual: [1, 256, 1, 1])
    2022-10-09 09:05:32 [WARNING]   [SKIP] Shape of pretrained params mask_head.mask_fcn_logits.bias doesn't match.(Pretrained: [80], Actual: [1])
    2022-10-09 09:05:32 [INFO]      There are 301/307 variables loaded into MaskRCNN.
    Traceback (most recent call last):
    File ".\train_xiaodu.py", line 40, in <module>
    use_vdl=False)
    File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 2188, in train
    early_stop_patience, use_vdl, resume_checkpoint)
    File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 334, in train
    use_vdl=use_vdl)
    File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\base.py", line 339, in train_loop
    outputs = self.run(self.net, data, mode='train')
    File "D:\Project\PaddleX\PaddleX-develop\paddlex\cv\models\detector.py", line 105, in run
    net_out = net(inputs)
    File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
    File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
    File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\meta_arch.py", line 59, in forward
    out = self.get_loss()
    File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\mask_rcnn.py", line 123, in get_loss
    bbox_loss, mask_loss, rpn_loss = self._forward()
    File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\architectures\mask_rcnn.py", line 93, in _forward
    rois, rois_num, rpn_loss = self.rpn_head(body_feats, self.inputs)
    File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
    File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
    File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\proposal_generator\rpn_head.py", line 140, in forward
    loss = self.get_loss(scores, deltas, anchors, inputs)
    File "D:\Project\PaddleX\PaddleX-develop\paddlex\ppdet\modeling\proposal_generator\rpn_head.py", line 278, in get_loss
    pos_ind = paddle.nonzero(pos_mask)
    File "D:\Project\PaddleX\PaddleX-develop\venv\lib\site-packages\paddle\tensor\search.py", line 402, in nonzero
    outs = _C_ops.where_index(x)
    OSError: (External) CUDA error(700), an illegal memory access was encountered.
    [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ..\paddle\phi\backends\gpu\cuda\cuda_info.cc:251)
    [operator < where_index > error]

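Environment

  5. Please provide the versions of PaddlePaddle and PaddleX you are using

    • paddlepaddle-gpu 2.3.2.post116
    • paddlex 2.1.0
  6. Please provide your operating system, e.g. Linux/Windows/MacOS

    • Windows
  7. Which Python version are you using?

    • 3.7
  8. Which CUDA/cuDNN versions are you using?

    • 11.6/8.6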
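
A note on reading the traceback above: CUDA kernels are launched asynchronously, so an illegal memory access is often reported at a later synchronizing call (here `paddle.nonzero` / `where_index`) rather than at the kernel that actually faulted. A minimal sketch of one common way to narrow this down is to force synchronous launches with the CUDA environment variable `CUDA_LAUNCH_BLOCKING` before the framework is imported. The helper name `cuda_debug_env` is hypothetical and only for illustration; whether this moves the reported location for this particular bug is an assumption, not something confirmed in the thread.

```python
import os

# Set before the CUDA context is created, i.e. before `import paddle`
# or `import paddlex` at the top of the training script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def cuda_debug_env() -> dict:
    """Return the CUDA_* environment variables visible to this process."""
    return {k: v for k, v in os.environ.items() if k.startswith("CUDA_")}


if __name__ == "__main__":
    # Quick check that the variable is set before training starts.
    print(cuda_debug_env())
```

With synchronous launches the run is slower, so this is only meant for reproducing the crash, not for regular training.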
SUNbrightness commented 1 year ago

Using paddlepaddle-gpu 2.1.3.post112 resolves the problem.
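
To compare the failing and the working builds mentioned in this thread, the version strings can be split into a release number and a CUDA tag. The sketch below assumes the usual wheel naming convention in which a `.postNNN` suffix encodes the CUDA version (`post116` for CUDA 11.6, `post112` for CUDA 11.2); `PaddleBuild` and `parse_build` are hypothetical helper names, not part of PaddlePaddle.

```python
import re
from typing import NamedTuple, Optional, Tuple


class PaddleBuild(NamedTuple):
    release: Tuple[int, ...]   # e.g. (2, 3, 2)
    cuda: Optional[str]        # e.g. "11.6", or None if there is no .post tag


def parse_build(version: str) -> PaddleBuild:
    """Split a string such as '2.3.2.post116' into release and CUDA tag.

    Assumption: the digits after '.post' are the CUDA version with the
    last digit being the minor version (116 -> 11.6).
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:\.post(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    tag = match.group(2)
    cuda = f"{tag[:-1]}.{tag[-1]}" if tag and len(tag) >= 2 else None
    return PaddleBuild(release, cuda)


failing = parse_build("2.3.2.post116")   # build from the original report
working = parse_build("2.1.3.post112")   # build reported to work
print(failing, working)
```

Both the release and the CUDA tag differ between the two builds, so the thread alone does not show which of the two changes removes the error.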

keepgoing365 commented 1 year ago

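I hit the same error on a segmentation task with the DeepLabV3P model. After setting use_mixed_loss = false the error disappeared; it seems DeepLabV3P cannot be used with the mixed loss function. My environment: win10, paddle-gpu 2.3.2 post112, paddlex 2.1.0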
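
The workaround in the last comment could look roughly like the sketch below. It assumes the PaddleX 2.x constructor `pdx.seg.DeepLabV3P` accepts a `use_mixed_loss` keyword (in Python the value is `False`, not `false`); `num_classes` and `backbone` are placeholder values, and `build_model` is a hypothetical helper. This applies only to the segmentation case described in that comment, not to the MaskRCNN run in the original report.

```python
# Placeholder settings; only use_mixed_loss reflects the reported workaround.
DEEPLAB_KWARGS = dict(
    num_classes=2,            # assumption: background + one class
    backbone="ResNet50_vd",   # assumption: any supported backbone
    use_mixed_loss=False,     # disable the mixed loss, as reported above
)


def build_model():
    """Create the model; paddlex is imported lazily so the settings can be
    inspected on a machine without PaddleX or a GPU."""
    import paddlex as pdx
    return pdx.seg.DeepLabV3P(**DEEPLAB_KWARGS)
```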