caiyuanhao1998 / Retinexformer

"Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement" (ICCV 2023) & (NTIRE 2024 Challenge)
https://arxiv.org/abs/2303.06705
MIT License
828 stars, 64 forks

CUDA error: unknown error #83

Closed: jiahhhao closed this issue 3 months ago

jiahhhao commented 3 months ago

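Hello, I set up the environment exactly as you described. However, in basicsr/models/archs/RetinexFormer_arch.py, an input of size inputs = torch.randn((1, 3, 416, 416)).cuda() works without any problem, but increasing it slightly to inputs = torch.randn((2, 3, 416, 416)).cuda() raises RuntimeError: CUDA error: unknown error. What causes this, and where should I start troubleshooting? Thanks for your help!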

caiyuanhao1998 commented 3 months ago

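Hello, and thanks for your interest. The main function of this network architecture script measures the model's computational cost and parameter count, so the batch size is set to 1; otherwise the utility function used for the measurement raises an error.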
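The point about batch size can be illustrated with a small sketch. It uses a hypothetical stand-in network (`toy_model`) and a hypothetical helper (`count_parameters`), not the actual RetinexFormer class or the repository's measurement utility. It only shows that a parameter count does not depend on the batch size, and that a forward pass with batch size 2 can be checked on the CPU, independently of any profiling tool.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real network; any nn.Module is handled the same way.
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.GELU(),
    nn.Conv2d(8, 3, kernel_size=3, padding=1),
)


def count_parameters(model: nn.Module) -> int:
    """Return the number of trainable parameters (independent of batch size)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


n_params = count_parameters(toy_model)
print(f"trainable parameters: {n_params}")

# Forward passes on the CPU with batch sizes 1 and 2.
# If these succeed, the architecture itself accepts a batch larger than 1.
with torch.no_grad():
    out_b1 = toy_model(torch.randn(1, 3, 32, 32))
    out_b2 = toy_model(torch.randn(2, 3, 32, 32))

print(tuple(out_b1.shape), tuple(out_b2.shape))
```

If the CPU forward pass works for both batch sizes but the same call fails on the GPU, the cause is more likely in the CUDA setup than in the model code.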

jiahhhao commented 3 months ago

Thanks for the reply! When I tried adding the IGAB module to another model, the same problem occurred. Where should I look for the cause?

caiyuanhao1998 commented 3 months ago

What problem? Please post a screenshot.

jiahhhao commented 3 months ago

Sorry for the late reply. Below is your model with some parts removed:

Screenshot 2024-06-13 155114

I then added it in front of the YOLOv4 input:

Screenshot 2024-06-13 155201

As soon as training starts, the following error appears: ![Uploading 屏幕截图 2024-06-13 155935.png…]()

How should I go about finding the cause of the error? Thanks for your help!

jiahhhao commented 3 months ago

Sorry, it seems the error screenshot failed to upload just now.

Screenshot 2024-06-13 155935
Start Train
Epoch 1/300:   0%|          | 0/1490 [00:00<?, ?it/s
<class 'dict'>
Traceback (most recent call last):
  File "/home/zjh/codeSpace/python/Paper/yolov4-pytorch/train.py", line 563, in <module>
    fit_one_epoch(model_train, model, yolo_loss, loss_history, eval_callback, optimizer, epoch, epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, fp16, scaler, save_period, save_dir, local_rank)
  File "/home/zjh/codeSpace/python/Paper/yolov4-pytorch/utils/utils_fit.py", line 34, in fit_one_epoch
    outputs         = model_train(images)
  File "/home/zjh/miniconda3/envs/Retinexformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjh/miniconda3/envs/Retinexformer/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/zjh/miniconda3/envs/Retinexformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjh/codeSpace/python/Paper/yolov4-pytorch/nets/yolo.py", line 136, in forward
    x2, x1, x0 = self.backbone(x)
  File "/home/zjh/miniconda3/envs/Retinexformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjh/codeSpace/python/Paper/yolov4-pytorch/nets/CSPdarknet.py", line 160, in forward
    x = self.conv1(x)
  File "/home/zjh/miniconda3/envs/Retinexformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjh/codeSpace/python/Paper/yolov4-pytorch/nets/CSPdarknet.py", line 34, in forward
    x = self.activation(x)
  File "/home/zjh/miniconda3/envs/Retinexformer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjh/codeSpace/python/Paper/yolov4-pytorch/nets/CSPdarknet.py", line 17, in forward
    return x * torch.tanh(F.softplus(x))
RuntimeError: CUDA error: unknown error
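
CUDA kernels are launched asynchronously, so the Python line named at the bottom of a traceback like the one above is not necessarily the operation that actually failed. A minimal sketch of one way to narrow this down, assuming the error is reproducible: set the `CUDA_LAUNCH_BLOCKING` environment variable before CUDA is initialized so that errors surface at the failing call, and run the activation from the last traceback frame on its own. The function name `mish` and the input size are illustrative choices, not taken from the yolov4-pytorch code.

```python
import os

# Must be set before CUDA is initialized, so set it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
import torch.nn.functional as F


def mish(x: torch.Tensor) -> torch.Tensor:
    # Same expression as the last frame of the traceback.
    return x * torch.tanh(F.softplus(x))


x = torch.randn(2, 3, 416, 416)
cpu_out = mish(x)  # reference result computed on the CPU

if torch.cuda.is_available():
    gpu_out = mish(x.cuda())
    torch.cuda.synchronize()  # force any pending CUDA error to surface here
    max_diff = (gpu_out.cpu() - cpu_out).abs().max().item()
    print(f"max CPU/GPU difference: {max_diff:.3e}")
else:
    print("CUDA not available; only the CPU path was checked")
```

If this isolated activation runs cleanly on the GPU, the failure probably originates in an earlier operation or in the environment rather than in this line.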

caiyuanhao1998 commented 3 months ago

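This is most likely because your environment is not set up correctly, or because the GPU driver is not installed properly or is incompatible.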
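One way to start checking the environment is to print what PyTorch itself reports about its CUDA build and the visible GPU. The following is a minimal sketch, not a complete diagnosis; the helper name `collect_cuda_report` and the report keys are hypothetical, while the `torch` attributes it reads are standard ones.

```python
import importlib


def collect_cuda_report() -> dict:
    """Gather basic facts about the PyTorch/CUDA setup without raising."""
    report = {"torch_installed": False}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version PyTorch was built against; None for CPU-only builds.
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["cudnn_version"] = torch.backends.cudnn.version()
    return report


report = collect_cuda_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Comparing `built_with_cuda` with the driver version shown by `nvidia-smi` is a common first check, since a driver that is too old for the CUDA build PyTorch ships with can produce errors at runtime.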

Screenshot 2024-06-13 9 35 59 PM

If you find our repo useful, could you help support it with a fork? Thanks.

jiahhhao commented 3 months ago

OK, I'll set up a fresh environment and try again. Thank you!