Training fails after updating to the 2.0 stable release

thugbobby commented 3 years ago

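After updating to the PaddlePaddle 2.0 stable release today, the quick-start roadsign training fails. I have not modified the config file. The training command is:

python tools/train.py -c configs/yolov3_mobilenet_v1_roadsign.yml --eval -o use_gpu=true

The error output is: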

2021-01-30 11:37:47,682-INFO: If regularizer of a Parameter has been set by 'fluid.ParamAttr' or 'fluid.WeightNormParamAttr' already. The Regularization[L2Decay, regularization_coeff=0.000500] in Optimizer will not take effect, and it will only be applied to other Parameters!
2021-01-30 11:37:48,489-ERROR: Config dataset_dir dataset/roadsign_voc is not exits!
2021-01-30 11:37:48,489-WARNING: Config annotation dataset/roadsign_voc\valid.txt is not a file, dataset config is not valid
2021-01-30 11:37:48,490-INFO: Dataset E:\workspace\PaddleDetection\dataset\roadsign_voc is not valid for reason above, try searching C:\Users\thugbobby/.cache/paddle/dataset or downloading dataset...
2021-01-30 11:37:48,492-INFO: Found C:\Users\thugbobby/.cache/paddle/dataset\roadsign_voc\annotations
2021-01-30 11:37:48,508-INFO: Found C:\Users\thugbobby/.cache/paddle/dataset\roadsign_voc\images
W0130 11:37:48.674729 10292 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.2, Runtime API Version: 10.2
W0130 11:37:48.707641 10292 device_context.cc:372] device: 0, cuDNN Version: 7.6.
2021-01-30 11:37:50,916-WARNING: variable yolo_output.1.conv.bias not used
2021-01-30 11:37:50,916-WARNING: variable yolo_output.0.conv.bias not used
2021-01-30 11:37:50,917-WARNING: variable yolo_output.0.conv.weights not used
2021-01-30 11:37:50,917-WARNING: variable yolo_output.2.conv.weights not used
2021-01-30 11:37:50,917-WARNING: variable yolo_output.1.conv.weights not used
2021-01-30 11:37:50,917-WARNING: variable yolo_output.2.conv.bias not used
2021-01-30 11:37:51,031-ERROR: Config dataset_dir dataset/roadsign_voc is not exits!
2021-01-30 11:37:51,032-WARNING: Config annotation dataset/roadsign_voc\train.txt is not a file, dataset config is not valid
2021-01-30 11:37:51,032-INFO: Dataset E:\workspace\PaddleDetection\dataset\roadsign_voc is not valid for reason above, try searching C:\Users\thugbobby/.cache/paddle/dataset or downloading dataset...
2021-01-30 11:37:51,033-INFO: Found C:\Users\thugbobby/.cache/paddle/dataset\roadsign_voc\annotations
2021-01-30 11:37:51,033-INFO: Found C:\Users\thugbobby/.cache/paddle/dataset\roadsign_voc\images
W0130 11:37:51.497886 10292 build_strategy.cc:171] fusion_group is not enabled for Windows/MacOS now, and only effective when running with CUDA GPU.
E:\workspace\PaddleDetection\ppdet\data\reader.py:89: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
  if isinstance(item, collections.Sequence) and len(item) == 0:
tools/train.py:283: RuntimeWarning: divide by zero encountered in double_scalars
  ips = float(cfg['TrainReader']['batch_size']) / time_cost
2021-01-30 11:37:52,254-INFO: iter: 0, lr: 0.000033, 'loss': '15265.234375', eta: 0:00:00, batch_cost: 0.00000 sec, ips: inf images/sec
2021-01-30 11:38:00,457-INFO: iter: 20, lr: 0.000047, 'loss': '27.008541', eta: 0:25:11, batch_cost: 0.42224 sec, ips: 18.94664 images/sec
2021-01-30 11:38:09,789-INFO: iter: 40, lr: 0.000060, 'loss': '21.406960', eta: 0:28:14, batch_cost: 0.47588 sec, ips: 16.81111 images/sec
2021-01-30 11:38:17,577-INFO: iter: 60, lr: 0.000073, 'loss': '15.545045', eta: 0:23:27, batch_cost: 0.39759 sec, ips: 20.12127 images/sec
2021-01-30 11:38:26,032-INFO: iter: 80, lr: 0.000087, 'loss': '13.640253', eta: 0:24:33, batch_cost: 0.41864 sec, ips: 19.10942 images/sec
2021-01-30 11:38:36,156-INFO: iter: 100, lr: 0.000100, 'loss': '12.718939', eta: 0:28:25, batch_cost: 0.48719 sec, ips: 16.42061 images/sec
2021-01-30 11:38:45,802-INFO: iter: 120, lr: 0.000100, 'loss': '10.709991', eta: 0:27:41, batch_cost: 0.47736 sec, ips: 16.75901 images/sec
2021-01-30 11:38:53,403-INFO: iter: 140, lr: 0.000100, 'loss': '12.132683', eta: 0:23:03, batch_cost: 0.39985 sec, ips: 20.00771 images/sec
2021-01-30 11:39:02,317-INFO: iter: 160, lr: 0.000100, 'loss': '11.912944', eta: 0:24:43, batch_cost: 0.43114 sec, ips: 18.55555 images/sec
2021-01-30 11:39:09,470-INFO: iter: 180, lr: 0.000100, 'loss': '11.560947', eta: 0:20:04, batch_cost: 0.35221 sec, ips: 22.71392 images/sec
2021-01-30 11:39:17,629-INFO: iter: 200, lr: 0.000100, 'loss': '10.136917', eta: 0:24:29, batch_cost: 0.43212 sec, ips: 18.51357 images/sec
2021-01-30 11:39:17,632-INFO: Save model to output\yolov3_mobilenet_v1_roadsign\200.
W0130 11:39:18.891667 10292 build_strategy.cc:171] fusion_group is not enabled for Windows/MacOS now, and only effective when running with CUDA GPU.
Traceback (most recent call last):
  File "tools/train.py", line 399, in <module>
    main()
  File "tools/train.py", line 308, in main
    results = eval_run(
  File "E:\workspace\PaddleDetection\ppdet\utils\eval_utils.py", line 146, in eval_run
    outs = exe.run(compile_program,
  File "C:\Users\thugbobby\anaconda3\lib\site-packages\paddle\fluid\executor.py", line 1110, in run
    six.reraise(*sys.exc_info())
  File "C:\Users\thugbobby\anaconda3\lib\site-packages\six.py", line 703, in reraise
    raise value
  File "C:\Users\thugbobby\anaconda3\lib\site-packages\paddle\fluid\executor.py", line 1098, in run
    return self._run_impl(
  File "C:\Users\thugbobby\anaconda3\lib\site-packages\paddle\fluid\executor.py", line 1244, in _run_impl
    return self._run_parallel(
  File "C:\Users\thugbobby\anaconda3\lib\site-packages\paddle\fluid\executor.py", line 913, in _run_parallel
    tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
OSError: (External)  Cublas error, `CUBLAS_STATUS_ALLOC_FAILED`. Resource allocation failed inside the cuBLAS library.  (at D:\v2.0.0\paddle\paddle/fluid/platform/cuda_helper.h:81)

What is causing this? Thanks.

heavengate commented 3 years ago

Hi, which versions of Paddle and PaddleDetection are you using? Also, the error is raised in cuBLAS, so could your CUDA and cuDNN versions be mismatched with the versions Paddle requires?

thugbobby commented 3 years ago

Hi, Paddle is the 2.0.0 stable release and PaddleDetection is version 0.5. CUDA is 10.2 and cuDNN is 7.6, all installed as required.
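
The versions can be cross-checked against the device_context.cc warnings in the log above. Below is a minimal sketch, assuming the warning format is exactly the one shown in that log. The function name extract_gpu_versions and the SAMPLE_LOG variable are hypothetical helpers for illustration, not part of Paddle or PaddleDetection.

```python
import re

# Two warning lines copied from the log in this report.
SAMPLE_LOG = (
    "W0130 11:37:48.674729 10292 device_context.cc:362] Please NOTE: device: 0, "
    "GPU Compute Capability: 7.5, Driver API Version: 11.2, Runtime API Version: 10.2\n"
    "W0130 11:37:48.707641 10292 device_context.cc:372] device: 0, cuDNN Version: 7.6.\n"
)

# Assumption: a version is dot-separated digits, e.g. "10.2" or "7.6".
_VERSION = r"\d+(?:\.\d+)*"

# Assumption: the field names and their order match the warning text above.
_CUDA_LINE = re.compile(
    rf"GPU Compute Capability: (?P<compute_capability>{_VERSION}), "
    rf"Driver API Version: (?P<driver>{_VERSION}), "
    rf"Runtime API Version: (?P<runtime>{_VERSION})"
)
_CUDNN_LINE = re.compile(rf"cuDNN Version: (?P<cudnn>{_VERSION})")


def extract_gpu_versions(log_text):
    """Return the GPU-related versions found in a Paddle startup log.

    Keys that cannot be found in the text are simply absent from the result.
    """
    versions = {}
    for pattern in (_CUDA_LINE, _CUDNN_LINE):
        match = pattern.search(log_text)
        if match:
            versions.update(match.groupdict())
    return versions


if __name__ == "__main__":
    for name, value in extract_gpu_versions(SAMPLE_LOG).items():
        print(f"{name}: {value}")
```

For the log in this report, the sketch reports CUDA runtime 10.2, driver API 11.2 and cuDNN 7.6, which agrees with the versions stated above. Agreement here only confirms what was installed. It does not by itself explain the CUBLAS_STATUS_ALLOC_FAILED error.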

thinkthinking commented 1 year ago

Thanks for using PaddleDetection. Feel free to reopen this issue if you still have problems.

PaddlePaddle / PaddleDetection

Training fails after updating to the 2.0 stable release #2143