Stinky-Tofu / Stronger-yolo

🔥Improve yolo with latest paper
MIT License
3 stars 0 forks

failed to run cuBLAS in RTX2080 #75

Open HochCC opened 5 years ago

HochCC commented 5 years ago

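Thank you very much for your work. I ran into a problem when running it, though. I created a new environment with conda and installed TensorFlow and the other packages at exactly the versions listed in the README, but I get the error below. I saw that your graphics card is also a GeForce RTX 2080, so I wonder whether you have run into a similar problem.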

2019-06-13 18:08:00.822300: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2019-06-13 18:08:01.048451: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:65:00.0
totalMemory: 7.76GiB freeMemory: 7.27GiB
2019-06-13 18:08:01.048470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2019-06-13 18:08:01.247636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-13 18:08:01.247660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2019-06-13 18:08:01.247664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2019-06-13 18:08:01.247799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7005 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:65:00.0, compute capability: 7.5)
INFO:tensorflow:Restoring parameters from darknet2tf/saved_model/darknet53.ckpt
2019-06-13 18:08:16.501987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2019-06-13 18:08:16.502030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-13 18:08:16.502036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2019-06-13 18:08:16.502041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2019-06-13 18:08:16.502168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7005 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:65:00.0, compute capability: 7.5)
INFO:tensorflow:Restoring parameters from darknet2tf/saved_model/darknet53.ckpt
2019-06-13 18:08:27.502570: E tensorflow/stream_executor/cuda/cuda_blas.cc:654] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/rmb-wx/Xinjie/Stronger-yolo-master/v2/train.py", line 204, in <module>
    Yolo_train().train()
  File "/home/rmb-wx/Xinjie/Stronger-yolo-master/v2/train.py", line 159, in train
    self.__training: True
  File "/home/rmb-wx/anaconda3/envs/tf2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/rmb-wx/anaconda3/envs/tf2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/rmb-wx/anaconda3/envs/tf2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/rmb-wx/anaconda3/envs/tf2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : m=102400, n=32, k=64
  [[Node: yolov3/darknet53/stage1/residual0/conv1/Conv2D = Conv2D[T=DT_FLOAT, _class=["loc:@yolov...ean/Switch"], data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](yolov3/darknet53/stage0/conv1/LeakyRelu/Maximum, yolov3/darknet53/stage1/residual0/conv1/weight/read)]]
  [[Node: loss/add_1/_1765 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_28519_loss/add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'yolov3/darknet53/stage1/residual0/conv1/Conv2D', defined at:
  File "/home/rmb-wx/Xinjie/Stronger-yolo-master/v2/train.py", line 204, in <module>
    Yolo_train().train()
  File "/home/rmb-wx/Xinjie/Stronger-yolo-master/v2/train.py", line 54, in __init__
    pred_sbbox, pred_mbbox, pred_lbbox = yolo.build_nework(self.input_data)
  File "/home/rmb-wx/Xinjie/Stronger-yolo-master/v2/model/head/yolov3.py", line 36, in build_nework
    darknet_route0, darknet_route1, darknet_route2 = darknet53(input_data, self.training)
  File "/home/rmb-wx/Xinjie/Stronger-yolo-master/v2/model/backbone/darknet53.py", line 19, in darknet53
    filter_num1=32, filter_num2=64, training=training)
  File "/home/rmb-wx/Xinjie/Stronger-yolo-master/v2/model/layers.py", line 110, in residual_block
    training=training)
  File "/home/rmb-wx/Xinjie/Stronger-yolo-master/v2/model/layers.py", line 86, in convolutional
    conv = tf.nn.conv2d(input=input_data, filter=weight, strides=strides, padding=padding)
  File "/home/rmb-wx/anaconda3/envs/tf2/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 956, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/home/rmb-wx/anaconda3/envs/tf2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/rmb-wx/anaconda3/envs/tf2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/rmb-wx/anaconda3/envs/tf2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): Blas SGEMM launch failed : m=102400, n=32, k=64
  [[Node: yolov3/darknet53/stage1/residual0/conv1/Conv2D = Conv2D[T=DT_FLOAT, _class=["loc:@yolov...ean/Switch"], data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](yolov3/darknet53/stage0/conv1/LeakyRelu/Maximum, yolov3/darknet53/stage1/residual0/conv1/weight/read)]]
  [[Node: loss/add_1/_1765 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_28519_loss/add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Stinky-Tofu commented 5 years ago

Sorry, I can't be of much help with environment issues. It is probably a CUDA version problem; I am using CUDA 10.0.

HochCC commented 5 years ago

OK, thanks :relaxed:. It failed for me with both CUDA 9.0 and CUDA 10.1; I'll try CUDA 10.0 next.
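
For anyone landing here with the same error, a rough way to sanity-check the CUDA versions mentioned in this thread is to compare them against the compute capability that TensorFlow prints in its device log. The sketch below is only illustrative: the helper names `compute_capability` and `cuda_supports` and the small `MIN_CUDA` table are assumptions made for this example, not part of this repository or of TensorFlow. The table lists the CUDA release that first supported a few common architectures.

```python
import re

# Minimum CUDA toolkit release that added support for each compute capability.
# Illustrative table covering only a few architectures (assumption: extend as needed).
MIN_CUDA = {
    (6, 0): (8, 0),   # Pascal
    (6, 1): (8, 0),   # Pascal
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing, e.g. GeForce RTX 2080
}


def compute_capability(log_text):
    """Return (major, minor) parsed from a TensorFlow device log line, or None."""
    match = re.search(r"compute capability: (\d+)\.(\d+)", log_text)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def cuda_supports(cuda_version, capability):
    """True if cuda_version is at least the minimum release listed for the GPU."""
    return cuda_version >= MIN_CUDA[capability]


# Fragment of the device line from the log in this issue.
log_line = "name: GeForce RTX 2080, pci bus id: 0000:65:00.0, compute capability: 7.5)"
cc = compute_capability(log_line)

for cuda in [(9, 0), (10, 0), (10, 1)]:
    print("CUDA %d.%d supports capability %d.%d: %s"
          % (cuda[0], cuda[1], cc[0], cc[1], cuda_supports(cuda, cc)))
```

This covers only the toolkit side. A TensorFlow binary built against a different CUDA version than the one installed can still fail, and this check does not detect that, which may be why CUDA 10.1 passes here yet did not work in practice.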