PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

On a cloud machine, version 0.11.0 fails with hl_gpu_matrix_kernel.cuh:181] Check failed: cudaSuccess == err (0 vs. 8) [hl_gpu_apply_unary_op failed] CUDA error: invalid device function #6524

Closed mengquann closed 6 years ago

mengquann commented 6 years ago

I installed the latest GPU build of Paddle with the command `pip install paddle-gpu=0.11.0`.
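(Not part of the original report: the snippet below is a minimal sketch of verifying which GPU wheel pip actually installed and whether GPU initialization works at all. It assumes the PyPI package name `paddlepaddle-gpu` and the old v2 API of the 0.11.x era, where `paddle.init(use_gpu=...)` selects the device; adjust if your build differs.)

```python
# Minimal sanity check (illustrative sketch, not from the original thread).
import subprocess, sys

# Confirm which wheel pip actually installed (raises if the package is not found).
# The package name "paddlepaddle-gpu" is an assumption about the PyPI naming.
subprocess.check_call([sys.executable, "-m", "pip", "show", "paddlepaddle-gpu"])

# Assumed 0.11.x v2 API: this fails early if the GPU build cannot initialize CUDA.
import paddle.v2 as paddle
paddle.init(use_gpu=True, trainer_count=1)
print("GPU initialization OK")
```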

The machine's CUDA info is:

```
Device 0: "TITAN X (Pascal)"
  CUDA Driver Version / Runtime Version:         8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 12189 MBytes (12781551616 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1531 MHz (1.53 GHz)
  Memory Clock rate:                             5005 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z):        1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers: 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers: 2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 6
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
```

Since the pip package supports cuDNN 5.0 / CUDA 8.0, I downloaded cuDNN 5.0 from the NVIDIA website and added it to the environment by running `export LD_LIBRARY_PATH=/mnt/home/work/cudnn/cudnn_v5.0/cuda/lib64:$LD_LIBRARY_PATH`.
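(Not part of the original comment: a minimal sketch of one way to confirm that the libcudnn resolved via LD_LIBRARY_PATH really is the 5.0 copy; it only assumes the standard `cudnnGetVersion()` entry point exported by libcudnn.)

```python
# Illustrative check: load whichever libcudnn LD_LIBRARY_PATH resolves to and
# print its version number, so a stale system-wide cuDNN can be ruled out.
import ctypes

# Depending on how cuDNN was unpacked, the file may be named libcudnn.so.5 instead.
libcudnn = ctypes.CDLL("libcudnn.so")
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN version:", libcudnn.cudnnGetVersion())  # e.g. 5005 would mean 5.0.5
```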

Running train.py under paddle's models/image_classification hits the following error:

```
[INFO 2017-12-12 17:25:11,499 layers.py:2829] output for __pool_1__: c = 2048, h = 1, w = 1, size = 2048
F1212 17:25:13.790470 108736 hl_gpu_matrix_kernel.cuh:181] Check failed: cudaSuccess == err (0 vs. 8) [hl_gpu_apply_unary_op failed] CUDA error: invalid device function
Check failure stack trace:
    @     0x7f9762f06bcd  google::LogMessage::Fail()
    @     0x7f9762f0a67c  google::LogMessage::SendToLog()
    @     0x7f9762f066f3  google::LogMessage::Flush()
    @     0x7f9762f0bb8e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f9762d7b3eb  hl_gpu_apply_unary_op<>()
    @     0x7f9762d7b75d  paddle::BaseMatrixT<>::applyUnary<>()
    @     0x7f9762d7b9a3  paddle::BaseMatrixT<>::zero()
    @     0x7f9762c12655  paddle::Parameter::enableType()
    @     0x7f9762c0dd1c  paddle::parameterInitNN()
    @     0x7f9762c1058e  paddle::NeuralNetwork::init()
    @     0x7f9762c39eaf  paddle::GradientMachine::create()
    @     0x7f9762ee3495  GradientMachine::createFromPaddleModelPtr()
    @     0x7f9762ee367f  GradientMachine::createByConfigProtoStr()
    @     0x7f9762ac0717  _wrap_GradientMachine_createByConfigProtoStr
    @     0x4cb755        PyEval_EvalFrameEx
    @     0x4c2705        PyEval_EvalCodeEx
    @     0x4ca7df        PyEval_EvalFrameEx
    @     0x4c2705        PyEval_EvalCodeEx
    @     0x4ca088        PyEval_EvalFrameEx
    @     0x4c2705        PyEval_EvalCodeEx
    @     0x4de858        (unknown)
    @     0x4b0c93        PyObject_Call
    @     0x4f452e        (unknown)
    @     0x4b0c93        PyObject_Call
    @     0x4f42a7        (unknown)
    @     0x4b669c        (unknown)
    @     0x4b0c93        PyObject_Call
    @     0x4c9f9f        PyEval_EvalFrameEx
    @     0x4c2705        PyEval_EvalCodeEx
    @     0x4ca7df        PyEval_EvalFrameEx
    @     0x4c2705        PyEval_EvalCodeEx
    @     0x4c24a9        PyEval_EvalCode
Aborted (core dumped)
```

Looking through older issues, the suggested fix was always to modify flags.cmake, but the current flags.cmake is quite different from the one in those issues, so the old fix no longer seems to apply. How can I resolve this?

typhoonzero commented 6 years ago

The default PyPI package should be built against CUDA 7.5. You can find the whl package matching your environment at http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/build_and_install/pip_install_cn.html and install that instead.
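(A sketch of what that might look like; the wheel filename below is hypothetical and should be replaced with the CUDA 8.0 / cuDNN 5 build listed on that page for your Python version.)

```python
# Hypothetical example: install a specific whl downloaded from the docs page above.
import subprocess, sys

whl = "paddlepaddle_gpu-0.11.0-cp27-cp27mu-manylinux1_x86_64.whl"  # hypothetical filename
subprocess.check_call([sys.executable, "-m", "pip", "install", "--force-reinstall", whl])
```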

Yancey1989 commented 6 years ago

The version on PyPI is built with CUDA 7.5 + cuDNN 5.

mengquann commented 6 years ago

@typhoonzero @Yancey1989 Thanks for the answers. After installing from the whl, the GPU works, but there is another problem: running the image classification example https://github.com/PaddlePaddle/models/tree/develop/image_classification fails with Fail to allocate GPU memory 37888 bytes

```
[INFO 2017-12-13 13:58:33,034 layers.py:3264] output for batch_norm_51: c = 512, h = 7, w = 7, size = 25088
[INFO 2017-12-13 13:58:33,035 layers.py:2696] output for conv_52: c = 2048, h = 7, w = 7, size = 100352
[INFO 2017-12-13 13:58:33,036 layers.py:3264] output for batch_norm_52: c = 2048, h = 7, w = 7, size = 100352
[INFO 2017-12-13 13:58:33,037 layers.py:2838] output for pool_1: c = 2048, h = 1, w = 1, size = 2048
F1213 13:58:37.907372 120471 Allocator.h:89] Check failed: ptr Fail to allocate GPU memory 37888 bytes
Check failure stack trace:
    @     0x7f0ef9c5c6ad  google::LogMessage::Fail()
    @     0x7f0ef9c6015c  google::LogMessage::SendToLog()
    @     0x7f0ef9c5c1d3  google::LogMessage::Flush()
    @     0x7f0ef9c6166e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f0ef9b9477e  paddle::GpuAllocator::alloc()
    @     0x7f0ef9b77c7f  paddle::PoolAllocator::alloc()
    @     0x7f0ef9b7768f  paddle::GpuMemoryHandle::GpuMemoryHandle()
    @     0x7f0ef9b8a72e  paddle::GpuVectorT<>::GpuVectorT()
    @     0x7f0ef9b8a8e8  paddle::VectorT<>::create()
    @     0x7f0ef9b8a999  paddle::VectorT<>::createParallelVector()
    @     0x7f0ef9a0ec76  paddle::Parameter::enableType()
    @     0x7f0ef9a09f9a  paddle::parameterInitNN()
    @     0x7f0ef9a0ccae  paddle::NeuralNetwork::init()
    @     0x7f0ef9a354ef  paddle::GradientMachine::create()
    @     0x7f0ef9c39125  GradientMachine::createFromPaddleModelPtr()
    @     0x7f0ef9c3930f  GradientMachine::createByConfigProtoStr()
    @     0x7f0ef9898887  _wrap_GradientMachine_createByConfigProtoStr
    @     0x4cb755        PyEval_EvalFrameEx
    @     0x4c2705        PyEval_EvalCodeEx
    @     0x4ca7df        PyEval_EvalFrameEx
    @     0x4c2705        PyEval_EvalCodeEx
    @     0x4ca088        PyEval_EvalFrameEx
    @     0x4c2705        PyEval_EvalCodeEx
    @     0x4de858        (unknown)
    @     0x4b0c93        PyObject_Call
    @     0x4f452e        (unknown)
    @     0x4b0c93        PyObject_Call
    @     0x4f42a7        (unknown)
    @     0x4b669c        (unknown)
    @     0x4b0c93        PyObject_Call
    @     0x4c9f9f        PyEval_EvalFrameEx
    @     0x4c2705        PyEval_EvalCodeEx
Aborted (core dumped)
mengquan@iot-data-gpu-01:~/paddleTest/models/image_classification$
```

My machine actually has plenty of GPU memory.

Below is the GPU info printed when running TensorFlow:

```
2017-12-13 13:57:27.603099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:00:06.0
Total memory: 11.90GiB
Free memory: 11.61GiB
```
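(Not part of the original comment: a small sketch that queries per-GPU memory usage via nvidia-smi right before launching train.py, to rule out another process, such as a still-running TensorFlow session, already holding most of the card's memory.)

```python
# Illustrative check: print GPU memory usage just before starting training.
import subprocess

print(subprocess.check_output(
    ["nvidia-smi",
     "--query-gpu=index,memory.total,memory.used,memory.free",
     "--format=csv"]).decode())
```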

typhoonzero commented 6 years ago

Closing due to low activity.