PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle "飞桨": high-performance single-machine and distributed training for deep learning & machine learning, and cross-platform deployment)
http://www.paddlepaddle.org/
Apache License 2.0

Machine translation fails: Check failed: cudaSuccess == cudaStat (0 vs. 3) #19449

Closed · xiaolv3366 closed this 5 years ago

xiaolv3366 commented 5 years ago

1) PaddlePaddle version: paddlepaddle-gpu==1.2.0.post97
2) CPU: not specified (the issue template asks for the CPU model and which math library is used: MKL/OpenBLAS/MKLDNN/etc.)
3) GPU: Tesla P100
4) System environment: NVIDIA-Linux-x86_64-384.145.run, cuda_9.0.176_384.81_linux-run, cudnn-9.0-linux-x64-v7.0.5.tgz, Python 2.7.5, Linux izwz914zh6jvahu8pnd3drz 3.10.0-957.27.2.el7.x86_64 #1

xiaolv3366 commented 5 years ago

(screenshot of the error attached)

xiaolv3366 commented 5 years ago

This is really strange: the code fails when run under uwsgi, but works fine when started with plain python.

hong19860320 commented 5 years ago

This looks like a GPU memory allocation problem, as if the GPU is running out of memory. Could you provide a more complete log?

xiaolv3366 commented 5 years ago

Starting uWSGI 2.0.17.1 (64bit) on [Tue Aug 27 14:56:52 2019]
compiled with version: 5.4.0 20160609 on 19 December 2018 06:32:50
os: Linux-4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018
nodename: lof-dl-ser
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 48
current working directory: /root
writing pidfile to /var/run/uwsgi.pid
detected binary path: /usr/local/nginx/uwsgi/uwsgi
uWSGI running as root, you can use --uid/--gid/--chroot options
WARNING: you are running uWSGI as root !!! (use the --uid flag)
chdir() to /home/www/api/FlaskSplit
your processes number limit is 127058
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address 127.0.0.1:9000 fd 3
uWSGI running as root, you can use --uid/--gid/--chroot options
WARNING: you are running uWSGI as root !!! (use the --uid flag)
Python version: 2.7.12 (default, Dec 4 2017, 14:50:18) [GCC 5.4.0 20160609]
Python threads support is disabled. You can enable it with --enable-threads
Python main interpreter initialized at 0xbe6560
uWSGI running as root, you can use --uid/--gid/--chroot options
WARNING: you are running uWSGI as root !!! (use the --uid flag)
your server socket listen backlog is limited to 120 connections
your mercy for graceful operations on workers is 60 seconds
mapped 218760 bytes (213 KB) for 2 cores
Operational MODE: preforking
I0827 14:56:57.150194 60425 Util.cpp:166] commandline: --use_gpu=True --trainer_count=1
Init Predict Network
Init FDC Network
WSGI app 0 (mountpoint='') ready in 8 seconds on interpreter 0xbe6560 pid: 60425 (default app)
uWSGI running as root, you can use --uid/--gid/--chroot options
WARNING: you are running uWSGI as root !!! (use the --uid flag)
uWSGI is running in multiple interpreter mode
spawned uWSGI master process (pid: 60425)
spawned uWSGI worker 1 (pid: 60496, cores: 1)
spawned uWSGI worker 2 (pid: 60497, cores: 1)
F0827 14:57:06.471565 60497 hl_cuda_device.cc:294] Check failed: cudaSuccess == cudaStat (0 vs. 3) Cuda Error: initialization error
Check failure stack trace:
    @ 0x7fbcbf70a4ad  google::LogMessage::Fail()
    @ 0x7fbcbf70df5c  google::LogMessage::SendToLog()
    @ 0x7fbcbf709fd3  google::LogMessage::Flush()
    @ 0x7fbcbf70f46e  google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fbcbf6c3206  hl_malloc_host()
    @ 0x7fbcbf4ff3b6  paddle::CudaHostAllocator::alloc()
    @ 0x7fbcbf53e78f  paddle::PoolAllocator::alloc()
    @ 0x7fbcbf4fe876  paddle::CpuMemoryHandle::CpuMemoryHandle()
    @ 0x7fbcbf50b87e  paddle::CpuVectorT<>::CpuVectorT()
    @ 0x7fbcbf50bd3a  paddle::VectorT<>::create()
    @ 0x7fbcbf6f3305  IVector::create()
    @ 0x7fbcbf291a28  _wrap_IVector_create
    @ 0x7fbcf1129772  PyEval_EvalFrameEx
    @ 0x7fbcf126005c  PyEval_EvalCodeEx
    @ 0x7fbcf1128f1d  PyEval_EvalFrameEx
    @ 0x7fbcf1129044  PyEval_EvalFrameEx
    @ 0x7fbcf126005c  PyEval_EvalCodeEx
    @ 0x7fbcf11b6370  (unknown)
    @ 0x7fbcf1189273  PyObject_Call
    @ 0x7fbcf11fd3ac  (unknown)
    @ 0x7fbcf1189273  PyObject_Call
    @ 0x7fbcf112735c  PyEval_EvalFrameEx
    @ 0x7fbcf126005c  PyEval_EvalCodeEx
    @ 0x7fbcf1128f1d  PyEval_EvalFrameEx
    @ 0x7fbcf126005c  PyEval_EvalCodeEx
    @ 0x7fbcf11b6370  (unknown)
    @ 0x7fbcf1189273  PyObject_Call
    @ 0x7fbcf11fd3ac  (unknown)
    @ 0x7fbcf1189273  PyObject_Call
    @ 0x7fbcf11aa4f5  (unknown)
    @ 0x7fbcf1189273  PyObject_Call
    @ 0x7fbcf112735c  PyEval_EvalFrameEx
DAMN ! worker 2 (pid: 60497) died, killed by signal 6 :( trying respawn ...
Respawned uWSGI worker 2 (new pid: 60499)

xiaolv3366 commented 5 years ago

It works fine in the CPU environment; the error only occurs in the GPU environment.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                  N/A |
| N/A   33C    P0    33W / 250W |   1306MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   33C    P0    36W / 250W |    666MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     53708      C   /usr/local/bin/python3.7                     438MiB |
|    0     61842      C   /usr/local/nginx/uwsgi/uwsgi                 858MiB |
|    1     61842      C   /usr/local/nginx/uwsgi/uwsgi                 656MiB |
+-----------------------------------------------------------------------------+

hong19860320 commented 5 years ago

Try this environment variable to make Paddle reserve less GPU memory:

$ export FLAGS_fraction_of_gpu_memory_to_use=0.80
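For reference, a minimal sketch (not from this thread) of setting the flag from Python when the server is launched by uwsgi rather than from an interactive shell. It assumes the flag is picked up from the process environment when Paddle initializes, which can vary by Paddle version:

import os

# Hypothetical example: the flag must be in the environment before Paddle
# initializes its GPU allocator, so set it before importing paddle.
os.environ["FLAGS_fraction_of_gpu_memory_to_use"] = "0.80"

import paddle  # imported only after the flag is set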

xiaolv3366 commented 5 years ago

@hong19860320 Still the same error.

xiaolv3366 commented 5 years ago

The machine has two Tesla V100 16G GPUs, so GPU memory is sufficient.

xiaolv3366 commented 5 years ago

paddle + flask + uwsgi + nginx fails in the GPU environment but works fine on CPU. Starting it directly with the python command also works.

xiaolv3366 commented 5 years ago

The problem is solved. It was an issue with how uwsgi and Paddle work together.

hong19860320 commented 5 years ago

The problem is solved. It was an issue with how uwsgi and Paddle work together.

@xiaolv3366 Could you describe exactly how you solved it? It would help others who run into this later!

xiaolv3366 commented 5 years ago

Paddle's GPU initialization and the prediction call need to happen in the same uwsgi process. At first I thought it was an NVIDIA driver problem and kept trying different drivers.
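For readers hitting the same error: the fatal log above reports a CUDA initialization error (error 3), which commonly happens when a CUDA context is created in a parent process and then used in a child after fork(), exactly the preforking pattern uwsgi uses. Below is a minimal sketch (not the reporter's actual code) of deferring Paddle GPU initialization into the worker process that also runs the prediction; load_model() and the /translate endpoint are placeholders for the real application code.

from flask import Flask, request, jsonify

app = Flask(__name__)
_model = None  # per-worker cache: each forked uwsgi worker builds its own copy


def get_model():
    # Lazily initialize Paddle inside the worker so the CUDA context is
    # created and used in the same process.
    global _model
    if _model is None:
        import paddle  # deferred import: nothing touches the GPU in the uwsgi master
        _model = load_model()  # hypothetical helper that builds/loads the inference model
    return _model


@app.route("/translate", methods=["POST"])
def translate():
    model = get_model()
    # Hypothetical inference call; replace with the actual Paddle prediction code.
    return jsonify(result=model.infer(request.json["text"]))

uwsgi's lazy-apps option, which loads the application after forking, is another way to get the same effect without restructuring the code.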