PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle "飞桨": high-performance single-machine and distributed training for deep learning & machine learning, and cross-platform deployment)
http://www.paddlepaddle.org/
Apache License 2.0

Machine translation fails: Check failed: cudaSuccess == cudaStat (0 vs. 3) #19449

Closed · xiaolv3366 closed this 5 years ago

xiaolv3366 commented 5 years ago

1) PaddlePaddle version: paddlepaddle-gpu==1.2.0.post97
2) CPU: not specified (the issue template asks for the CPU model and which math library is used: MKL/OpenBLAS/MKLDNN/etc.)
3) GPU: Tesla P100
4) System environment: NVIDIA-Linux-x86_64-384.145.run, cuda_9.0.176_384.81_linux-run, cudnn-9.0-linux-x64-v7.0.5.tgz, Python 2.7.5, Linux izwz914zh6jvahu8pnd3drz 3.10.0-957.27.2.el7.x86_64 #1

xiaolv3366 commented 5 years ago

(screenshot of the error attached)

xiaolv3366 commented 5 years ago

This is really strange: the code fails when run under uwsgi, but works fine when started with plain python.

hong19860320 commented 5 years ago

This looks like a GPU memory allocation problem, as if the GPU is running out of memory. Could you provide a more complete log?

xiaolv3366 commented 5 years ago

Starting uWSGI 2.0.17.1 (64bit) on [Tue Aug 27 14:56:52 2019]
compiled with version: 5.4.0 20160609 on 19 December 2018 06:32:50
os: Linux-4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018
nodename: lof-dl-ser
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 48
current working directory: /root
writing pidfile to /var/run/uwsgi.pid
detected binary path: /usr/local/nginx/uwsgi/uwsgi
uWSGI running as root, you can use --uid/--gid/--chroot options
WARNING: you are running uWSGI as root !!! (use the --uid flag)
chdir() to /home/www/api/FlaskSplit
your processes number limit is 127058
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address 127.0.0.1:9000 fd 3
uWSGI running as root, you can use --uid/--gid/--chroot options
WARNING: you are running uWSGI as root !!! (use the --uid flag)
Python version: 2.7.12 (default, Dec 4 2017, 14:50:18) [GCC 5.4.0 20160609]
Python threads support is disabled. You can enable it with --enable-threads
Python main interpreter initialized at 0xbe6560
uWSGI running as root, you can use --uid/--gid/--chroot options
WARNING: you are running uWSGI as root !!! (use the --uid flag)
your server socket listen backlog is limited to 120 connections
your mercy for graceful operations on workers is 60 seconds
mapped 218760 bytes (213 KB) for 2 cores
Operational MODE: preforking
I0827 14:56:57.150194 60425 Util.cpp:166] commandline: --use_gpu=True --trainer_count=1
Init Predict Network
Init FDC Network
WSGI app 0 (mountpoint='') ready in 8 seconds on interpreter 0xbe6560 pid: 60425 (default app)
uWSGI running as root, you can use --uid/--gid/--chroot options
WARNING: you are running uWSGI as root !!! (use the --uid flag)
uWSGI is running in multiple interpreter mode
spawned uWSGI master process (pid: 60425)
spawned uWSGI worker 1 (pid: 60496, cores: 1)
spawned uWSGI worker 2 (pid: 60497, cores: 1)
F0827 14:57:06.471565 60497 hl_cuda_device.cc:294] Check failed: cudaSuccess == cudaStat (0 vs. 3) Cuda Error: initialization error
Check failure stack trace:
    @ 0x7fbcbf70a4ad  google::LogMessage::Fail()
    @ 0x7fbcbf70df5c  google::LogMessage::SendToLog()
    @ 0x7fbcbf709fd3  google::LogMessage::Flush()
    @ 0x7fbcbf70f46e  google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fbcbf6c3206  hl_malloc_host()
    @ 0x7fbcbf4ff3b6  paddle::CudaHostAllocator::alloc()
    @ 0x7fbcbf53e78f  paddle::PoolAllocator::alloc()
    @ 0x7fbcbf4fe876  paddle::CpuMemoryHandle::CpuMemoryHandle()
    @ 0x7fbcbf50b87e  paddle::CpuVectorT<>::CpuVectorT()
    @ 0x7fbcbf50bd3a  paddle::VectorT<>::create()
    @ 0x7fbcbf6f3305  IVector::create()
    @ 0x7fbcbf291a28  _wrap_IVector_create
    @ 0x7fbcf1129772  PyEval_EvalFrameEx
    @ 0x7fbcf126005c  PyEval_EvalCodeEx
    @ 0x7fbcf1128f1d  PyEval_EvalFrameEx
    @ 0x7fbcf1129044  PyEval_EvalFrameEx
    @ 0x7fbcf126005c  PyEval_EvalCodeEx
    @ 0x7fbcf11b6370  (unknown)
    @ 0x7fbcf1189273  PyObject_Call
    @ 0x7fbcf11fd3ac  (unknown)
    @ 0x7fbcf1189273  PyObject_Call
    @ 0x7fbcf112735c  PyEval_EvalFrameEx
    @ 0x7fbcf126005c  PyEval_EvalCodeEx
    @ 0x7fbcf1128f1d  PyEval_EvalFrameEx
    @ 0x7fbcf126005c  PyEval_EvalCodeEx
    @ 0x7fbcf11b6370  (unknown)
    @ 0x7fbcf1189273  PyObject_Call
    @ 0x7fbcf11fd3ac  (unknown)
    @ 0x7fbcf1189273  PyObject_Call
    @ 0x7fbcf11aa4f5  (unknown)
    @ 0x7fbcf1189273  PyObject_Call
    @ 0x7fbcf112735c  PyEval_EvalFrameEx
DAMN ! worker 2 (pid: 60497) died, killed by signal 6 :( trying respawn ...
Respawned uWSGI worker 2 (new pid: 60499)

xiaolv3366 commented 5 years ago

It works fine in the CPU environment; the error only occurs in the GPU environment.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                  N/A |
| N/A   33C    P0    33W / 250W |   1306MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   33C    P0    36W / 250W |    666MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     53708      C   /usr/local/bin/python3.7                     438MiB |
|    0     61842      C   /usr/local/nginx/uwsgi/uwsgi                 858MiB |
|    1     61842      C   /usr/local/nginx/uwsgi/uwsgi                 656MiB |
+-----------------------------------------------------------------------------+

hong19860320 commented 5 years ago

Try this environment variable to make Paddle reserve less GPU memory:

$ export FLAGS_fraction_of_gpu_memory_to_use=0.80
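For reference, a minimal sketch (not from this thread) of setting the flag from Python when the server is launched by uwsgi rather than from an interactive shell. It assumes the flag is picked up from the process environment when Paddle initializes, which can vary by Paddle version:

import os

# Hypothetical example: the flag must be in the environment before Paddle
# initializes its GPU allocator, so set it before importing paddle.
os.environ["FLAGS_fraction_of_gpu_memory_to_use"] = "0.80"

import paddle  # imported only after the flag is set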

xiaolv3366 commented 5 years ago

@hong19860320 Still the same error.

xiaolv3366 commented 5 years ago

The machine has two Tesla V100 16G GPUs, so GPU memory is sufficient.

xiaolv3366 commented 5 years ago

paddle + flask + uwsgi + nginx fails in the GPU environment but works fine on CPU. Starting it directly with the python command also works.

xiaolv3366 commented 5 years ago

The problem is solved. It was an issue with how uwsgi and Paddle work together.

hong19860320 commented 5 years ago

The problem is solved. It was an issue with how uwsgi and Paddle work together.

@xiaolv3366 Could you describe exactly how you solved it? It would help others who run into this later!

xiaolv3366 commented 5 years ago

Paddle's GPU initialization and the prediction call need to happen in the same uwsgi process. At first I thought it was an NVIDIA driver problem and kept trying different drivers.
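For readers hitting the same error: the fatal log above reports a CUDA initialization error (error 3), which commonly happens when a CUDA context is created in a parent process and then used in a child after fork(), exactly the preforking pattern uwsgi uses. Below is a minimal sketch (not the reporter's actual code) of deferring Paddle GPU initialization into the worker process that also runs the prediction; load_model() and the /translate endpoint are placeholders for the real application code.

from flask import Flask, request, jsonify

app = Flask(__name__)
_model = None  # per-worker cache: each forked uwsgi worker builds its own copy


def get_model():
    # Lazily initialize Paddle inside the worker so the CUDA context is
    # created and used in the same process.
    global _model
    if _model is None:
        import paddle  # deferred import: nothing touches the GPU in the uwsgi master
        _model = load_model()  # hypothetical helper that builds/loads the inference model
    return _model


@app.route("/translate", methods=["POST"])
def translate():
    model = get_model()
    # Hypothetical inference call; replace with the actual Paddle prediction code.
    return jsonify(result=model.infer(request.json["text"]))

uwsgi's lazy-apps option, which loads the application after forking, is another way to get the same effect without restructuring the code.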