PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core PaddlePaddle framework for high-performance single-machine and distributed deep learning & machine learning training, and cross-platform deployment)
http://www.paddlepaddle.org/
Apache License 2.0

Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered #10436

Closed: wuyan08 closed this issue 6 years ago

wuyan08 commented 6 years ago

Inside a Docker container, inference with the GPU build of the Paddle v2 C-API lib raises a CUDA-related error, while the CUDA samples bundled with the toolkit run without problems. The C-API lib was compiled from the latest source with the following options:

-DCMAKE_INSTALL_PREFIX=/home/capi_install/ -DCMAKE_BUILD_TYPE=Release -DWITH_C_API=ON -DWITH_SWIG_PY=OFF -DWITH_PYTHON=OFF

Docker image: docker.paddlepaddlehub.com/paddle:latest-dev
CUDA version: 8.0
cuDNN version: 7.0
GPU: Tesla K40m, driver: 384.66

Error output:

I0506 13:15:25.327255 18343 Util.cpp:166] commandline: --use_gpu=True
I0506 13:15:29.571224 18343 GradientMachine.cpp:94] Initing parameters..
I0506 13:16:06.738917 18343 GradientMachine.cpp:101] Init parameters done.
I0506 13:16:06.738988 18343 GradientMachine.cpp:83] Loading parameters from ./data/left
F0506 13:16:07.771567 18343 hl_cuda_device.cc:565] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
Check failure stack trace:
    @ 0x7f5bc24da51d google::LogMessage::Fail()
    @ 0x7f5bc24dc868 google::LogMessage::SendToLog()
    @ 0x7f5bc24da02b google::LogMessage::Flush()
    @ 0x7f5bc24dd73e google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f5bc29894f7 hl_stream_synchronize()
    @ 0x7f5bc29a8377 hl_matrix_select_rows()
    @ 0x7f5bc27b33cf paddle::GpuMatrix::selectRows()
    @ 0x7f5bc25be6a9 paddle::TableProjection::forward()
    @ 0x7f5bc256ea7c paddle::MixedLayer::forward()
    @ 0x7f5bc26d9139 paddle::NeuralNetwork::forward()
    @ 0x7f5bc24cf5a8 paddle_gradient_machine_forward
    @ 0x4012d0 main
    @ 0x7f5bc20bd830 __libc_start_main
    @ 0x400dd9 _start
    @ (nil) (unknown)
Aborted

Could you help me figure out what the problem is? Thanks!

Superjomn commented 6 years ago

My feeling is that the model itself may be the problem. This pile of errors is printed by GLOG; as far as I can tell, only a broken user model or parameter file would trigger such a severe exception.

wuyan08 commented 6 years ago

The model was trained on CPU, and inference with the CPU build of the C-API lib works fine. Does a model used for GPU inference have to be trained on GPU?

Superjomn commented 6 years ago

Could you verify whether it already crashes at the parameter-loading stage, or only once inference on the data starts?
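
One minimal way to check this is a sketch like the one below. It is not code from the issue: it assumes stdio.h is included and reuses the variable and function names from the demo code wuyan08 posts further down the thread, logging each paddle_error return code instead of aborting via CHECK. If the process aborts before the second message appears, the crash is inside the forward pass rather than in parameter loading.

/* Sketch (assumption): log the return code of each stage explicitly.
   The fatal glog check seen in the log aborts the process, so if the
   second line never prints, the failure happens inside
   paddle_gradient_machine_forward rather than during loading. */
paddle_error err;

err = paddle_gradient_machine_load_parameter_from_disk(left_machine, "./data/left");
fprintf(stderr, "load_parameter_from_disk returned %d\n", (int)err);

err = paddle_gradient_machine_forward(left_machine, left_in_args, out_args, false);
fprintf(stderr, "forward returned %d\n", (int)err);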

wuyan08 commented 6 years ago

I added some logging: paddle_gradient_machine_load_parameter_from_disk returns successfully. Last Friday yaming mentioned that the fluid C-API had a similar problem, where a bug in the cuDNN wrapper corrupted GPU memory; it was fixed in late April. Has the same fix been applied to Paddle v2?

Superjomn commented 6 years ago

@pkuyym could you sync some of that information here?

wuyan08 commented 6 years ago

Here is the inference demo code:

// Initialize Paddle in GPU mode (the same flag shown in the log above).
char* argv[] = {"--use_gpu=True"};
CHECK(paddle_init(1, (char**)argv));

// Build the gradient machine from the binary model config and load parameters.
paddle_gradient_machine left_machine;
long size;
void* buf = read_config(CONFIG_LEFT_BIN, &size);
CHECK(paddle_gradient_machine_create_for_inference(&left_machine, buf, (int)size));
CHECK(paddle_gradient_machine_randomize_param(left_machine));
CHECK(paddle_gradient_machine_load_parameter_from_disk(left_machine, "./data/left"));

// Input/output arguments.
paddle_arguments left_in_args = paddle_arguments_create_none();
CHECK(paddle_arguments_resize(left_in_args, 1));
paddle_matrix left_mat = paddle_matrix_create(1, 32, true);
paddle_arguments out_args = paddle_arguments_create_none();

// One input sequence of word ids, stored in a GPU integer vector.
int sentence_ids0[] = {330070, 1515788, 1606717, 163247, 1622216, 251207, 304166, 729241, 1177768};
int query_count = sizeof(sentence_ids0) / sizeof(int);
paddle_ivector sentence0 = paddle_ivector_create(sentence_ids0, query_count, false, true); // on GPU
CHECK(paddle_arguments_set_ids(left_in_args, 0, sentence0));

// Sequence start positions: a single sequence covering all ids.
int seq_pos_array0[] = {0, query_count};
paddle_ivector seq_pos0 = paddle_ivector_create(seq_pos_array0, sizeof(seq_pos_array0) / sizeof(int), false, true);
CHECK(paddle_arguments_set_sequence_start_pos(left_in_args, 0, 0, seq_pos0));

// Forward pass and fetch the output value.
CHECK(paddle_gradient_machine_forward(left_machine, left_in_args, out_args, false));
CHECK(paddle_arguments_get_value(out_args, 0, left_mat));
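
The CHECK macro and the read_config helper are not shown in the snippet. For reference, a minimal CHECK along the lines of the Paddle v2 C-API examples might look like the following; this is a sketch of an assumed definition, not the exact macro used in the demo.

#include <stdio.h>
#include <stdlib.h>
#include <paddle/capi.h>  /* header path assumed from the v2 C-API examples */

/* Abort with location info if any C-API call returns something other than
   kPD_NO_ERROR, so the first failing call is easy to spot. */
#define CHECK(stmt)                                               \
  do {                                                            \
    paddle_error __err__ = (stmt);                                \
    if (__err__ != kPD_NO_ERROR) {                                \
      fprintf(stderr, "paddle error %d at %s:%d\n", (int)__err__, \
              __FILE__, __LINE__);                                \
      exit(1);                                                    \
    }                                                             \
  } while (0)
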
HardSoft2023 commented 6 years ago

Why is a CUDA error reported even when use_gpu=False is used?

shanyi15 commented 6 years ago

Hello, this issue has not been updated in the past month, so we will close it today for the sake of other users' experience. If you still need to follow up on this question after it is closed, please feel free to reopen it and we will get back to you within 24 hours. We apologize for the inconvenience caused by the closure and thank you for your support of PaddlePaddle!