PaddlePaddle / Paddle

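PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (『飞桨』 core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)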
http://www.paddlepaddle.org/
Apache License 2.0

FatalError: Segmentation fault #37018

Closed dang-nh closed 1 year ago

dang-nh commented 3 years ago
eval model::   3% 10/300 [00:08<04:12,  1.15it/s]

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&)
1   paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap, paddle::platform::Place const&, bool, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&)
2   paddle::imperative::PreparedOp::Run(paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap const&, paddle::framework::AttributeMap const&)
3   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::CUDNNConvOpKernel<float>, paddle::operators::CUDNNConvOpKernel<double>, paddle::operators::CUDNNConvOpKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4   paddle::operators::CUDNNConvOpKernel<float>::Compute(paddle::framework::ExecutionContext const&) const
5   paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
6   paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
7   paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
8   paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
9   paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
10  paddle::memory::allocation::AutoGrowthBestFitAllocator::FreeIdleChunks()
----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1636257571 (unix time) try "date -d @1636257571" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x28) received by PID 960 (TID 0x7f26d386d780) from PID 40 ***]

I don't know where the problem is. I searched through many existing solutions, but none of them solved it. Could you help me take a look?
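
The two bracketed lines in the error summary above can be decoded mechanically. As a minimal sketch (the regular expressions and the name `parse_crash_summary` are illustrative assumptions, not part of Paddle), the following Python extracts the abort time, the signal name, the faulting address, and the process ID:

```python
import re
from datetime import datetime, timezone

# Patterns written against the two summary lines quoted in this issue;
# other Paddle versions may format these lines differently.
TIME_RE = re.compile(r"Aborted at (\d+) \(unix time\)")
SIGNAL_RE = re.compile(
    r"\*\*\* (SIG[A-Z]+) \(@(0x[0-9a-fA-F]+)\) received by PID (\d+)"
)

def parse_crash_summary(text):
    """Return the abort time (UTC), signal, fault address, and PID found in text."""
    info = {}
    time_match = TIME_RE.search(text)
    if time_match:
        stamp = int(time_match.group(1))
        # Same conversion as `date -u -d @<stamp>` with GNU date.
        info["time_utc"] = datetime.fromtimestamp(stamp, tz=timezone.utc).isoformat()
    signal_match = SIGNAL_RE.search(text)
    if signal_match:
        info["signal"] = signal_match.group(1)
        info["fault_address"] = int(signal_match.group(2), 16)
        info["pid"] = int(signal_match.group(3))
    return info

SUMMARY = (
    '[TimeInfo: *** Aborted at 1636257571 (unix time) try "date -d @1636257571" '
    "if you are using GNU date ***]\n"
    "[SignalInfo: *** SIGSEGV (@0x28) received by PID 960 "
    "(TID 0x7f26d386d780) from PID 40 ***]"
)

print(parse_crash_summary(SUMMARY))
```

For this report the fields are SIGSEGV, PID 960, and fault address 0x28. An address this close to zero is often a sign of a member access through a null pointer, although the summary alone cannot confirm that.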

paddle-bot-old[bot] commented 3 years ago

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your question as soon as possible. Please check again that you have provided a clear problem description, reproduction code, environment and version details, and the error message. You may also look for an answer in the official API documentation, the FAQ, past GitHub issues, and the AI community. Have a nice day!

GuoxiaWang commented 3 years ago

@dang-nh194423

Please enable VLOG to get more info:

export GLOG_v=3
export FLAGS_call_stack_level=2

You can also post your code below.

dang-nh commented 3 years ago

@GuoxiaWang Yes, I'm here

dang-nh commented 3 years ago

@GuoxiaWang Excuse me, could you explain more clearly? I'm using Google Colab to train the pretrained PPOCR model.

export GLOG_v=3
export FLAGS_call_stack_level=2

Is this the code? I tried pasting it into my notebook, but it didn't display anything. Thank you.

GuoxiaWang commented 3 years ago

Those commands export environment variables in a Linux terminal.

You can also set them in Python:

# GLOG_v means VLOG level
# FLAGS_call_stack_level means C++ call stack
import os
os.environ['GLOG_v'] = "3"
os.environ['FLAGS_call_stack_level'] = "2"

dang-nh commented 3 years ago

@GuoxiaWang I pasted this code and ran it again, but it still displays the output below, and I don't know what happened.

Streaming output truncated to the last 5000 lines.
I1108 00:19:12.636030   373 tracer.cc:209] No Grad to track for Op: adam
I1108 00:19:12.636104   373 tracer.cc:139] Trace Op: adam
I1108 00:19:12.636132   373 prepared_operator.cc:111] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1108 00:19:12.636152   373 adam_op.cc:72] dims of Beta1Pow : [1]
I1108 00:19:12.636162   373 adam_op.cc:79] dims of Beta2Pow : [1]
I1108 00:19:12.636175   373 adam_op.cu:191] beta1_pow.numel() : 1beta2_pow.numel() : 1
I1108 00:19:12.636185   373 adam_op.cu:193] param.numel(): 512
I1108 00:19:12.636214   373 tracer.cc:209] No Grad to track for Op: adam
I1108 00:19:12.636284   373 tracer.cc:139] Trace Op: adam
I1108 00:19:12.636339   373 prepared_operator.cc:111] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1108 00:19:12.636368   373 adam_op.cc:72] dims of Beta1Pow : [1]
I1108 00:19:12.636405   373 adam_op.cc:79] dims of Beta2Pow : [1]
I1108 00:19:12.636438   373 adam_op.cu:191] beta1_pow.numel() : 1beta2_pow.numel() : 1
I1108 00:19:12.636463   373 adam_op.cu:193] param.numel(): 512
I1108 00:19:12.636495   373 tracer.cc:209] No Grad to track for Op: adam
I1108 00:19:12.636613   373 tracer.cc:139] Trace Op: adam
I1108 00:19:12.636658   373 prepared_operator.cc:111] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1108 00:19:12.636694   373 adam_op.cc:72] dims of Beta1Pow : [1]
I1108 00:19:12.636705   373 adam_op.cc:79] dims of Beta2Pow : [1]
I1108 00:19:12.636721   373 adam_op.cu:191] beta1_pow.numel() : 1beta2_pow.numel() : 1
I1108 00:19:12.636731   373 adam_op.cu:193] param.numel(): 2359296
I1108 00:19:12.636761   373 tracer.cc:209] No Grad to track for Op: adam
I1108 00:19:12.636880   373 tracer.cc:139] Trace Op: adam
I1108 00:19:12.636907   373 prepared_operator.cc:111] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1108 00:19:12.636927   373 adam_op.cc:72] dims of Beta1Pow : [1]
I1108 00:19:12.636937   373 adam_op.cc:79] dims of Beta2Pow : [1]
I1108 00:19:12.636952   373 adam_op.cu:191] beta1_pow.numel() : 1beta2_pow.numel() : 1
I1108 00:19:12.636961   373 adam_op.cu:193] param.numel(): 512
I1108 00:19:12.636989   373 tracer.cc:209] No Grad to track for Op: adam
I1108 00:19:12.637060   373 tracer.cc:139] Trace Op: adam
I1108 00:19:12.637089   373 prepared_operator.cc:111] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I1108 00:19:12.637122   373 adam_op.cc:72] dims of Beta1Pow : [1]
I1108 00:19:12.637133   373 adam_op.cc:79] dims of Beta2Pow : [1]
I1108 00:19:12.637147   373 adam_op.cu:191] beta1_pow.numel() : 1beta2_pow.numel() : 1
I1108 00:19:12.637157   373 adam_op.cu:193] param.numel(): 512
I1108 00:19:12.637187   373 tracer.cc:209] No Grad to track for Op: adam
I1108 00:19:12.637259   373 tracer.cc:139] Trace Op: adam

Can you help me? Thank you. 
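
A tail like the one above is easier to read once it is summarized. As a rough sketch (the regular expression assumes the glog line layout seen in this paste, and the helper names `parse_glog_line` and `summarize` are hypothetical), the following Python counts lines per source file and lists the last operators announced by `Trace Op:` messages:

```python
import re
from collections import Counter

# Layout assumed from the pasted lines:
# <severity><MMDD> <HH:MM:SS.uuuuuu> <thread id> <file>:<line>] <message>
GLOG_RE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+(\d+) ([^\s:]+):(\d+)\] (.*)$"
)

def parse_glog_line(line):
    """Split one glog line into fields, or return None if it does not match."""
    match = GLOG_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    severity, date, time, thread, source, lineno, message = match.groups()
    return {
        "severity": severity,
        "date": date,
        "time": time,
        "thread": int(thread),
        "source": source,
        "line": int(lineno),
        "message": message,
    }

def summarize(lines, tail=3):
    """Count log lines per source file and return the last `tail` traced ops."""
    per_source = Counter()
    traced_ops = []
    prefix = "Trace Op: "
    for line in lines:
        record = parse_glog_line(line)
        if record is None:
            continue  # skip anything that is not a glog line
        per_source[record["source"]] += 1
        if record["message"].startswith(prefix):
            traced_ops.append(record["message"][len(prefix):])
    return per_source, traced_ops[-tail:]

SAMPLE_LINES = [
    "I1108 00:19:12.636030   373 tracer.cc:209] No Grad to track for Op: adam",
    "I1108 00:19:12.636104   373 tracer.cc:139] Trace Op: adam",
    "I1108 00:19:12.636132   373 prepared_operator.cc:111] expected_kernel_key:data_type[float]",
    "I1108 00:19:12.636185   373 adam_op.cu:193] param.numel(): 512",
]

per_source, last_ops = summarize(SAMPLE_LINES)
print(per_source.most_common(), last_ops)
```

In the pasted tail every traced op is `adam`, an optimizer step, so this fragment probably comes from a training run and not from the evaluation that crashed.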

GuoxiaWang commented 3 years ago

@dang-nh194423

This output is the C++ runtime log.

Can you attach the full log file?
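
Colab keeps only the last 5000 lines of streamed output, so one way to obtain a complete log is to run the script as a child process and redirect its output to a file. The sketch below is an assumption about how this could be done, not an official Paddle workflow; the helper name `run_with_debug_env` and the file name `eval_full.log` are hypothetical, while the two environment variables are the ones named earlier in this thread.

```python
import os
import subprocess

def run_with_debug_env(cmd, log_path, extra_env=None):
    """Run cmd with verbose Paddle logging and write stdout and stderr to log_path."""
    env = dict(os.environ)
    env["GLOG_v"] = "3"                  # print C++ VLOG(3) messages
    env["FLAGS_call_stack_level"] = "2"  # print the C++ call stack on errors
    if extra_env:
        env.update(extra_env)
    with open(log_path, "w") as log_file:
        result = subprocess.run(
            cmd, env=env, stdout=log_file, stderr=subprocess.STDOUT
        )
    return result.returncode

# Hypothetical usage with the evaluation command from this issue:
# code = run_with_debug_env(
#     ["python3", "tools/eval.py",
#      "-c", "configs/det/ch_ppocr_v2.0/ch_det_res18_db_v2.0.yml",
#      "-o", "Global.checkpoints=./output/ch_db_res18_2/latest"],
#     "eval_full.log",
# )
```

On Linux, a negative return code means the child was terminated by a signal, so a value of -11 would correspond to SIGSEGV. The resulting file can then be attached to the issue in full.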

dang-nh commented 3 years ago

@GuoxiaWang Is this the log? eval.log

GuoxiaWang commented 3 years ago

@dang-nh194423

Yes, but please enable GLOG_v and the C++ call stack.

# at the start of your Python training script
import os
os.environ['GLOG_v'] = "3"
os.environ['FLAGS_call_stack_level'] = "2"

dang-nh commented 3 years ago

@GuoxiaWang I'm so sorry, but I pasted this code into Google Colab and it didn't display anything. Thank you so much.

GuoxiaWang commented 3 years ago

Regarding https://github.com/PaddlePaddle/Paddle/issues/37018#issuecomment-962712009:

So how did you get this log? I need the full log, but you pasted only the tail of the log file.

dang-nh commented 3 years ago

image

I can't find a log file for the evaluation. This folder has only train.log. I read train.log, and I think the eval log would look the same, because I got this error when I ran the model evaluation.

GuoxiaWang commented 3 years ago

@dang-nh194423

# evaluation script
import os
os.environ['GLOG_v'] = "3"
os.environ['FLAGS_call_stack_level'] = "2"

# Please set these environment variables in the process that runs the Paddle code
# GLOG_v means VLOG level
# FLAGS_call_stack_level means C++ call stack
# os.environ['GLOG_v'] = "3" will print C++ VLOG(3) info
# os.environ['FLAGS_call_stack_level'] = "2" will print the C++ call stack

In https://github.com/PaddlePaddle/Paddle/issues/37018#issue-1046657151 I cannot find any useful debug info, so I need to find out which op raises the exception and where.

GuoxiaWang commented 3 years ago

@dang-nh194423

image

What error is this?

dang-nh commented 3 years ago

@GuoxiaWang I don't think that is an error, because when I evaluate another model (detection with MobileNet), no error happens. But when I use ResNet18, this error happens.

GuoxiaWang commented 3 years ago

@dang-nh194423

image

This indicates that the error happened during GPU memory allocation in a Conv layer.
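
That reading follows from the innermost frames of the C++ traceback in the original report. As a minimal sketch (the helper names `frame_name` and `parse_frames` are hypothetical, and the parsing handles only the simple frame shapes copied here), the following Python strips argument lists so the call chain is easier to scan:

```python
import re

FRAME_RE = re.compile(r"^(\d+)\s+(.*)$")

def frame_name(signature):
    """Cut a C++ signature at the first '(' that sits outside template brackets."""
    depth = 0
    for index, char in enumerate(signature):
        if char == "<":
            depth += 1
        elif char == ">":
            depth -= 1
        elif char == "(" and depth == 0:
            return signature[:index]
    return signature

def parse_frames(text):
    """Return (frame number, function name) pairs in the order they are listed."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), frame_name(match.group(2))))
    return frames

# Frames 4 to 10 copied from the traceback at the top of this issue.
TRACEBACK = """\
4   paddle::operators::CUDNNConvOpKernel<float>::Compute(paddle::framework::ExecutionContext const&) const
5   paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
6   paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
7   paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
8   paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
9   paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
10  paddle::memory::allocation::AutoGrowthBestFitAllocator::FreeIdleChunks()
"""

for number, name in parse_frames(TRACEBACK):
    print(number, name)
```

Because the traceback lists the most recent call last, frame 10 is where the crash occurred: the chain starts in the cuDNN convolution kernel, requests tensor memory, and ends inside the allocator.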

dang-nh commented 3 years ago

@GuoxiaWang Yes, thank you. But what should I do now?

GuoxiaWang commented 3 years ago

@dang-nh194423

Can you paste your code?

dang-nh commented 3 years ago

@GuoxiaWang I only use the single line below 😊

!python3 tools/eval.py -c configs/det/ch_ppocr_v2.0/ch_det_res18_db_v2.0.yml -o Global.checkpoints=./output/ch_db_res18_2/latest

GuoxiaWang commented 3 years ago

@dang-nh194423

Which repo is this?

dang-nh commented 3 years ago

@GuoxiaWang I use Google Colab, so I haven't pushed it to GitHub. I want to train the pretrained model, but I found only one page on how to do it. If you need it, I will share it with you.

GuoxiaWang commented 3 years ago

@dang-nh194423

Segmentation fault

Maybe it is an illegal attempt to access an uninitialized tensor.

dang-nh commented 3 years ago

So, there is no solution for this error 😢

GuoxiaWang commented 3 years ago

@dang-nh194423

There is no more debug info, so I have no way to find the real cause.

dang-nh commented 3 years ago

Thank you so much for your help! Nice to meet you!