PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

Training PPYOLO fails with an unexplained error #1389

Closed yeyupiaoling closed 4 years ago

yeyupiaoling commented 4 years ago

Environment:

Error message:

2020-09-10 18:04:12,059-INFO: iter: 3700, lr: 0.000925, 'loss_xy': '0.604115', 'loss_wh': '0.627810', 'loss_obj': '3.390031', 'loss_cls': '2.499329', 'loss_iou': '2.253458', 'loss_iou_aware': '0.017610', 'loss': '9.358524', time: 0.535, eta: 3 days, 1:46:08
2020-09-10 18:05:06,607-INFO: iter: 3800, lr: 0.000950, 'loss_xy': '0.586375', 'loss_wh': '0.599361', 'loss_obj': '3.342260', 'loss_cls': '2.450277', 'loss_iou': '2.174241', 'loss_iou_aware': '0.018188', 'loss': '9.133036', time: 0.544, eta: 3 days, 2:59:30
2020-09-10 18:06:01,443-INFO: iter: 3900, lr: 0.000975, 'loss_xy': '0.581221', 'loss_wh': '0.586602', 'loss_obj': '3.305507', 'loss_cls': '2.463003', 'loss_iou': '2.171795', 'loss_iou_aware': '0.016012', 'loss': '9.102219', time: 0.550, eta: 3 days, 3:48:00
2020-09-10 18:06:54,202-INFO: iter: 4000, lr: 0.001000, 'loss_xy': '0.583331', 'loss_wh': '0.616600', 'loss_obj': '3.176893', 'loss_cls': '2.360175', 'loss_iou': '2.219807', 'loss_iou_aware': '0.015864', 'loss': '9.072341', time: 0.526, eta: 3 days, 0:28:04
2020-09-10 18:07:48,287-INFO: iter: 4100, lr: 0.001000, 'loss_xy': '0.580252', 'loss_wh': '0.612756', 'loss_obj': '3.205532', 'loss_cls': '2.373779', 'loss_iou': '2.197740', 'loss_iou_aware': '0.015548', 'loss': '8.962381', time: 0.541, eta: 3 days, 2:35:08
W0910 18:08:32.227979  1947 init.cc:226] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0910 18:08:32.228004  1947 init.cc:228] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0910 18:08:32.228008  1947 init.cc:231] The detail failure signal is:

W0910 18:08:32.228010  1947 init.cc:234] *** Aborted at 1599732512 (unix time) try "date -d @1599732512" if you are using GNU date ***
W0910 18:08:32.228685  1947 init.cc:234] PC: @                0x0 (unknown)
W0910 18:08:32.229272  1947 init.cc:234] *** SIGSEGV (@0x10) received by PID 1872 (TID 0x7f8eeefdc700) from PID 16; stack trace: ***
W0910 18:08:32.229846  1947 init.cc:234]     @     0x7f91ff75cfd0 (unknown)
W0910 18:08:32.231119  1947 init.cc:234]     @     0x7f916d73908a paddle::platform::proto::MessageDesc::MergePartialFromCodedStream()
W0910 18:08:32.233204  1947 init.cc:234]     @     0x7f916d739971 paddle::platform::proto::AllMessageDesc::MergePartialFromCodedStream()
W0910 18:08:32.235473  1947 init.cc:234]     @     0x7f916d73a3dd paddle::platform::proto::cudaerrorDesc::MergePartialFromCodedStream()
W0910 18:08:32.237517  1947 init.cc:234]     @     0x7f916cfed270 google::protobuf::MessageLite::ParseFromCodedStream()
W0910 18:08:32.239045  1947 init.cc:234]     @     0x7f916cfed41a google::protobuf::MessageLite::ParseFromZeroCopyStream()
W0910 18:08:32.240597  1947 init.cc:234]     @     0x7f916cff3b79 google::protobuf::Message::ParseFromIstream()
W0910 18:08:32.242121  1947 init.cc:234]     @     0x7f916a0b90e2 paddle::platform::build_nvidia_error_msg()
W0910 18:08:32.243868  1947 init.cc:234]     @     0x7f916d6cb2df paddle::platform::GpuMemcpyAsync()
W0910 18:08:32.245286  1947 init.cc:234]     @     0x7f916d69b44e paddle::memory::Copy<>()
W0910 18:08:32.246776  1947 init.cc:234]     @     0x7f916a4bb272 paddle::operators::SumToLoDTensor<>()
W0910 18:08:32.248064  1947 init.cc:234]     @     0x7f916a4c30b8 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EINS0_9operators9SumKernelINS7_17CUDADeviceContextEfEENSA_ISB_dEENSA_ISB_iEENSA_ISB_lEENSA_ISB_NS7_7float16EEEEEclEPKcSK_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W0910 18:08:32.250301  1947 init.cc:234]     @     0x7f916d5f9830 paddle::framework::OperatorWithKernel::RunImpl()
W0910 18:08:32.253427  1947 init.cc:234]     @     0x7f916d5fa021 paddle::framework::OperatorWithKernel::RunImpl()
W0910 18:08:32.255237  1947 init.cc:234]     @     0x7f916d5f2fe1 paddle::framework::OperatorBase::Run()
W0910 18:08:32.257776  1947 init.cc:234]     @     0x7f916d3028c6 paddle::framework::details::ComputationOpHandle::RunImpl()
W0910 18:08:32.260211  1947 init.cc:234]     @     0x7f916d2a8aa1 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
W0910 18:08:32.262575  1947 init.cc:234]     @     0x7f916d2a659f paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
W0910 18:08:32.263298  1947 init.cc:234]     @     0x7f916d2a6864 _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
W0910 18:08:32.266253  1947 init.cc:234]     @     0x7f916a09d413 std::_Function_handler<>::_M_invoke()
W0910 18:08:32.269752  1947 init.cc:234]     @     0x7f9169e97107 std::__future_base::_State_base::_M_do_set()
W0910 18:08:32.270504  1947 init.cc:234]     @     0x7f91ff50e827 __pthread_once_slow
W0910 18:08:32.271306  1947 init.cc:234]     @     0x7f916d2a2a32 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
W0910 18:08:32.273910  1947 init.cc:234]     @     0x7f9169e99564 _ZZN10ThreadPoolC1EmENKUlvE_clEv
W0910 18:08:32.274421  1947 init.cc:234]     @     0x7f91d0aaca50 (unknown)
W0910 18:08:32.275040  1947 init.cc:234]     @     0x7f91ff5066db start_thread
W0910 18:08:32.275523  1947 init.cc:234]     @     0x7f91ff83fa3f clone
W0910 18:08:32.275995  1947 init.cc:234]     @                0x0 (unknown)
qingqing01 commented 4 years ago

@yeyupiaoling Nothing in the error output stands out. Is there anything special about "iter: 4100"? Also, what are worker_num and bufsize set to in your YML?
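To answer the question about `worker_num` and `bufsize`, the two values can be pulled out of the config text directly. The following is a minimal sketch using only the standard library. The helper name `find_reader_settings`, the `TrainReader` section name, and the values in `sample` are assumptions for illustration; they are not taken from the config used in this issue.

```python
import re

def find_reader_settings(yaml_text, keys=("worker_num", "bufsize")):
    """Collect every simple `key: value` line for the requested keys.

    This is a line-based scan, not a YAML parser: it ignores nesting and
    only handles scalar values written on the same line as the key.
    """
    found = {key: [] for key in keys}
    for line in yaml_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = re.match(r"\s*(\w+)\s*:\s*(\S+)\s*$", line)
        if match and match.group(1) in found:
            found[match.group(1)].append(match.group(2))
    return found

# Hypothetical config fragment; the real values live in the user's YML file.
sample = """
TrainReader:
  batch_size: 24
  worker_num: 8   # hypothetical value
  bufsize: 4      # hypothetical value
"""

settings = find_reader_settings(sample)
print(settings)
```

Values are returned as strings, and a key that appears in several reader sections yields several entries, so it is clear which settings exist without guessing at the file layout.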

yeyupiaoling commented 4 years ago

@qingqing01 No, nothing special. The crash is random and does not always happen at this iteration. Config file: https://github.com/yeyupiaoling/PP-YOLO/blob/master/configs/ppyolo.yml
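Because the crash does not land on a fixed iteration, one way to compare runs is to extract the last iteration logged before the failure signal. The sketch below parses the INFO lines in the format shown in the error log above; the function name `parse_train_log` and the regular expression are assumptions written for this log format, not part of PaddleDetection.

```python
import re

# Matches the iteration, learning rate and total loss of one INFO line.
# "'loss': '" only matches the total loss, not 'loss_xy', 'loss_wh', etc.
LOG_PATTERN = re.compile(r"iter: (\d+), lr: ([\d.]+),.*'loss': '([\d.]+)'")

def parse_train_log(lines):
    """Return (iteration, learning_rate, total_loss) for each training line."""
    records = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            records.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return records

# Three lines copied from the log in this issue.
log_lines = [
    "2020-09-10 18:06:54,202-INFO: iter: 4000, lr: 0.001000, 'loss_xy': '0.583331', 'loss_wh': '0.616600', 'loss_obj': '3.176893', 'loss_cls': '2.360175', 'loss_iou': '2.219807', 'loss_iou_aware': '0.015864', 'loss': '9.072341', time: 0.526, eta: 3 days, 0:28:04",
    "2020-09-10 18:07:48,287-INFO: iter: 4100, lr: 0.001000, 'loss_xy': '0.580252', 'loss_wh': '0.612756', 'loss_obj': '3.205532', 'loss_cls': '2.373779', 'loss_iou': '2.197740', 'loss_iou_aware': '0.015548', 'loss': '8.962381', time: 0.541, eta: 3 days, 2:35:08",
    "W0910 18:08:32.227979  1947 init.cc:226] Warning: PaddlePaddle catches a failure signal, it may not work properly",
]

records = parse_train_log(log_lines)
last_iter, last_lr, last_loss = records[-1]
print(f"last logged iteration: {last_iter}, loss: {last_loss}")
```

Running this over the logs of several failed runs shows whether the last logged iteration or the loss just before the crash has anything in common across runs.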

yeyupiaoling commented 4 years ago

This is probably related to this error: https://github.com/PaddlePaddle/PaddleDetection/issues/1379