PaddlePaddle / models

Officially maintained, supported by PaddlePaddle, including CV, NLP, Speech, Rec, TS, big models and so on.
Apache License 2.0

FatalError: A serious error (Segmentation fault) is detected by the operating system. (at /paddle/paddle/fluid/platform/init.cc:303) #5131

Open wwdok opened 3 years ago

wwdok commented 3 years ago

Hi everyone, I ran into the error below while running the video_tag sample code by following this tutorial. Does anyone know what causes it? Thanks!

(base) user@user-TUF-Gaming-FX506LU-FX506LU:~/Repo/PaddlePaddle/models/PaddleCV/video/application/video_tag$ python videotag_test.py
Namespace(extractor_config='configs/tsn.yaml', extractor_name='TSN', extractor_weights='weights/tsn', filelist='./data/VideoTag_test.list', label_file='label_3396.txt', predictor_config='configs/attention_lstm.yaml', predictor_name='AttentionLSTM', predictor_weights='weights/attention_lstm', save_dir='data/VideoTag_results', use_gpu=True)
[INFO: videotag_test.py:  240]: Namespace(extractor_config='configs/tsn.yaml', extractor_name='TSN', extractor_weights='weights/tsn', filelist='./data/VideoTag_test.list', label_file='label_3396.txt', predictor_config='configs/attention_lstm.yaml', predictor_name='AttentionLSTM', predictor_weights='weights/attention_lstm', save_dir='data/VideoTag_results', use_gpu=True)
W1222 17:25:32.329594 11924 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 11.1, Runtime API Version: 10.2
W1222 17:25:32.357901 11924 device_context.cc:346] device: 0, cuDNN Version: 8.0.
[INFO: videotag_test.py:  138]: load extractor weights from weights/tsn
[INFO: tsn.py:  155]: Load pretrain weights from weights/tsn, exclude fc layer.
===pretrain=== weights/tsn

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::framework::SignalHandle(char const*, int)
1   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
FatalError: A serious error (Segmentation fault) is detected by the operating system. (at /paddle/paddle/fluid/platform/init.cc:303)
  [TimeInfo: *** Aborted at 1608629142 (unix time) try "date -d @1608629142" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x0) received by PID 11924 (TID 0x7f5a6801a740) from PID 0 ***]

Segmentation fault (core dumped)

My system is Ubuntu 18.04 and the PaddlePaddle version is 2.0.0rc0.
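
For anyone triaging this, the two `device_context.cc` warnings in the log already state the CUDA driver, CUDA runtime and cuDNN versions in use. The following is a minimal sketch that pulls those numbers out of a saved log so they can be quoted in a report. It assumes the warning format stays exactly as shown above; `summarize_gpu_env` is a hypothetical helper, not part of Paddle, and collecting the versions does not by itself show that this combination causes the crash.

```python
import re

# Two warning lines copied verbatim from the log above.
LOG = """\
W1222 17:25:32.329594 11924 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 11.1, Runtime API Version: 10.2
W1222 17:25:32.357901 11924 device_context.cc:346] device: 0, cuDNN Version: 8.0.
"""


def summarize_gpu_env(log_text):
    """Extract the GPU-related versions Paddle prints at start-up.

    Hypothetical helper: returns None for any field that is not found.
    """
    patterns = {
        "capability": r"CUDA Capability: (\d+)",
        "driver_api": r"Driver API Version: (\d+\.\d+)",
        "runtime_api": r"Runtime API Version: (\d+\.\d+)",
        "cudnn": r"cuDNN Version: (\d+\.\d+)",
    }
    found = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log_text)
        found[key] = match.group(1) if match else None
    return found


env = summarize_gpu_env(LOG)
print(env)
```

Comparing these values against the CUDA and cuDNN versions that the installed PaddlePaddle wheel was built for is a reasonable first check.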

bjjwwang commented 3 years ago

OK, got it. We'll look into this and reply as soon as possible.

wwdok commented 3 years ago

I ran into the same problem when running CTCN:

(base) root@LAB_VM:/home/PaddlePaddle/models/PaddleCV/video# bash run.sh predict CTCN ./configs/ctcn.yaml
predict CTCN ./configs/ctcn.yaml
DALI is not installed, you can improve performance if use DALI
[INFO: predict.py:  200]: Namespace(batch_size=1, config='./configs/ctcn.yaml', filelist=None, infer_topk=20, log_interval=1, model_name='CTCN', save_dir='data/predict_results', use_gpu=True, video_path='', weights=None)
[INFO: config_utils.py:   69]: ---------------- Infer Arguments ----------------
[INFO: config_utils.py:   72]: MODEL:
[INFO: config_utils.py:   74]:     name:CTCN
[INFO: config_utils.py:   74]:     num_classes:201
[INFO: config_utils.py:   74]:     img_size:512
[INFO: config_utils.py:   74]:     concept_size:402
[INFO: config_utils.py:   74]:     num_anchors:7
[INFO: config_utils.py:   74]:     total_num_anchors:1785
[INFO: config_utils.py:   74]:     snippet_length:1
[INFO: config_utils.py:   74]:     root:./data/dataset/ctcn/feats
[INFO: config_utils.py:   72]: TRAIN:
[INFO: config_utils.py:   74]:     epoch:35
[INFO: config_utils.py:   74]:     filelist:./data/dataset/ctcn/Activity1.3_train_rgb.listformat
[INFO: config_utils.py:   74]:     rgb:senet152-201cls-rgb-70.3-5seg-331data_331img_train
[INFO: config_utils.py:   74]:     flow:senet152-201cls-flow-60.9-5seg-331data_train
[INFO: config_utils.py:   74]:     batch_size:16
[INFO: config_utils.py:   74]:     num_threads:8
[INFO: config_utils.py:   74]:     use_gpu:True
[INFO: config_utils.py:   74]:     num_gpus:8
[INFO: config_utils.py:   74]:     learning_rate:0.0005
[INFO: config_utils.py:   74]:     learning_rate_decay:0.1
[INFO: config_utils.py:   74]:     lr_decay_iter:9000
[INFO: config_utils.py:   74]:     l2_weight_decay:0.0001
[INFO: config_utils.py:   74]:     momentum:0.9
[INFO: config_utils.py:   72]: VALID:
[INFO: config_utils.py:   74]:     filelist:./data/dataset/ctcn/Activity1.3_val_rgb.listformat
[INFO: config_utils.py:   74]:     rgb:senet152-201cls-rgb-70.3-5seg-331data_331img_val
[INFO: config_utils.py:   74]:     flow:senet152-201cls-flow-60.9-5seg-331data_val
[INFO: config_utils.py:   74]:     batch_size:16
[INFO: config_utils.py:   74]:     num_threads:8
[INFO: config_utils.py:   74]:     use_gpu:True
[INFO: config_utils.py:   74]:     num_gpus:8
[INFO: config_utils.py:   72]: TEST:
[INFO: config_utils.py:   74]:     filelist:./data/dataset/ctcn/Activity1.3_val_rgb.listformat
[INFO: config_utils.py:   74]:     rgb:senet152-201cls-rgb-70.3-5seg-331data_331img_val
[INFO: config_utils.py:   74]:     flow:senet152-201cls-flow-60.9-5seg-331data_val
[INFO: config_utils.py:   74]:     class_label_file:./data/dataset/ctcn/labels.txt
[INFO: config_utils.py:   74]:     video_duration_file:./data/dataset/ctcn/val_duration_frame.list
[INFO: config_utils.py:   74]:     batch_size:1
[INFO: config_utils.py:   74]:     num_threads:1
[INFO: config_utils.py:   74]:     score_thresh:0.001
[INFO: config_utils.py:   74]:     nms_thresh:0.8
[INFO: config_utils.py:   74]:     sigma_thresh:0.9
[INFO: config_utils.py:   74]:     soft_thresh:0.004
[INFO: config_utils.py:   72]: INFER:
[INFO: config_utils.py:   74]:     filelist:./data/dataset/ctcn/infer.list
[INFO: config_utils.py:   74]:     rgb:senet152-201cls-rgb-70.3-5seg-331data_331img_val
[INFO: config_utils.py:   74]:     flow:senet152-201cls-flow-60.9-5seg-331data_val
[INFO: config_utils.py:   74]:     batch_size:1
[INFO: config_utils.py:   74]:     num_threads:1
[INFO: config_utils.py:   75]: -------------------------------------------------
/root/anaconda3/lib/python3.8/site-packages/paddle/fluid/framework.py:2383: DeprecationWarning: an integer is required (got type paddle.fluid.core_avx.VarType).  Implicit conversion to integers using __int__ is deprecated, and may be removed in a future version of Python.
  self.desc._set_attr(name, val)
/root/anaconda3/lib/python3.8/site-packages/paddle/fluid/framework.py:2383: DeprecationWarning: an integer is required (got type paddle.fluid.core_avx.op_proto_and_checker_maker.OpRole).  Implicit conversion to integers using __int__ is deprecated, and may be removed in a future version of Python.
  self.desc._set_attr(name, val)
/root/anaconda3/lib/python3.8/site-packages/paddle/fluid/layers/math_op_patch.py:273: UserWarning: /home/PaddlePaddle/models/PaddleCV/video/models/ctcn/fpn_ctcn.py:76
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
  warnings.warn(
W1225 02:45:15.461153 47159 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 11.0, Runtime API Version: 10.2
W1225 02:45:15.464532 47159 device_context.cc:346] device: 0, cuDNN Version: 8.0.
[INFO: detection_metrics.py:   68]: Resetting infer metrics...
[INFO: detection_metrics.py:   68]: Resetting infer metrics...
./data/dataset/ctcn/feats/senet152-201cls-rgb-70.3-5seg-331data_331img_val/JDg--pjY5gg.pkl
./data/dataset/ctcn/feats/senet152-201cls-flow-60.9-5seg-331data_val/JDg--pjY5gg.pkl

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::framework::SignalHandle(char const*, int)
1   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
FatalError: A serious error (Segmentation fault) is detected by the operating system. (at /paddle/paddle/fluid/platform/init.cc:303)
  [TimeInfo: *** Aborted at 1608864329 (unix time) try "date -d @1608864329" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x0) received by PID 47159 (TID 0x7fd5bc71d740) from PID 0 ***]

run.sh: line 107: 47159 Segmentation fault      (core dumped) python predict.py --model_name=$name --config=$configs --log_interval=$log_interval --use_gpu=$use_gpu --video_path=''
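
The C++ traceback in both logs lists only the signal handler frames, so it does not show which Python call was active when the crash happened. One possible way to narrow that down is Python's standard `faulthandler` module, sketched below. This is an assumption about what might help, not a confirmed fix: Paddle installs its own signal handler (the `SignalHandle` frame above), and whether the `faulthandler` output survives alongside it has not been verified here.

```python
import faulthandler

# Ask the interpreter to dump the Python stack of every thread to stderr
# when it receives SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL.
# This is standard-library behaviour; no Paddle API is involved.
faulthandler.enable(all_threads=True)

# ... the body of predict.py or videotag_test.py would follow here ...
print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without editing the script by launching it as `python -X faulthandler predict.py ...`.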

indrasweb commented 3 years ago

I also got this error running on a V100 with CUDA 11. No idea how to fix it.

indrasweb commented 3 years ago

I managed to get it running by upgrading to CUDA 11.2 (https://developer.nvidia.com/cuda-downloads) and installing with Docker:

3) sudo docker run --name ppocr --gpus all -v $PWD:/paddle --shm-size=32G --network=host -it paddlepaddle/paddle:2.0.0rc1-gpu-cuda11.0-cudnn8 /bin/bash
4) python3.8 -m pip install paddleocr

However, inference is 3.5x slower than paddleocr1.1 (installed without docker). Any ideas why this might be? Whilst the accuracy is better, the performance hit is undesirable.
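
To pin down where a slowdown like this comes from, it helps to time both installations the same way and to leave warm-up calls out of the measurement, since the first GPU calls usually include one-off initialization. Below is a minimal sketch of such a measurement. `time_inference` and `dummy_infer` are hypothetical names used for illustration; the real OCR call would replace `dummy_infer`.

```python
import statistics
import time


def time_inference(infer_fn, warmup=3, runs=10):
    """Return the median wall-clock seconds of infer_fn() over `runs` calls."""
    for _ in range(warmup):
        infer_fn()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer_fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Hypothetical stand-in workload; replace with the real inference call.
def dummy_infer():
    return sum(i * i for i in range(10_000))


median_s = time_inference(dummy_infer)
print(f"median latency: {median_s * 1000:.3f} ms")
```

Running the same function with the same input images inside and outside the container gives two comparable medians, which separates steady-state inference cost from start-up cost.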