PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

[Text Detection]: 'Segmentation Fault is detected by the operating system' error while training #4784

Closed lannguyen0910 closed 2 years ago

lannguyen0910 commented 2 years ago

I've hit this error many times while trying to train one specific text detection model.

My Paddle setup on Colab:

pip3 install paddlepaddle
pip3 install paddlepaddle-gpu

I also tried another setup:

pip3 install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple
pip3 install "paddleocr>=2.0.1"

But it still didn't work.
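
A side note on the first setup: it installs both `paddlepaddle` and `paddlepaddle-gpu` into the same environment. As far as I know both wheels provide the same top-level `paddle` package, so it may be unclear which build actually gets imported. This thread does not establish that as the cause of the crash, so treat it only as something to rule out. A minimal sketch for checking which of the two wheels are present (the helper names are made up for illustration, and `importlib.metadata` needs Python 3.8+, while the log below shows Python 3.7):

```python
from importlib import metadata

# Assumption: both wheels ship the same top-level `paddle` package.
PADDLE_DISTS = ("paddlepaddle", "paddlepaddle-gpu")


def installed_paddle_dists(names=PADDLE_DISTS):
    """Return {distribution name: version} for each listed wheel that is installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; skip it.
            pass
    return found


def describe(found):
    """Turn the mapping from installed_paddle_dists() into a one-line summary."""
    if not found:
        return "no Paddle wheel installed"
    if len(found) > 1:
        return "both CPU and GPU wheels installed: " + ", ".join(sorted(found))
    name, version = next(iter(found.items()))
    return f"{name} {version}"


if __name__ == "__main__":
    print(describe(installed_paddle_dists()))
```

If it reports both wheels, uninstalling both and reinstalling only the GPU one gives a cleaner starting point before comparing versions.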

Below is my log.

-----------  Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: 0
heter_worker_num: None
heter_workers: 
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers: 
training_script: tools/train.py
training_script_args: ['-c', 'configs/ch_det_distill.yml']
worker_num: None
workers: 
------------------------------------------------
WARNING 2021-11-29 00:53:11,656 launch.py:416] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2021-11-29 00:53:11,658 launch_utils.py:527] Local start 1 processes. First process distributed environment info (Only For Debug): 
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        0                      |
    |                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:34399               |
    |                     PADDLE_TRAINERS_NUM                        1                      |
    |                PADDLE_TRAINER_ENDPOINTS                 127.0.0.1:34399               |
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                        0                      |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+

INFO 2021-11-29 00:53:11,658 launch_utils.py:531] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:732 idx:0
/usr/local/lib/python3.7/dist-packages/skimage/data/__init__.py:107: DeprecationWarning: 
    Importing file_hash from pooch.utils is DEPRECATED. Please import from the
    top-level namespace (`from pooch import file_hash`) instead, which is fully
    backwards compatible with pooch >= 0.1.

  return file_hash(path) == expected_hash
[2021/11/29 00:53:13] root INFO: Architecture : 
[2021/11/29 00:53:13] root INFO:     Models : 
[2021/11/29 00:53:13] root INFO:         Student : 
[2021/11/29 00:53:13] root INFO:             Backbone : 
[2021/11/29 00:53:13] root INFO:                 disable_se : True
[2021/11/29 00:53:13] root INFO:                 model_name : large
[2021/11/29 00:53:13] root INFO:                 name : MobileNetV3
[2021/11/29 00:53:13] root INFO:                 scale : 0.5
[2021/11/29 00:53:13] root INFO:             Head : 
[2021/11/29 00:53:13] root INFO:                 k : 50
[2021/11/29 00:53:13] root INFO:                 name : DBHead
[2021/11/29 00:53:13] root INFO:             Neck : 
[2021/11/29 00:53:13] root INFO:                 name : DBFPN
[2021/11/29 00:53:13] root INFO:                 out_channels : 96
[2021/11/29 00:53:13] root INFO:             algorithm : DB
[2021/11/29 00:53:13] root INFO:             freeze_params : False
[2021/11/29 00:53:13] root INFO:             model_type : det
[2021/11/29 00:53:13] root INFO:             return_all_feats : False
[2021/11/29 00:53:13] root INFO:         Student2 : 
[2021/11/29 00:53:13] root INFO:             Backbone : 
[2021/11/29 00:53:13] root INFO:                 disable_se : True
[2021/11/29 00:53:13] root INFO:                 model_name : large
[2021/11/29 00:53:13] root INFO:                 name : MobileNetV3
[2021/11/29 00:53:13] root INFO:                 scale : 0.5
[2021/11/29 00:53:13] root INFO:             Head : 
[2021/11/29 00:53:13] root INFO:                 k : 50
[2021/11/29 00:53:13] root INFO:                 name : DBHead
[2021/11/29 00:53:13] root INFO:             Neck : 
[2021/11/29 00:53:13] root INFO:                 name : DBFPN
[2021/11/29 00:53:13] root INFO:                 out_channels : 96
[2021/11/29 00:53:13] root INFO:             Transform : None
[2021/11/29 00:53:13] root INFO:             algorithm : DB
[2021/11/29 00:53:13] root INFO:             freeze_params : False
[2021/11/29 00:53:13] root INFO:             model_type : det
[2021/11/29 00:53:13] root INFO:             return_all_feats : False
[2021/11/29 00:53:13] root INFO:         Teacher : 
[2021/11/29 00:53:13] root INFO:             Backbone : 
[2021/11/29 00:53:13] root INFO:                 layers : 18
[2021/11/29 00:53:13] root INFO:                 name : ResNet
[2021/11/29 00:53:13] root INFO:             Head : 
[2021/11/29 00:53:13] root INFO:                 k : 50
[2021/11/29 00:53:13] root INFO:                 name : DBHead
[2021/11/29 00:53:13] root INFO:             Neck : 
[2021/11/29 00:53:13] root INFO:                 name : DBFPN
[2021/11/29 00:53:13] root INFO:                 out_channels : 256
[2021/11/29 00:53:13] root INFO:             Transform : None
[2021/11/29 00:53:13] root INFO:             algorithm : DB
[2021/11/29 00:53:13] root INFO:             freeze_params : True
[2021/11/29 00:53:13] root INFO:             model_type : det
[2021/11/29 00:53:13] root INFO:             return_all_feats : False
[2021/11/29 00:53:13] root INFO:     algorithm : Distillation
[2021/11/29 00:53:13] root INFO:     name : DistillationModel
[2021/11/29 00:53:13] root INFO: Eval : 
[2021/11/29 00:53:13] root INFO:     dataset : 
[2021/11/29 00:53:13] root INFO:         data_dir : ./dataset_det_v_3/
[2021/11/29 00:53:13] root INFO:         label_file_list : ['./dataset_det_v_3/eval/Label.txt']
[2021/11/29 00:53:13] root INFO:         name : SimpleDataSet
[2021/11/29 00:53:13] root INFO:         transforms : 
[2021/11/29 00:53:13] root INFO:             DecodeImage : 
[2021/11/29 00:53:13] root INFO:                 channel_first : False
[2021/11/29 00:53:13] root INFO:                 img_mode : BGR
[2021/11/29 00:53:13] root INFO:             DetLabelEncode : None
[2021/11/29 00:53:13] root INFO:             DetResizeForTest : None
[2021/11/29 00:53:13] root INFO:             NormalizeImage : 
[2021/11/29 00:53:13] root INFO:                 mean : [0.485, 0.456, 0.406]
[2021/11/29 00:53:13] root INFO:                 order : hwc
[2021/11/29 00:53:13] root INFO:                 scale : 1./255.
[2021/11/29 00:53:13] root INFO:                 std : [0.229, 0.224, 0.225]
[2021/11/29 00:53:13] root INFO:             ToCHWImage : None
[2021/11/29 00:53:13] root INFO:             KeepKeys : 
[2021/11/29 00:53:13] root INFO:                 keep_keys : ['image', 'shape', 'polys', 'ignore_tags']
[2021/11/29 00:53:13] root INFO:     loader : 
[2021/11/29 00:53:13] root INFO:         batch_size_per_card : 1
[2021/11/29 00:53:13] root INFO:         drop_last : False
[2021/11/29 00:53:13] root INFO:         num_workers : 2
[2021/11/29 00:53:13] root INFO:         shuffle : False
[2021/11/29 00:53:13] root INFO: Global : 
[2021/11/29 00:53:13] root INFO:     cal_metric_during_train : False
[2021/11/29 00:53:13] root INFO:     checkpoints : None
[2021/11/29 00:53:13] root INFO:     debug : False
[2021/11/29 00:53:13] root INFO:     distributed : False
[2021/11/29 00:53:13] root INFO:     epoch_num : 1000
[2021/11/29 00:53:13] root INFO:     eval_batch_step : [0, 500]
[2021/11/29 00:53:13] root INFO:     infer_img : doc/imgs_en/img_10.jpg
[2021/11/29 00:53:13] root INFO:     log_smooth_window : 20
[2021/11/29 00:53:13] root INFO:     pretrained_model : ./pretrain_models/ch_PP-OCRv2_det_distill_train/best_accuracy
[2021/11/29 00:53:13] root INFO:     print_batch_step : 2
[2021/11/29 00:53:13] root INFO:     save_epoch_step : 10
[2021/11/29 00:53:13] root INFO:     save_inference_dir : None
[2021/11/29 00:53:13] root INFO:     save_model_dir : ../drive/MyDrive/LicensePlate/weights/det/ch_det_distill/
[2021/11/29 00:53:13] root INFO:     save_res_path : ./output/det_db/predicts_db.txt
[2021/11/29 00:53:13] root INFO:     use_gpu : True
[2021/11/29 00:53:13] root INFO:     use_visualdl : False
[2021/11/29 00:53:13] root INFO: Loss : 
[2021/11/29 00:53:13] root INFO:     loss_config_list : 
[2021/11/29 00:53:13] root INFO:         DistillationDilaDBLoss : 
[2021/11/29 00:53:13] root INFO:             alpha : 5
[2021/11/29 00:53:13] root INFO:             balance_loss : True
[2021/11/29 00:53:13] root INFO:             beta : 10
[2021/11/29 00:53:13] root INFO:             key : maps
[2021/11/29 00:53:13] root INFO:             main_loss_type : DiceLoss
[2021/11/29 00:53:13] root INFO:             model_name_pairs : [['Student', 'Teacher'], ['Student2', 'Teacher']]
[2021/11/29 00:53:13] root INFO:             ohem_ratio : 3
[2021/11/29 00:53:13] root INFO:             weight : 1.0
[2021/11/29 00:53:13] root INFO:         DistillationDMLLoss : 
[2021/11/29 00:53:13] root INFO:             key : maps
[2021/11/29 00:53:13] root INFO:             maps_name : thrink_maps
[2021/11/29 00:53:13] root INFO:             model_name_pairs : [['Student', 'Student2']]
[2021/11/29 00:53:13] root INFO:             weight : 1.0
[2021/11/29 00:53:13] root INFO:         DistillationDBLoss : 
[2021/11/29 00:53:13] root INFO:             alpha : 5
[2021/11/29 00:53:13] root INFO:             balance_loss : True
[2021/11/29 00:53:13] root INFO:             beta : 10
[2021/11/29 00:53:13] root INFO:             main_loss_type : DiceLoss
[2021/11/29 00:53:13] root INFO:             model_name_list : ['Student', 'Student2']
[2021/11/29 00:53:13] root INFO:             ohem_ratio : 3
[2021/11/29 00:53:13] root INFO:             weight : 1.0
[2021/11/29 00:53:13] root INFO:     name : CombinedLoss
[2021/11/29 00:53:13] root INFO: Metric : 
[2021/11/29 00:53:13] root INFO:     base_metric_name : DetMetric
[2021/11/29 00:53:13] root INFO:     key : Student
[2021/11/29 00:53:13] root INFO:     main_indicator : hmean
[2021/11/29 00:53:13] root INFO:     name : DistillationMetric
[2021/11/29 00:53:13] root INFO: Optimizer : 
[2021/11/29 00:53:13] root INFO:     beta1 : 0.9
[2021/11/29 00:53:13] root INFO:     beta2 : 0.999
[2021/11/29 00:53:13] root INFO:     lr : 
[2021/11/29 00:53:13] root INFO:         learning_rate : 0.001
[2021/11/29 00:53:13] root INFO:         name : Cosine
[2021/11/29 00:53:13] root INFO:         warmup_epoch : 2
[2021/11/29 00:53:13] root INFO:     name : Adam
[2021/11/29 00:53:13] root INFO:     regularizer : 
[2021/11/29 00:53:13] root INFO:         factor : 0
[2021/11/29 00:53:13] root INFO:         name : L2
[2021/11/29 00:53:13] root INFO: PostProcess : 
[2021/11/29 00:53:13] root INFO:     box_thresh : 0.6
[2021/11/29 00:53:13] root INFO:     max_candidates : 1000
[2021/11/29 00:53:13] root INFO:     model_name : ['Student', 'Student2', 'Teacher']
[2021/11/29 00:53:13] root INFO:     name : DistillationDBPostProcess
[2021/11/29 00:53:13] root INFO:     thresh : 0.3
[2021/11/29 00:53:13] root INFO:     unclip_ratio : 1.5
[2021/11/29 00:53:13] root INFO: Train : 
[2021/11/29 00:53:13] root INFO:     dataset : 
[2021/11/29 00:53:13] root INFO:         data_dir : ./dataset_det_v_3/
[2021/11/29 00:53:13] root INFO:         label_file_list : ['./dataset_det_v_3/train/Label.txt']
[2021/11/29 00:53:13] root INFO:         name : SimpleDataSet
[2021/11/29 00:53:13] root INFO:         ratio_list : [1.0]
[2021/11/29 00:53:13] root INFO:         transforms : 
[2021/11/29 00:53:13] root INFO:             DecodeImage : 
[2021/11/29 00:53:13] root INFO:                 channel_first : False
[2021/11/29 00:53:13] root INFO:                 img_mode : BGR
[2021/11/29 00:53:13] root INFO:             DetLabelEncode : None
[2021/11/29 00:53:13] root INFO:             CopyPaste : None
[2021/11/29 00:53:13] root INFO:             IaaAugment : 
[2021/11/29 00:53:13] root INFO:                 augmenter_args : 
[2021/11/29 00:53:13] root INFO:                     args : 
[2021/11/29 00:53:13] root INFO:                         p : 0.5
[2021/11/29 00:53:13] root INFO:                     type : Fliplr
[2021/11/29 00:53:13] root INFO:                     args : 
[2021/11/29 00:53:13] root INFO:                         rotate : [-10, 10]
[2021/11/29 00:53:13] root INFO:                     type : Affine
[2021/11/29 00:53:13] root INFO:                     args : 
[2021/11/29 00:53:13] root INFO:                         size : [0.5, 3]
[2021/11/29 00:53:13] root INFO:                     type : Resize
[2021/11/29 00:53:13] root INFO:             EastRandomCropData : 
[2021/11/29 00:53:13] root INFO:                 keep_ratio : True
[2021/11/29 00:53:13] root INFO:                 max_tries : 50
[2021/11/29 00:53:13] root INFO:                 size : [960, 960]
[2021/11/29 00:53:13] root INFO:             MakeBorderMap : 
[2021/11/29 00:53:13] root INFO:                 shrink_ratio : 0.4
[2021/11/29 00:53:13] root INFO:                 thresh_max : 0.7
[2021/11/29 00:53:13] root INFO:                 thresh_min : 0.3
[2021/11/29 00:53:13] root INFO:             MakeShrinkMap : 
[2021/11/29 00:53:13] root INFO:                 min_text_size : 8
[2021/11/29 00:53:13] root INFO:                 shrink_ratio : 0.4
[2021/11/29 00:53:13] root INFO:             NormalizeImage : 
[2021/11/29 00:53:13] root INFO:                 mean : [0.485, 0.456, 0.406]
[2021/11/29 00:53:13] root INFO:                 order : hwc
[2021/11/29 00:53:13] root INFO:                 scale : 1./255.
[2021/11/29 00:53:13] root INFO:                 std : [0.229, 0.224, 0.225]
[2021/11/29 00:53:13] root INFO:             ToCHWImage : None
[2021/11/29 00:53:13] root INFO:             KeepKeys : 
[2021/11/29 00:53:13] root INFO:                 keep_keys : ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask']
[2021/11/29 00:53:13] root INFO:     loader : 
[2021/11/29 00:53:13] root INFO:         batch_size_per_card : 16
[2021/11/29 00:53:13] root INFO:         drop_last : False
[2021/11/29 00:53:13] root INFO:         num_workers : 2
[2021/11/29 00:53:13] root INFO:         shuffle : True
[2021/11/29 00:53:13] root INFO: train with paddle 2.2.0 and device CUDAPlace(0)
[2021/11/29 00:53:13] root INFO: Initialize indexs of datasets:['./dataset_det_v_3/train/Label.txt']
[2021/11/29 00:53:13] root INFO: Initialize indexs of datasets:['./dataset_det_v_3/eval/Label.txt']
W1129 00:53:13.685276   732 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 6.0, Driver API Version: 11.2, Runtime API Version: 10.2
W1129 00:53:13.689826   732 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2021/11/29 00:53:16] root INFO: loaded pretrained_model successful from ./pretrain_models/ch_PP-OCRv2_det_distill_train/best_accuracy.pdparams
[2021/11/29 00:53:16] root INFO: train dataloader has 790 iters
[2021/11/29 00:53:16] root INFO: valid dataloader has 1385 iters
[2021/11/29 00:53:16] root INFO: During the training process, after the 0th iteration, an evaluation is run every 500 iterations
[2021/11/29 00:53:16] root INFO: Initialize indexs of datasets:['./dataset_det_v_3/train/Label.txt']

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&)
1   paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap, paddle::platform::Place const&, bool, std::map<std::string, std::string, std::less<std::string >, std::allocator<std::pair<std::string const, std::string > > > const&)
2   paddle::imperative::PreparedOp::Run(paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap const&, paddle::framework::AttributeMap const&)
3   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
4   paddle::operators::BatchNormKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
5   paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
6   paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
7   paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
8   paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
9   paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
10  paddle::memory::allocation::AutoGrowthBestFitAllocator::FreeIdleChunks()

----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1638147199 (unix time) try "date -d @1638147199" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x28) received by PID 732 (TID 0x7f24d0b8f780) from PID 40 ***]

INFO 2021-11-29 00:53:24,693 launch_utils.py:340] terminate all the procs
ERROR 2021-11-29 00:53:24,694 launch_utils.py:603] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2021-11-29 00:53:28,698 launch_utils.py:340] terminate all the procs
INFO 2021-11-29 00:53:28,698 launch.py:304] Local processes completed.

lannguyen0910 commented 2 years ago

My solution was to install a different PaddlePaddle build from the official install page: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html

You can try the builds listed there one by one until training runs without the segmentation fault.
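
To narrow down which build to try first, the versions the training log already prints can be pulled out and compared against the CUDA/cuDNN variants offered on the install page. This is only a rough sketch: the regular expressions are written against the three log lines quoted in this issue, the function name is hypothetical, and nothing here confirms that a version mismatch is what caused the crash.

```python
import re

VERSION = r"\d+(?:\.\d+)*"

# Patterns modelled on the log lines in this issue; other Paddle
# versions may word these messages differently.
PATTERNS = {
    "paddle": re.compile(rf"train with paddle ({VERSION})"),
    "compute_capability": re.compile(rf"GPU Compute Capability: ({VERSION})"),
    "driver_cuda": re.compile(rf"Driver API Version: ({VERSION})"),
    "runtime_cuda": re.compile(rf"Runtime API Version: ({VERSION})"),
    "cudnn": re.compile(rf"cuDNN Version: ({VERSION})"),
}


def summarize_versions(log_text):
    """Return the version strings found in a Paddle training log."""
    summary = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            summary[key] = match.group(1)
    return summary


SAMPLE_LOG = """\
[2021/11/29 00:53:13] root INFO: train with paddle 2.2.0 and device CUDAPlace(0)
W1129 00:53:13.685276   732 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 6.0, Driver API Version: 11.2, Runtime API Version: 10.2
W1129 00:53:13.689826   732 device_context.cc:465] device: 0, cuDNN Version: 7.6.
"""

if __name__ == "__main__":
    for key, value in summarize_versions(SAMPLE_LOG).items():
        print(f"{key}: {value}")
```

For the log in this issue that gives paddle 2.2.0, a CUDA 10.2 runtime on an 11.2 driver, and cuDNN 7.6, which is the combination to check against the builds on the install page.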