Whenever I train text detection, the process ends up killed

ihholmes-p commented 3 years ago

python3 tools/train.py -c configs/det/det_mv3_db.yml -o Global.use_gpu=false
[2021/05/25 15:24:23] root INFO: Architecture :
[2021/05/25 15:24:23] root INFO:     Backbone :
[2021/05/25 15:24:23] root INFO:         model_name : large
[2021/05/25 15:24:23] root INFO:         name : MobileNetV3
[2021/05/25 15:24:23] root INFO:         scale : 0.5
[2021/05/25 15:24:23] root INFO:     Head :
[2021/05/25 15:24:23] root INFO:         k : 50
[2021/05/25 15:24:23] root INFO:         name : DBHead
[2021/05/25 15:24:23] root INFO:     Neck :
[2021/05/25 15:24:23] root INFO:         name : DBFPN
[2021/05/25 15:24:23] root INFO:         out_channels : 256
[2021/05/25 15:24:23] root INFO:     Transform : None
[2021/05/25 15:24:23] root INFO:     algorithm : DB
[2021/05/25 15:24:23] root INFO:     model_type : det
[2021/05/25 15:24:23] root INFO: Eval :
[2021/05/25 15:24:23] root INFO:     dataset :
[2021/05/25 15:24:23] root INFO:         data_dir : ./train_data/icdar2015/text_localization/
[2021/05/25 15:24:23] root INFO:         label_file_list : ['./train_data/icdar2015/text_localization/test_icdar2015_label.txt']
[2021/05/25 15:24:23] root INFO:         name : SimpleDataSet
[2021/05/25 15:24:23] root INFO:         transforms :
[2021/05/25 15:24:23] root INFO:             DecodeImage :
[2021/05/25 15:24:23] root INFO:                 channel_first : False
[2021/05/25 15:24:23] root INFO:                 img_mode : BGR
[2021/05/25 15:24:23] root INFO:             DetLabelEncode : None
[2021/05/25 15:24:23] root INFO:             DetResizeForTest :
[2021/05/25 15:24:23] root INFO:                 image_shape : [736, 1280]
[2021/05/25 15:24:23] root INFO:             NormalizeImage :
[2021/05/25 15:24:23] root INFO:                 mean : [0.485, 0.456, 0.406]
[2021/05/25 15:24:23] root INFO:                 order : hwc
[2021/05/25 15:24:23] root INFO:                 scale : 1./255.
[2021/05/25 15:24:23] root INFO:                 std : [0.229, 0.224, 0.225]
[2021/05/25 15:24:23] root INFO:             ToCHWImage : None
[2021/05/25 15:24:23] root INFO:             KeepKeys :
[2021/05/25 15:24:23] root INFO:                 keep_keys : ['image', 'shape', 'polys', 'ignore_tags']
[2021/05/25 15:24:23] root INFO:     loader :
[2021/05/25 15:24:23] root INFO:         batch_size_per_card : 1
[2021/05/25 15:24:23] root INFO:         drop_last : False
[2021/05/25 15:24:23] root INFO:         num_workers : 0
[2021/05/25 15:24:23] root INFO:         shuffle : False
[2021/05/25 15:24:23] root INFO:         use_shared_memory : False
[2021/05/25 15:24:23] root INFO: Global :
[2021/05/25 15:24:23] root INFO:     cal_metric_during_train : False
[2021/05/25 15:24:23] root INFO:     checkpoints : None
[2021/05/25 15:24:23] root INFO:     debug : False
[2021/05/25 15:24:23] root INFO:     distributed : False
[2021/05/25 15:24:23] root INFO:     epoch_num : 1200
[2021/05/25 15:24:23] root INFO:     eval_batch_step : [0, 2000]
[2021/05/25 15:24:23] root INFO:     infer_img : doc/imgs_en/img_10.jpg
[2021/05/25 15:24:23] root INFO:     log_smooth_window : 20
[2021/05/25 15:24:23] root INFO:     pretrained_model : ./pretrain_models/MobileNetV3_large_x0_5_pretrained
[2021/05/25 15:24:23] root INFO:     print_batch_step : 10
[2021/05/25 15:24:23] root INFO:     save_epoch_step : 1200
[2021/05/25 15:24:23] root INFO:     save_inference_dir : None
[2021/05/25 15:24:23] root INFO:     save_model_dir : ./output/db_mv3/
[2021/05/25 15:24:23] root INFO:     save_res_path : ./output/det_db/predicts_db.txt
[2021/05/25 15:24:23] root INFO:     use_gpu : False
[2021/05/25 15:24:23] root INFO:     use_visualdl : False
[2021/05/25 15:24:23] root INFO: Loss :
[2021/05/25 15:24:23] root INFO:     alpha : 5
[2021/05/25 15:24:23] root INFO:     balance_loss : True
[2021/05/25 15:24:23] root INFO:     beta : 10
[2021/05/25 15:24:23] root INFO:     main_loss_type : DiceLoss
[2021/05/25 15:24:23] root INFO:     name : DBLoss
[2021/05/25 15:24:23] root INFO:     ohem_ratio : 3
[2021/05/25 15:24:23] root INFO: Metric :
[2021/05/25 15:24:23] root INFO:     main_indicator : hmean
[2021/05/25 15:24:23] root INFO:     name : DetMetric
[2021/05/25 15:24:23] root INFO: Optimizer :
[2021/05/25 15:24:23] root INFO:     beta1 : 0.9
[2021/05/25 15:24:23] root INFO:     beta2 : 0.999
[2021/05/25 15:24:23] root INFO:     lr :
[2021/05/25 15:24:23] root INFO:         learning_rate : 0.001
[2021/05/25 15:24:23] root INFO:     name : Adam
[2021/05/25 15:24:23] root INFO:     regularizer :
[2021/05/25 15:24:23] root INFO:         factor : 0
[2021/05/25 15:24:23] root INFO:         name : L2
[2021/05/25 15:24:23] root INFO: PostProcess :
[2021/05/25 15:24:23] root INFO:     box_thresh : 0.6
[2021/05/25 15:24:23] root INFO:     max_candidates : 1000
[2021/05/25 15:24:23] root INFO:     name : DBPostProcess
[2021/05/25 15:24:23] root INFO:     thresh : 0.3
[2021/05/25 15:24:23] root INFO:     unclip_ratio : 1.5
[2021/05/25 15:24:23] root INFO: Train :
[2021/05/25 15:24:23] root INFO:     dataset :
[2021/05/25 15:24:23] root INFO:         data_dir : ./train_data/icdar2015/text_localization/
[2021/05/25 15:24:23] root INFO:         label_file_list : ['./train_data/icdar2015/text_localization/train_icdar2015_label.txt']
[2021/05/25 15:24:23] root INFO:         name : SimpleDataSet
[2021/05/25 15:24:23] root INFO:         ratio_list : [1.0]
[2021/05/25 15:24:23] root INFO:         transforms :
[2021/05/25 15:24:23] root INFO:             DecodeImage :
[2021/05/25 15:24:23] root INFO:                 channel_first : False
[2021/05/25 15:24:23] root INFO:                 img_mode : BGR
[2021/05/25 15:24:23] root INFO:             DetLabelEncode : None
[2021/05/25 15:24:23] root INFO:             IaaAugment :
[2021/05/25 15:24:23] root INFO:                 augmenter_args :
[2021/05/25 15:24:23] root INFO:                     args :
[2021/05/25 15:24:23] root INFO:                         p : 0.5
[2021/05/25 15:24:23] root INFO:                     type : Fliplr
[2021/05/25 15:24:23] root INFO:                     args :
[2021/05/25 15:24:23] root INFO:                         rotate : [-10, 10]
[2021/05/25 15:24:23] root INFO:                     type : Affine
[2021/05/25 15:24:23] root INFO:                     args :
[2021/05/25 15:24:23] root INFO:                         size : [0.5, 3]
[2021/05/25 15:24:23] root INFO:                     type : Resize
[2021/05/25 15:24:23] root INFO:             EastRandomCropData :
[2021/05/25 15:24:23] root INFO:                 keep_ratio : True
[2021/05/25 15:24:23] root INFO:                 max_tries : 50
[2021/05/25 15:24:23] root INFO:                 size : [640, 640]
[2021/05/25 15:24:23] root INFO:             MakeBorderMap :
[2021/05/25 15:24:23] root INFO:                 shrink_ratio : 0.4
[2021/05/25 15:24:23] root INFO:                 thresh_max : 0.7
[2021/05/25 15:24:23] root INFO:                 thresh_min : 0.3
[2021/05/25 15:24:23] root INFO:             MakeShrinkMap :
[2021/05/25 15:24:23] root INFO:                 min_text_size : 8
[2021/05/25 15:24:23] root INFO:                 shrink_ratio : 0.4
[2021/05/25 15:24:23] root INFO:             NormalizeImage :
[2021/05/25 15:24:23] root INFO:                 mean : [0.485, 0.456, 0.406]
[2021/05/25 15:24:23] root INFO:                 order : hwc
[2021/05/25 15:24:23] root INFO:                 scale : 1./255.
[2021/05/25 15:24:23] root INFO:                 std : [0.229, 0.224, 0.225]
[2021/05/25 15:24:23] root INFO:             ToCHWImage : None
[2021/05/25 15:24:23] root INFO:             KeepKeys :
[2021/05/25 15:24:23] root INFO:                 keep_keys : ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask']
[2021/05/25 15:24:23] root INFO:     loader :
[2021/05/25 15:24:23] root INFO:         batch_size_per_card : 16
[2021/05/25 15:24:23] root INFO:         drop_last : False
[2021/05/25 15:24:23] root INFO:         num_workers : 0
[2021/05/25 15:24:23] root INFO:         shuffle : True
[2021/05/25 15:24:23] root INFO:         use_shared_memory : False
[2021/05/25 15:24:23] root INFO: train with paddle 2.0.0 and device CPUPlace
[2021/05/25 15:24:23] root INFO: Initialize indexs of datasets:['./train_data/icdar2015/text_localization/train_icdar2015_label.txt']
[2021/05/25 15:24:23] root INFO: Initialize indexs of datasets:['./train_data/icdar2015/text_localization/test_icdar2015_label.txt']
[2021/05/25 15:24:23] root INFO: load pretrained model from ['./pretrain_models/MobileNetV3_large_x0_5_pretrained']
[2021/05/25 15:24:23] root INFO: train dataloader has 63 iters
[2021/05/25 15:24:23] root INFO: valid dataloader has 500 iters
[2021/05/25 15:24:23] root INFO: During the training process, after the 0th iteration, an evaluation is run every 2000 iterations
[2021/05/25 15:24:23] root INFO: Initialize indexs of datasets:['./train_data/icdar2015/text_localization/train_icdar2015_label.txt']
W0525 15:29:20.668766  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 15:34:17.057350  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 15:39:39.043627  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 15:44:48.151059  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 15:49:35.960748  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 15:54:11.706876  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 15:54:36.844774  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 15:57:18.659723  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 16:00:07.759929  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 16:03:22.694195  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 16:06:48.739529  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 16:07:52.615046  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
W0525 16:08:13.482831  8966 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
Killed

I've tried this on two different machines and hit the same problem on both. I followed the instructions in the guide exactly and used the same model, data, and config file.

LDOUBLEV commented 3 years ago

This may be an out-of-memory (OOM) issue. Have you tried a smaller batch_size_per_card?
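For anyone who wants to try this without editing the YAML file, here is a minimal sketch that builds the same training command with a smaller batch size. It assumes the `-o` option accepts dotted keys for nested sections (`Train.loader.batch_size_per_card`, `Train.loader.num_workers`) in the same way it accepts `Global.use_gpu` in the original command. The helper name `build_train_command` is hypothetical and not part of PaddleOCR.

```python
import shlex


def build_train_command(batch_size, num_workers=0, use_gpu=False,
                        config="configs/det/det_mv3_db.yml"):
    """Return the tools/train.py command as an argument list.

    Assumption: -o takes dotted keys for nested config sections, as it does
    for Global.use_gpu in the command from the original report.
    """
    overrides = [
        f"Global.use_gpu={str(use_gpu).lower()}",
        f"Train.loader.batch_size_per_card={batch_size}",
        f"Train.loader.num_workers={num_workers}",
    ]
    return ["python3", "tools/train.py", "-c", config, "-o", *overrides]


# Print a copy-pasteable command that uses batch size 4 instead of 16.
print(shlex.join(build_train_command(batch_size=4)))
```

If the run survives at a smaller batch size, that supports the OOM explanation. If it is still killed, memory may not be the only factor.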

kbrajwani commented 3 years ago

Hey, my case is the same. The only difference is that training runs for 40 iterations before it is killed. I have batch_size_per_card set to 4 and 2 workers.

W0609 07:10:35.864583 13827 sampler.cpp:139] bvar is busy at sampling for 2 seconds!
ERROR:root:DataLoader reader thread raised an exception!
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 482, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/queues.py", line 105, in get
    raise Empty
queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 411, in _thread_loop
    batch = self._get_data()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 498, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 13839

Traceback (most recent call last):
  File "tools/train.py", line 127, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 104, in main
    eval_class, pre_best_model_dict, logger, vdl_writer)
  File "/home/ubuntu/PaddleOCR/tools/program.py", line 205, in train
    for idx, batch in enumerate(train_dataloader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 585, in __next__
    data = self._reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:166)

Have you found any solution?
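A bare "Killed" and a DataLoader worker that exits unexpectedly are both consistent with the Linux kernel killing a process that ran out of memory, although nothing in this thread confirms that. As a rough check, the sketch below reads `/proc/meminfo` so that available memory can be logged before and during training. It is Linux-only, it assumes the kernel reports a `MemAvailable` field, and the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field_name: integer} dict.

    Sizes are reported by the kernel in kB; plain counters keep their value.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_mib(path="/proc/meminfo"):
    """Return the kernel's MemAvailable estimate in MiB (Linux only)."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return info["MemAvailable"] / 1024


# Example use inside a training loop (not executed here):
#     print(f"available memory: {available_mib():.0f} MiB")
```

If the reported value drops steadily toward zero before the process dies, lowering batch_size_per_card or the number of workers is a reasonable next thing to try.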

paddle-bot-old[bot] commented 3 years ago

Since you haven't replied for more than 3 months, we have closed this issue/PR. If the problem is not solved or there is a follow-up question, please reopen it at any time and we will continue to follow up. It is recommended to pull and try the latest code first.

PaddlePaddle / PaddleOCR

Whenever I train text detection, the process ends up killed #2914