Training the text detection module on its own in AI Studio fails with "DataLoader reader thread raised an exception!"

immense8342 commented 3 years ago

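I followed this document to run finetune training and got an error. It looks as if the dataset failed to load? But the dataset has definitely been downloaded, and I can open the directory directly.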

# Single-machine, single-GPU training of the det_r50_vd model
%cd ../PaddleOCR/

!pip install imgaug 
!pip install pyclipper 
!pip install lmdb 
!pip install Levenshtein

!python3 tools/train.py -c configs/det/det_r50_vd_db.yml \
     -o Global.pretrain_weights=./pretrain_models/ResNet50_vd_ssld_pretrained/
"""
# Single-machine, multi-GPU training; use the --gpus argument to set the GPU IDs
!python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/det/det_r50_vd_db.yml \
     -o Global.pretrain_weights=./pretrain_models/ResNet50_vd_ssld_pretrained/
"""

/home/aistudio/PaddleOCR
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: imgaug in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (0.4.0)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from imgaug) (1.6.3)
Requirement already satisfied: Pillow in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from imgaug) (7.1.2)
Requirement already satisfied: numpy>=1.15 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from imgaug) (1.20.3)
Requirement already satisfied: opencv-python in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from imgaug) (4.1.1.26)
Requirement already satisfied: matplotlib in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from imgaug) (2.2.3)
Requirement already satisfied: Shapely in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from imgaug) (1.7.1)
Requirement already satisfied: six in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from imgaug) (1.15.0)
Requirement already satisfied: imageio in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from imgaug) (2.6.1)
Requirement already satisfied: scikit-image>=0.14.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from imgaug) (0.18.1)
Requirement already satisfied: python-dateutil>=2.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from matplotlib->imgaug) (2.8.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from matplotlib->imgaug) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from matplotlib->imgaug) (0.10.0)
Requirement already satisfied: pytz in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from matplotlib->imgaug) (2019.3)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from matplotlib->imgaug) (2.4.2)
Requirement already satisfied: networkx>=2.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-image>=0.14.2->imgaug) (2.4)
Requirement already satisfied: PyWavelets>=1.1.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-image>=0.14.2->imgaug) (1.1.1)
Requirement already satisfied: tifffile>=2019.7.26 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-image>=0.14.2->imgaug) (2021.4.8)
Requirement already satisfied: setuptools in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib->imgaug) (56.2.0)
Requirement already satisfied: decorator>=4.3.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from networkx>=2.0->scikit-image>=0.14.2->imgaug) (4.4.2)
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: pyclipper in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (1.2.1)
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: lmdb in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (1.2.1)
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: Levenshtein in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (0.12.0)
Requirement already satisfied: setuptools in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from Levenshtein) (56.2.0)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
[2021/05/28 09:49:48] root INFO: Architecture : 
[2021/05/28 09:49:48] root INFO:     Backbone : 
[2021/05/28 09:49:48] root INFO:         layers : 50
[2021/05/28 09:49:48] root INFO:         name : ResNet
[2021/05/28 09:49:48] root INFO:     Head : 
[2021/05/28 09:49:48] root INFO:         k : 50
[2021/05/28 09:49:48] root INFO:         name : DBHead
[2021/05/28 09:49:48] root INFO:     Neck : 
[2021/05/28 09:49:48] root INFO:         name : DBFPN
[2021/05/28 09:49:48] root INFO:         out_channels : 256
[2021/05/28 09:49:48] root INFO:     Transform : None
[2021/05/28 09:49:48] root INFO:     algorithm : DB
[2021/05/28 09:49:48] root INFO:     model_type : det
[2021/05/28 09:49:48] root INFO: Eval : 
[2021/05/28 09:49:48] root INFO:     dataset : 
[2021/05/28 09:49:48] root INFO:         data_dir : ./train_data/icdar2015/text_localization/
[2021/05/28 09:49:48] root INFO:         label_file_list : ['./train_data/icdar2015/text_localization/test_icdar2015_label.txt']
[2021/05/28 09:49:48] root INFO:         name : SimpleDataSet
[2021/05/28 09:49:48] root INFO:         transforms : 
[2021/05/28 09:49:48] root INFO:             DecodeImage : 
[2021/05/28 09:49:48] root INFO:                 channel_first : False
[2021/05/28 09:49:48] root INFO:                 img_mode : BGR
[2021/05/28 09:49:48] root INFO:             DetLabelEncode : None
[2021/05/28 09:49:48] root INFO:             DetResizeForTest : 
[2021/05/28 09:49:48] root INFO:                 image_shape : [736, 1280]
[2021/05/28 09:49:48] root INFO:             NormalizeImage : 
[2021/05/28 09:49:48] root INFO:                 mean : [0.485, 0.456, 0.406]
[2021/05/28 09:49:48] root INFO:                 order : hwc
[2021/05/28 09:49:48] root INFO:                 scale : 1./255.
[2021/05/28 09:49:48] root INFO:                 std : [0.229, 0.224, 0.225]
[2021/05/28 09:49:48] root INFO:             ToCHWImage : None
[2021/05/28 09:49:48] root INFO:             KeepKeys : 
[2021/05/28 09:49:48] root INFO:                 keep_keys : ['image', 'shape', 'polys', 'ignore_tags']
[2021/05/28 09:49:48] root INFO:     loader : 
[2021/05/28 09:49:48] root INFO:         batch_size_per_card : 1
[2021/05/28 09:49:48] root INFO:         drop_last : False
[2021/05/28 09:49:48] root INFO:         num_workers : 8
[2021/05/28 09:49:48] root INFO:         shuffle : False
[2021/05/28 09:49:48] root INFO: Global : 
[2021/05/28 09:49:48] root INFO:     cal_metric_during_train : False
[2021/05/28 09:49:48] root INFO:     checkpoints : None
[2021/05/28 09:49:48] root INFO:     debug : False
[2021/05/28 09:49:48] root INFO:     distributed : False
[2021/05/28 09:49:48] root INFO:     epoch_num : 1200
[2021/05/28 09:49:48] root INFO:     eval_batch_step : [0, 2000]
[2021/05/28 09:49:48] root INFO:     infer_img : doc/imgs_en/img_10.jpg
[2021/05/28 09:49:48] root INFO:     load_static_weights : True
[2021/05/28 09:49:48] root INFO:     log_smooth_window : 20
[2021/05/28 09:49:48] root INFO:     pretrain_weights : ./pretrain_models/ResNet50_vd_ssld_pretrained/
[2021/05/28 09:49:48] root INFO:     pretrained_model : ./pretrain_models/ResNet50_vd_ssld_pretrained
[2021/05/28 09:49:48] root INFO:     print_batch_step : 10
[2021/05/28 09:49:48] root INFO:     save_epoch_step : 1200
[2021/05/28 09:49:48] root INFO:     save_inference_dir : None
[2021/05/28 09:49:48] root INFO:     save_model_dir : ./output/det_r50_vd/
[2021/05/28 09:49:48] root INFO:     save_res_path : ./output/det_db/predicts_db.txt
[2021/05/28 09:49:48] root INFO:     use_gpu : True
[2021/05/28 09:49:48] root INFO:     use_visualdl : False
[2021/05/28 09:49:48] root INFO: Loss : 
[2021/05/28 09:49:48] root INFO:     alpha : 5
[2021/05/28 09:49:48] root INFO:     balance_loss : True
[2021/05/28 09:49:48] root INFO:     beta : 10
[2021/05/28 09:49:48] root INFO:     main_loss_type : DiceLoss
[2021/05/28 09:49:48] root INFO:     name : DBLoss
[2021/05/28 09:49:48] root INFO:     ohem_ratio : 3
[2021/05/28 09:49:48] root INFO: Metric : 
[2021/05/28 09:49:48] root INFO:     main_indicator : hmean
[2021/05/28 09:49:48] root INFO:     name : DetMetric
[2021/05/28 09:49:48] root INFO: Optimizer : 
[2021/05/28 09:49:48] root INFO:     beta1 : 0.9
[2021/05/28 09:49:48] root INFO:     beta2 : 0.999
[2021/05/28 09:49:48] root INFO:     lr : 
[2021/05/28 09:49:48] root INFO:         learning_rate : 0.001
[2021/05/28 09:49:48] root INFO:     name : Adam
[2021/05/28 09:49:48] root INFO:     regularizer : 
[2021/05/28 09:49:48] root INFO:         factor : 0
[2021/05/28 09:49:48] root INFO:         name : L2
[2021/05/28 09:49:48] root INFO: PostProcess : 
[2021/05/28 09:49:48] root INFO:     box_thresh : 0.7
[2021/05/28 09:49:48] root INFO:     max_candidates : 1000
[2021/05/28 09:49:48] root INFO:     name : DBPostProcess
[2021/05/28 09:49:48] root INFO:     thresh : 0.3
[2021/05/28 09:49:48] root INFO:     unclip_ratio : 1.5
[2021/05/28 09:49:48] root INFO: Train : 
[2021/05/28 09:49:48] root INFO:     dataset : 
[2021/05/28 09:49:48] root INFO:         data_dir : ./train_data/icdar2015/text_localization/
[2021/05/28 09:49:48] root INFO:         label_file_list : ['./train_data/icdar2015/text_localization/train_icdar2015_label.txt']
[2021/05/28 09:49:48] root INFO:         name : SimpleDataSet
[2021/05/28 09:49:48] root INFO:         ratio_list : [1.0]
[2021/05/28 09:49:48] root INFO:         transforms : 
[2021/05/28 09:49:48] root INFO:             DecodeImage : 
[2021/05/28 09:49:48] root INFO:                 channel_first : False
[2021/05/28 09:49:48] root INFO:                 img_mode : BGR
[2021/05/28 09:49:48] root INFO:             DetLabelEncode : None
[2021/05/28 09:49:48] root INFO:             IaaAugment : 
[2021/05/28 09:49:48] root INFO:                 augmenter_args : 
[2021/05/28 09:49:48] root INFO:                     args : 
[2021/05/28 09:49:48] root INFO:                         p : 0.5
[2021/05/28 09:49:48] root INFO:                     type : Fliplr
[2021/05/28 09:49:48] root INFO:                     args : 
[2021/05/28 09:49:48] root INFO:                         rotate : [-10, 10]
[2021/05/28 09:49:48] root INFO:                     type : Affine
[2021/05/28 09:49:48] root INFO:                     args : 
[2021/05/28 09:49:48] root INFO:                         size : [0.5, 3]
[2021/05/28 09:49:48] root INFO:                     type : Resize
[2021/05/28 09:49:48] root INFO:             EastRandomCropData : 
[2021/05/28 09:49:48] root INFO:                 keep_ratio : True
[2021/05/28 09:49:48] root INFO:                 max_tries : 50
[2021/05/28 09:49:48] root INFO:                 size : [640, 640]
[2021/05/28 09:49:48] root INFO:             MakeBorderMap : 
[2021/05/28 09:49:48] root INFO:                 shrink_ratio : 0.4
[2021/05/28 09:49:48] root INFO:                 thresh_max : 0.7
[2021/05/28 09:49:48] root INFO:                 thresh_min : 0.3
[2021/05/28 09:49:48] root INFO:             MakeShrinkMap : 
[2021/05/28 09:49:48] root INFO:                 min_text_size : 8
[2021/05/28 09:49:48] root INFO:                 shrink_ratio : 0.4
[2021/05/28 09:49:48] root INFO:             NormalizeImage : 
[2021/05/28 09:49:48] root INFO:                 mean : [0.485, 0.456, 0.406]
[2021/05/28 09:49:48] root INFO:                 order : hwc
[2021/05/28 09:49:48] root INFO:                 scale : 1./255.
[2021/05/28 09:49:48] root INFO:                 std : [0.229, 0.224, 0.225]
[2021/05/28 09:49:48] root INFO:             ToCHWImage : None
[2021/05/28 09:49:48] root INFO:             KeepKeys : 
[2021/05/28 09:49:48] root INFO:                 keep_keys : ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask']
[2021/05/28 09:49:48] root INFO:     loader : 
[2021/05/28 09:49:48] root INFO:         batch_size_per_card : 16
[2021/05/28 09:49:48] root INFO:         drop_last : False
[2021/05/28 09:49:48] root INFO:         num_workers : 1
[2021/05/28 09:49:48] root INFO:         shuffle : True
[2021/05/28 09:49:48] root INFO: train with paddle 2.0.2 and device CUDAPlace(0)
[2021/05/28 09:49:48] root INFO: Initialize indexs of datasets:['./train_data/icdar2015/text_localization/train_icdar2015_label.txt']
[2021/05/28 09:49:48] root INFO: Initialize indexs of datasets:['./train_data/icdar2015/text_localization/test_icdar2015_label.txt']
W0528 09:49:48.468700 10893 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0528 09:49:48.473711 10893 device_context.cc:372] device: 0, cuDNN Version: 7.6.
[2021/05/28 09:49:54] root INFO: load pretrained model from ['./pretrain_models/ResNet50_vd_ssld_pretrained']
[2021/05/28 09:49:54] root INFO: train dataloader has 63 iters
[2021/05/28 09:49:54] root INFO: valid dataloader has 500 iters
[2021/05/28 09:49:54] root INFO: During the training process, after the 0th iteration, an evaluation is run every 2000 iterations
[2021/05/28 09:49:54] root INFO: Initialize indexs of datasets:['./train_data/icdar2015/text_localization/train_icdar2015_label.txt']
2021-05-28 09:49:59,340 - ERROR - DataLoader reader thread raised an exception!
Traceback (most recent call last):
  File "tools/train.py", line 125, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 102, in main
    eval_class, pre_best_model_dict, logger, vdl_writer)
  File "/home/aistudio/PaddleOCR/tools/program.py", line 199, in train
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 684, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 616, in _thread_loop
    batch = self._get_data()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 700, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 10956
    for idx, batch in enumerate(train_dataloader):

  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 779, in __next__
    data = self._reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:158)

"\n# Single-machine, multi-GPU training; use the --gpus argument to set the GPU IDs\n!python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/det/det_r50_vd_db.yml      -o Global.pretrain_weights=./pretrain_models/ResNet50_vd_ssld_pretrained/\n"

Could this be a bug? Running

!python3 tools/train.py -c configs/det/det_mv3_db.yml \
     -o Global.pretrain_weights=./pretrain_models/MobileNetV3_large_x0_5_pretrained/

instead works fine.

Could someone please take a look at what is going on? Thanks♪(･ω･)ﾉ

littletomatodonkey commented 3 years ago

Could you try changing num_workers in the config file to 0?
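As a rough sketch of the suggestion above, the setting can probably also be tried without editing the YAML, through the same -o override that the original command already uses for Global.pretrain_weights. The key names Train.loader.num_workers and Eval.loader.num_workers are taken from the config dump in the log; build_train_command is a hypothetical helper written only for illustration, and it is an assumption that tools/train.py accepts several key=value pairs after a single -o.

```python
import shlex


def build_train_command(config_path, overrides):
    """Build the argv list for tools/train.py with optional -o key=value overrides.

    Hypothetical helper for illustration; it is not part of PaddleOCR.
    """
    cmd = ["python3", "tools/train.py", "-c", config_path]
    if overrides:
        cmd.append("-o")
        cmd.extend(f"{key}={value}" for key, value in overrides.items())
    return cmd


cmd = build_train_command(
    "configs/det/det_r50_vd_db.yml",
    {
        "Global.pretrain_weights": "./pretrain_models/ResNet50_vd_ssld_pretrained/",
        # 0 means batches are loaded in the main process, with no worker subprocesses.
        "Train.loader.num_workers": 0,
        "Eval.loader.num_workers": 0,
    },
)

# Print a shell-ready command line that can be pasted after "!" in a notebook cell.
print(" ".join(shlex.quote(part) for part in cmd))
```

Editing num_workers directly in configs/det/det_r50_vd_db.yml, as suggested, has the same intent; the override just keeps the config file unchanged while experimenting.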

immense8342 commented 3 years ago

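Could you try changing num_workers in the config file to 0?

The project link is here: click me

Thanks! It seems to run now. But does setting num_workers to 0 have any side effects? I see that det_mv3_db.yml also uses 8.

I looked it up, and it seems to be "the number of subprocesses used to load data".

Then why is det_mv3_db.yml not affected? Is it because the machine I started is too weak? A V100 isn't that bad, is it?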

littletomatodonkey commented 3 years ago

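This is related to shared memory. Run df -h to check how large your /dev/shm is. To use multiple worker processes, that space needs to be larger than 1G at the very least; I usually use around 8G.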
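The same check can be sketched in Python for use inside a notebook cell. This is only an illustration: shm_total_bytes and shm_is_large_enough are hypothetical helper names, and the 1 GiB threshold simply mirrors the "larger than 1G" rule of thumb above. Note that if /dev/shm is a plain directory rather than its own mount, the reported size is that of the filesystem containing it.

```python
import os
import shutil

MIN_SHM_BYTES = 1 << 30  # 1 GiB, the rule-of-thumb minimum mentioned above


def shm_total_bytes(path="/dev/shm"):
    """Return the total size in bytes of the filesystem at `path`, or None if it is missing."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


def shm_is_large_enough(total_bytes, minimum=MIN_SHM_BYTES):
    """True only when shared memory exists and is strictly larger than `minimum`."""
    return total_bytes is not None and total_bytes > minimum


total = shm_total_bytes()
print("shared memory size in bytes:", total)
print("large enough for multi-process loading:", shm_is_large_enough(total))
```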

ddz-mark commented 3 years ago

Same problem here: single-GPU training works, but multi-GPU training fails. I'm not using a Docker environment.

BothCats commented 3 years ago

So would simply turning off shared memory solve this problem?

thgpddl commented 3 years ago

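This is related to shared memory. Run df -h to check how large your /dev/shm is. To use multiple worker processes, that space needs to be larger than 1G at the very least; I usually use around 8G.

I'm using PaddlePaddle's notebook environment, and changing num_workers to 0 worked. df -h shows no /dev/shm at all, so the PaddlePaddle environment presumably has no shared memory.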
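Following that observation, a notebook could pick the worker count automatically. The sketch below is an assumption-laden illustration, not PaddleOCR code: has_shm_mount, choose_num_workers and read_mounts are hypothetical helpers, and the check relies on the Linux /proc/mounts table, in which the second field of each line is the mount point.

```python
def has_shm_mount(mounts_text, mount_point="/dev/shm"):
    """Return True if `mount_point` is listed as a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


def choose_num_workers(requested, shm_mounted):
    """Keep the requested worker count only when a shared memory mount exists; else use 0."""
    return requested if shm_mounted else 0


def read_mounts(path="/proc/mounts"):
    """Read the mount table, returning an empty string where it is unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


mounted = has_shm_mount(read_mounts())
print("/dev/shm mounted:", mounted)
print("num_workers to use:", choose_num_workers(8, mounted))
```

Falling back to 0 trades loading speed for robustness, since batches are then prepared in the main process.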

PaddlePaddle / PaddleOCR