Change

```yaml
DetResizeForTest:
  limit_side_len: 736
  limit_type: min
```

to

```yaml
DetResizeForTest:
  limit_side_len: 1280
  limit_type: max
```

and try again.
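For context, with `limit_type: min` the image is only scaled up so that its shorter side is at least `limit_side_len`; images that are already large pass through at full resolution, which can spike GPU memory during eval. `limit_type: max` instead scales the image down so its longer side does not exceed `limit_side_len`, bounding peak eval memory. Applied to the Eval section of the config posted below, the change would look like this (a sketch; everything except the `DetResizeForTest` entry is copied from the original config):

```yaml
Eval:
  dataset:
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - DetResizeForTest:
          limit_side_len: 1280  # cap the longer side at 1280 px
          limit_type: max       # was: 736 / min
      - NormalizeImage:
          scale: 1./255.
          mean: [ 0.485, 0.456, 0.406 ]
          std: [ 0.229, 0.224, 0.225 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'image', 'shape', 'polys', 'ignore_tags' ]
```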
Please provide the following information to quickly locate the problem
- System Environment: Ubuntu 18.04
- Version:
- Paddle: paddlepaddle-gpu-2.2.2
- PaddleOCR: paddleocr-2.5.0.2
- Related components: PSENet
- Command: python tools/train.py -c configs/det/det_r50_vd_pse.yml
- Config file:
```yaml
Global:
  use_gpu: true
  epoch_num: 600
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/det_bank_pse/
  save_epoch_step: 600
  # evaluation is run every 125 iterations
  eval_batch_step: [ 0, 1000 ]
  cal_metric_during_train: False
  pretrained_model:
  checkpoints: ./output/det_bank_pse/best_accuracy
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/imgs_en/img_10.jpg
  save_res_path: ./output/det_pse/predicts_pse.txt

Architecture:
  model_type: det
  algorithm: PSE
  Transform:
  Backbone:
    name: ResNet
    layers: 50
  Neck:
    name: FPN
    out_channels: 256
  Head:
    name: PSEHead
    hidden_dim: 256
    out_channels: 7

Loss:
  name: PSELoss
  alpha: 0.7
  ohem_ratio: 3
  kernel_sample_mask: pred
  reduction: none

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Step
    learning_rate: 0.0001
    step_size: 200
    gamma: 0.1
  regularizer:
    name: 'L2'
    factor: 0.0005

PostProcess:
  name: PSEPostProcess
  thresh: 0
  box_thresh: 0.85
  min_area: 16
  box_type: box # 'box' or 'poly'
  scale: 1

Metric:
  name: DetMetric
  main_indicator: hmean

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/det_bank_1/
    label_file_list:
      - ./train_data/det_bank_1/train.txt
    ratio_list: [ 1.0 ]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - ColorJitter:
          brightness: 0.12549019607843137
          saturation: 0.5
      - IaaAugment:
          augmenter_args:
            - { 'type': Resize, 'args': { 'size': [ 0.5, 3 ] } }
            - { 'type': Fliplr, 'args': { 'p': 0.5 } }
            - { 'type': Affine, 'args': { 'rotate': [ -10, 10 ] } }
      - MakePseGt:
          kernel_num: 7
          min_shrink_ratio: 0.4
          size: 640
      - RandomCropImgMask:
          size: [ 640, 640 ]
          main_key: gt_text
          crop_keys: [ 'image', 'gt_text', 'gt_kernels', 'mask' ]
      - NormalizeImage:
          scale: 1./255.
          mean: [ 0.485, 0.456, 0.406 ]
          std: [ 0.229, 0.224, 0.225 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'image', 'gt_text', 'gt_kernels', 'mask' ] # the order of the dataloader list
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 1
    num_workers: 8

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/det_bank_1/
    label_file_list:
      - ./train_data/det_bank_1/test.txt
    ratio_list: [ 1.0 ]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - DetResizeForTest:
          limit_side_len: 736
          limit_type: min
      - NormalizeImage:
          scale: 1./255.
          mean: [ 0.485, 0.456, 0.406 ]
          std: [ 0.229, 0.224, 0.225 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'image', 'shape', 'polys', 'ignore_tags' ]
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 1 # must be 1
    num_workers: 0
```
- Complete Error Message:
```
[2022/05/13 08:14:30] root INFO: epoch: [34/600], iter: 28150, lr: 0.000100, loss_text: 0.078825, iou_text: 0.881593, loss_kernels: 0.126785, iou_kernel: 0.642865, loss: 0.091140, reader_cost: 0.00032 s, batch_cost: 0.08557 s, samples: 10, ips: 11.68617
[2022/05/13 08:14:31] root INFO: epoch: [34/600], iter: 28160, lr: 0.000100, loss_text: 0.072574, iou_text: 0.896482, loss_kernels: 0.102324, iou_kernel: 0.688730, loss: 0.082951, reader_cost: 0.00018 s, batch_cost: 0.08810 s, samples: 10, ips: 11.35080
[2022/05/13 08:14:33] root INFO: epoch: [34/600], iter: 28170, lr: 0.000100, loss_text: 0.074684, iou_text: 0.893146, loss_kernels: 0.105411, iou_kernel: 0.679338, loss: 0.085408, reader_cost: 0.00023 s, batch_cost: 0.13131 s, samples: 10, ips: 7.61569
Exception in thread Thread-25:
Traceback (most recent call last):
  File "/data/anaconda3/envs/OCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data
    data = self._data_queue.get(timeout=self._timeout)
  File "/data/anaconda3/envs/OCR/lib/python3.7/multiprocessing/queues.py", line 105, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/anaconda3/envs/OCR/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/data/anaconda3/envs/OCR/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/data/anaconda3/envs/OCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop
    batch = self._get_data()
  File "/data/anaconda3/envs/OCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data
    "pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 3 workers exit unexpectedly, pids: 11884, 11885, 11886

Traceback (most recent call last):
  File "tools/train.py", line 148, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 125, in main
    eval_class, pre_best_model_dict, logger, vdl_writer, scaler)
  File "/data/PaddleOCR-release-2.4/tools/program.py", line 221, in train
    for idx, batch in enumerate(train_dataloader):
  File "/data/anaconda3/envs/OCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 697, in __next__
    data = self._reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:166)
```
- Problem description:
  1. When training the PSENet text detection model, GPU memory usage is low during training but extremely high during eval, and after eval finishes the usage does not drop back down.

During training:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   33C    P0    62W / 300W |   4668MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```
During eval:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   34C    P0    69W / 300W |  22732MiB / 32768MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```
- The GPU can keep running for a few more epochs at this high memory usage, but before long the error above is raised.
- The dataset itself is fine, since the DB algorithm trains on it normally.
- I have tried adjusting num_workers; the error is also raised when it is set to 0 (see the sketch after this list).
- cuda-10.1 cudnn-7.6 @karlhorky @MissPenguin
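For reference, the num_workers adjustment mentioned above only touches the loader section of the config posted earlier; a minimal sketch of the Train loader with that change (all other values unchanged from the original):

```yaml
Train:
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 1
    num_workers: 0   # reduced from 8; the crash reportedly still occurs
```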
Do you have a solution for this problem? I am getting the same error on Google Colab.