Recognition model GPU memory issue

limaopeng1 commented 3 years ago

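I am training on four 2080 Ti (11 GB) cards. batch_size_per_card can only be set to 16, even though each GPU then uses only 3432 MB of memory. Setting batch_size_per_card to 32 fails with this error: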

C++ Traceback (most recent call last):

0 paddle::framework::SignalHandle(char const*, int)
1 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

Error Message Summary:

FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: Aborted at 1631087015 (unix time) try "date -d @1631087015" if you are using GNU date ]
[SignalInfo: SIGSEGV (@0x0) received by PID 44302 (TID 0x7f4af1e78740) from PID 0 ]
INFO 2021-09-08 15:43:39,219 launch_utils.py:327] terminate all the procs
ERROR 2021-09-08 15:43:39,220 launch_utils.py:584] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 1, 2, 3] was aborted. Please check its log.
INFO 2021-09-08 15:43:42,223 launch_utils.py:327] terminate all the procs

This is the configuration I used for training:

Global:
  use_gpu: true
  epoch_num: 100
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/rec/r34_vd_tps_bilstm_ctc/
  save_epoch_step: 3
  eval_batch_step: [1000, 1000]
  cal_metric_during_train: True
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_visualdl: True
  infer_img: #doc/imgs_words_en/word_10.png
  character_dict_path: ./ppocr/utils/ppocr_keys_v1_add_pinyin.txt
  character_type: ch
  max_text_length: 100
  infer_mode: False
  use_space_char: True
  save_res_path: ./output/rec/predicts_r34_vd_tps_bilstm_ctc.txt

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
  regularizer:
    name: 'L2'
    factor: 0.00001

Architecture:
  model_type: rec
  algorithm: CRNN
  Transform:
  Backbone:
    name: ResNet
    layers: 34
  Neck:
    name: SequenceEncoder
    encoder_type: rnn
    hidden_size: 256
  Head:
    name: CTCHead
    fc_decay: 0.00001

Loss:
  name: CTCLoss

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: SimpleDataSet
    data_dir: /var/ftp/ocr/zyl/cn_en_recognize_synthesis_datas/
    label_file_list: ["/root/limaopeng/paddle/ppocr_train_data/crnn/rec_gt_train.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 320]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 16
    drop_last: True
    num_workers: 8

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /var/ftp/ocr/zyl/cn_en_recognize_synthesis_datas/testset_imgs/
    label_file_list: ["/root/limaopeng/paddle/ppocr_train_data/crnn/rec_gt_test.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 320]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 16
    num_workers: 4

LDOUBLEV commented 3 years ago

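Judging from the error, this does not look like it was caused by running out of GPU memory. Add --log_dir=./debug/ to the launch command, like this:

python3 -m paddle.distributed.launch --log_dir=./debug/ --gpus '0,1,2,3,4,5,6,7' tools/train.py -c configs/rec/rec_mv3_none_bilstm_ctc.yml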

After the error occurs, check the files in the debug folder. Some of them contain detailed error information.
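One way to go through those files quickly is a small script like the one below. This is only a sketch: `find_error_lines` and the keyword list are hypothetical helpers written for this illustration and are not part of Paddle or PaddleOCR. It assumes only that the launcher writes plain-text log files into the directory passed to --log_dir.

```python
from pathlib import Path

# Substrings that often mark a failure in a log line.
# This list is an assumption for illustration; adjust it to your logs.
KEYWORDS = ("Error", "Traceback", "FatalError", "SIGSEGV")


def find_error_lines(log_dir, keywords=KEYWORDS, context=2):
    """Return {file name: [(line number, text), ...]} for lines near a keyword hit."""
    root = Path(log_dir)
    if not root.is_dir():
        # Nothing to scan, for example when training has not been launched yet.
        return {}
    hits = {}
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        lines = path.read_text(errors="replace").splitlines()
        keep = set()
        for i, line in enumerate(lines):
            if any(k in line for k in keywords):
                # Keep a few surrounding lines so the message has context.
                start = max(0, i - context)
                stop = min(len(lines), i + context + 1)
                keep.update(range(start, stop))
        if keep:
            hits[path.name] = [(i + 1, lines[i]) for i in sorted(keep)]
    return hits


if __name__ == "__main__":
    for name, found in find_error_lines("./debug/").items():
        print(f"== {name} ==")
        for lineno, text in found:
            print(f"{lineno:6d}: {text}")
```

Run it from the directory where training was launched. Files with no keyword match are skipped, so the output points straight at the worker logs that recorded a failure.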

paddle-bot-old[bot] commented 2 years ago

Since you haven't replied for more than 3 months, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up. It is recommended to pull and try the latest code first.

PaddlePaddle / PaddleOCR

Recognition model GPU memory issue #3976