Closed limaopeng1 closed 2 years ago
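Judging from the error, this does not look like it was caused by running out of GPU memory. Add --log_dir=./debug/ to the launch command, like this:
python3 -m paddle.distributed.launch --log_dir=./debug/ --gpus '0,1,2,3,4,5,6,7' tools/train.py -c configs/rec/rec_mv3_none_bilstm_ctc.yml
After the error occurs, look through the files in the debug folder. Some of them contain the detailed error message.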
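To avoid opening each file in the debug folder by hand, a small script can list the suspicious lines. This is only a minimal sketch: the function name `find_error_lines` and the keyword list are illustrative assumptions, not part of Paddle. It scans every file in the directory, so it does not depend on how the launcher names its log files.

```python
from pathlib import Path

# Assumed substrings that usually mark a failure; adjust them for your logs.
DEFAULT_KEYWORDS = ("Error", "Traceback", "Segmentation fault", "SIGSEGV")


def find_error_lines(log_dir, keywords=DEFAULT_KEYWORDS):
    """Return {file name: [(line number, text), ...]} for lines containing a keyword."""
    root = Path(log_dir)
    if not root.is_dir():
        # Nothing to scan, e.g. the launcher has not been run with --log_dir yet.
        return {}
    hits = {}
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains stray bytes.
        with path.open("r", encoding="utf-8", errors="replace") as fh:
            matches = [
                (number, line.rstrip("\n"))
                for number, line in enumerate(fh, start=1)
                if any(keyword in line for keyword in keywords)
            ]
        if matches:
            hits[path.name] = matches
    return hits


if __name__ == "__main__":
    # ./debug/ is the directory passed to --log_dir in the command above.
    for name, matches in find_error_lines("./debug/").items():
        print(f"== {name}")
        for number, text in matches:
            print(f"{number}: {text}")
```

The lines printed just before the first match in a worker's log are usually the most informative, since the C++ traceback only shows the signal handler.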
Since you haven't replied for more than 3 months, we have closed this issue/PR. If the problem is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up. We recommend pulling and trying the latest code first.
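I am training on four 2080 Ti (11 GB) cards. batch_size_per_card can only be set to 16, even though each GPU then uses only 3432 MB of memory. Setting batch_size_per_card to 32 fails with this error: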
C++ Traceback (most recent call last):
0 paddle::framework::SignalHandle(char const*, int)
1 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
Error Message Summary:
FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: Aborted at 1631087015 (unix time) try "date -d @1631087015" if you are using GNU date ]
[SignalInfo: SIGSEGV (@0x0) received by PID 44302 (TID 0x7f4af1e78740) from PID 0 ]
INFO 2021-09-08 15:43:39,219 launch_utils.py:327] terminate all the procs
ERROR 2021-09-08 15:43:39,220 launch_utils.py:584] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 1, 2, 3] was aborted. Please check its log.
INFO 2021-09-08 15:43:42,223 launch_utils.py:327] terminate all the procs
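The TimeInfo line suggests `date -d @1631087015`, which needs GNU date. The same conversion can be done in Python, which helps when matching the abort time against the launcher timestamps. In this sketch the helper name `abort_time` is hypothetical, and the UTC+8 offset is an assumption about the machine's local time zone (it is consistent with the 15:43:39 launcher lines, but the log does not state it).

```python
from datetime import datetime, timedelta, timezone


def abort_time(unix_seconds, utc_offset_hours=0):
    """Convert a Unix timestamp to an aware datetime at the given UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(unix_seconds, tz=tz)


# Timestamp taken from the TimeInfo line of the log above.
print(abort_time(1631087015).isoformat())
# 2021-09-08T07:43:35+00:00

# Assumed UTC+8 local time, to compare with the launch_utils.py lines.
print(abort_time(1631087015, utc_offset_hours=8).isoformat())
# 2021-09-08T15:43:35+08:00
```

Under that assumption the segmentation fault was recorded about four seconds before the launcher reported "terminate all the procs".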
This is the configuration I used for training:

Global:
  use_gpu: true
  epoch_num: 100
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/rec/r34_vd_tps_bilstm_ctc/
  save_epoch_step: 3
  eval_batch_step: [1000, 1000]
  cal_metric_during_train: True
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_visualdl: True
  infer_img: #doc/imgs_words_en/word_10.png
  character_dict_path: ./ppocr/utils/ppocr_keys_v1_add_pinyin.txt
  character_type: ch
  max_text_length: 100
  infer_mode: False
  use_space_char: True
  save_res_path: ./output/rec/predicts_r34_vd_tps_bilstm_ctc.txt

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
  regularizer:
    name: 'L2'
    factor: 0.00001

Architecture:
  model_type: rec
  algorithm: CRNN
  Transform:
  Backbone:
    name: ResNet
    layers: 34
  Neck:
    name: SequenceEncoder
    encoder_type: rnn
    hidden_size: 256
  Head:
    name: CTCHead
    fc_decay: 0.00001

Loss:
  name: CTCLoss

PostProcess:
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc

Train:
  dataset:
    name: SimpleDataSet
    data_dir: /var/ftp/ocr/zyl/cn_en_recognize_synthesis_datas/
    label_file_list: ["/root/limaopeng/paddle/ppocr_train_data/crnn/rec_gt_train.txt"]
    transforms:

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /var/ftp/ocr/zyl/cn_en_recognize_synthesis_datas/testset_imgs/
    label_file_list: ["/root/limaopeng/paddle/ppocr_train_data/crnn/rec_gt_test.txt"]
    transforms: