Newcomer asking for help: training appears to hang after printing the log line 'During the training process, after the 0th iteration, an evaluation is run every 400 iterations'

Lijian500 commented 5 months ago

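Problem: When training a model, after the log line 'During the training process, after the 0th iteration, an evaluation is run every 400 iterations' is printed, nothing further happens (no other log output for more than half an hour). How should I troubleshoot this?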

System Environment: Windows 11
Version: Paddle 2.6.1, PaddleOCR 2.7
Related components:
Command Code: python tools/train.py -c pretrain_models/ch_PP-OCRv3_det_cml.yml
Complete Error Message:

[2024/05/09 14:55:16] ppocr INFO: Architecture :
[2024/05/09 14:55:16] ppocr INFO: Models :
[2024/05/09 14:55:16] ppocr INFO: Student :
[2024/05/09 14:55:16] ppocr INFO: Backbone :
[2024/05/09 14:55:16] ppocr INFO: disable_se : True
[2024/05/09 14:55:16] ppocr INFO: model_name : large
[2024/05/09 14:55:16] ppocr INFO: name : MobileNetV3
[2024/05/09 14:55:16] ppocr INFO: scale : 0.5
[2024/05/09 14:55:16] ppocr INFO: Head :
[2024/05/09 14:55:16] ppocr INFO: k : 50
[2024/05/09 14:55:16] ppocr INFO: name : DBHead
[2024/05/09 14:55:16] ppocr INFO: Neck :
[2024/05/09 14:55:16] ppocr INFO: name : RSEFPN
[2024/05/09 14:55:16] ppocr INFO: out_channels : 96
[2024/05/09 14:55:16] ppocr INFO: shortcut : True
[2024/05/09 14:55:16] ppocr INFO: Transform : None
[2024/05/09 14:55:16] ppocr INFO: algorithm : DB
[2024/05/09 14:55:16] ppocr INFO: model_type : det
[2024/05/09 14:55:16] ppocr INFO: pretrained : None
[2024/05/09 14:55:16] ppocr INFO: Student2 :
[2024/05/09 14:55:16] ppocr INFO: Backbone :
[2024/05/09 14:55:16] ppocr INFO: disable_se : True
[2024/05/09 14:55:16] ppocr INFO: model_name : large
[2024/05/09 14:55:16] ppocr INFO: name : MobileNetV3
[2024/05/09 14:55:16] ppocr INFO: scale : 0.5
[2024/05/09 14:55:16] ppocr INFO: Head :
[2024/05/09 14:55:16] ppocr INFO: k : 50
[2024/05/09 14:55:16] ppocr INFO: name : DBHead
[2024/05/09 14:55:16] ppocr INFO: Neck :
[2024/05/09 14:55:16] ppocr INFO: name : RSEFPN
[2024/05/09 14:55:16] ppocr INFO: out_channels : 96
[2024/05/09 14:55:16] ppocr INFO: shortcut : True
[2024/05/09 14:55:16] ppocr INFO: Transform : None
[2024/05/09 14:55:16] ppocr INFO: algorithm : DB
[2024/05/09 14:55:16] ppocr INFO: model_type : det
[2024/05/09 14:55:16] ppocr INFO: pretrained : None
[2024/05/09 14:55:16] ppocr INFO: Teacher :
[2024/05/09 14:55:16] ppocr INFO: Backbone :
[2024/05/09 14:55:16] ppocr INFO: in_channels : 3
[2024/05/09 14:55:16] ppocr INFO: layers : 50
[2024/05/09 14:55:16] ppocr INFO: name : ResNet_vd
[2024/05/09 14:55:16] ppocr INFO: Head :
[2024/05/09 14:55:16] ppocr INFO: k : 50
[2024/05/09 14:55:16] ppocr INFO: kernel_list : [7, 2, 2]
[2024/05/09 14:55:16] ppocr INFO: name : DBHead
[2024/05/09 14:55:16] ppocr INFO: Neck :
[2024/05/09 14:55:16] ppocr INFO: name : LKPAN
[2024/05/09 14:55:16] ppocr INFO: out_channels : 256
[2024/05/09 14:55:16] ppocr INFO: algorithm : DB
[2024/05/09 14:55:16] ppocr INFO: freeze_params : True
[2024/05/09 14:55:16] ppocr INFO: model_type : det
[2024/05/09 14:55:16] ppocr INFO: return_all_feats : False
[2024/05/09 14:55:16] ppocr INFO: algorithm : Distillation
[2024/05/09 14:55:16] ppocr INFO: model_type : det
[2024/05/09 14:55:16] ppocr INFO: name : DistillationModel
[2024/05/09 14:55:16] ppocr INFO: Eval :
[2024/05/09 14:55:16] ppocr INFO: dataset :
[2024/05/09 14:55:16] ppocr INFO: data_dir : ./train_data/
[2024/05/09 14:55:16] ppocr INFO: label_file_list : ['./train_data/det/val.txt']
[2024/05/09 14:55:16] ppocr INFO: name : SimpleDataSet
[2024/05/09 14:55:16] ppocr INFO: transforms :
[2024/05/09 14:55:16] ppocr INFO: DecodeImage :
[2024/05/09 14:55:16] ppocr INFO: channel_first : False
[2024/05/09 14:55:16] ppocr INFO: img_mode : BGR
[2024/05/09 14:55:16] ppocr INFO: DetLabelEncode : None
[2024/05/09 14:55:16] ppocr INFO: DetResizeForTest : None
[2024/05/09 14:55:16] ppocr INFO: NormalizeImage :
[2024/05/09 14:55:16] ppocr INFO: mean : [0.485, 0.456, 0.406]
[2024/05/09 14:55:16] ppocr INFO: order : hwc
[2024/05/09 14:55:16] ppocr INFO: scale : 1./255.
[2024/05/09 14:55:16] ppocr INFO: std : [0.229, 0.224, 0.225]
[2024/05/09 14:55:16] ppocr INFO: ToCHWImage : None
[2024/05/09 14:55:16] ppocr INFO: KeepKeys :
[2024/05/09 14:55:16] ppocr INFO: keep_keys : ['image', 'shape', 'polys', 'ignore_tags']
[2024/05/09 14:55:16] ppocr INFO: loader :
[2024/05/09 14:55:16] ppocr INFO: batch_size_per_card : 1
[2024/05/09 14:55:16] ppocr INFO: drop_last : False
[2024/05/09 14:55:16] ppocr INFO: num_workers : 2
[2024/05/09 14:55:16] ppocr INFO: shuffle : False
[2024/05/09 14:55:16] ppocr INFO: Global :
[2024/05/09 14:55:16] ppocr INFO: amp_dtype : bfloat16
[2024/05/09 14:55:16] ppocr INFO: cal_metric_during_train : False
[2024/05/09 14:55:16] ppocr INFO: checkpoints : None
[2024/05/09 14:55:16] ppocr INFO: d2s_train_image_shape : [3, -1, -1]
[2024/05/09 14:55:16] ppocr INFO: debug : False
[2024/05/09 14:55:16] ppocr INFO: distributed : False
[2024/05/09 14:55:16] ppocr INFO: epoch_num : 500
[2024/05/09 14:55:16] ppocr INFO: eval_batch_step : [0, 400]
[2024/05/09 14:55:16] ppocr INFO: infer_img : doc/imgs_en/img_10.jpg
[2024/05/09 14:55:16] ppocr INFO: log_smooth_window : 20
[2024/05/09 14:55:16] ppocr INFO: pretrained_model : ./pretrain_models/ch_PP-OCRv3_det_distill_train/ch_PP-OCRv3_det_distill_train/best_accuracy
[2024/05/09 14:55:16] ppocr INFO: print_batch_step : 10
[2024/05/09 14:55:16] ppocr INFO: save_epoch_step : 100
[2024/05/09 14:55:16] ppocr INFO: save_inference_dir : None
[2024/05/09 14:55:16] ppocr INFO: save_model_dir : ./output/ch_PP-OCR_v3_det/
[2024/05/09 14:55:16] ppocr INFO: save_res_path : ./checkpoints/det_db/predicts_db.txt
[2024/05/09 14:55:16] ppocr INFO: use_gpu : False
[2024/05/09 14:55:16] ppocr INFO: use_visualdl : False
[2024/05/09 14:55:16] ppocr INFO: Loss :
[2024/05/09 14:55:16] ppocr INFO: loss_config_list :
[2024/05/09 14:55:16] ppocr INFO: DistillationDilaDBLoss :
[2024/05/09 14:55:16] ppocr INFO: alpha : 5
[2024/05/09 14:55:16] ppocr INFO: balance_loss : True
[2024/05/09 14:55:16] ppocr INFO: beta : 10
[2024/05/09 14:55:16] ppocr INFO: key : maps
[2024/05/09 14:55:16] ppocr INFO: main_loss_type : DiceLoss
[2024/05/09 14:55:16] ppocr INFO: model_name_pairs : [['Student', 'Teacher'], ['Student2', 'Teacher']]
[2024/05/09 14:55:16] ppocr INFO: ohem_ratio : 3
[2024/05/09 14:55:16] ppocr INFO: weight : 1.0
[2024/05/09 14:55:16] ppocr INFO: DistillationDMLLoss :
[2024/05/09 14:55:16] ppocr INFO: key : maps
[2024/05/09 14:55:16] ppocr INFO: maps_name : thrink_maps
[2024/05/09 14:55:16] ppocr INFO: model_name_pairs : ['Student', 'Student2']
[2024/05/09 14:55:16] ppocr INFO: weight : 1.0
[2024/05/09 14:55:16] ppocr INFO: DistillationDBLoss :
[2024/05/09 14:55:16] ppocr INFO: alpha : 5
[2024/05/09 14:55:16] ppocr INFO: balance_loss : True
[2024/05/09 14:55:16] ppocr INFO: beta : 10
[2024/05/09 14:55:16] ppocr INFO: main_loss_type : DiceLoss
[2024/05/09 14:55:16] ppocr INFO: model_name_list : ['Student', 'Student2']
[2024/05/09 14:55:16] ppocr INFO: ohem_ratio : 3
[2024/05/09 14:55:16] ppocr INFO: weight : 1.0
[2024/05/09 14:55:16] ppocr INFO: name : CombinedLoss
[2024/05/09 14:55:16] ppocr INFO: Metric :
[2024/05/09 14:55:16] ppocr INFO: base_metric_name : DetMetric
[2024/05/09 14:55:16] ppocr INFO: key : Student
[2024/05/09 14:55:16] ppocr INFO: main_indicator : hmean
[2024/05/09 14:55:16] ppocr INFO: name : DistillationMetric
[2024/05/09 14:55:16] ppocr INFO: Optimizer :
[2024/05/09 14:55:16] ppocr INFO: beta1 : 0.9
[2024/05/09 14:55:16] ppocr INFO: beta2 : 0.999
[2024/05/09 14:55:16] ppocr INFO: lr :
[2024/05/09 14:55:16] ppocr INFO: learning_rate : 0.001
[2024/05/09 14:55:16] ppocr INFO: name : Cosine
[2024/05/09 14:55:16] ppocr INFO: warmup_epoch : 2
[2024/05/09 14:55:16] ppocr INFO: name : Adam
[2024/05/09 14:55:16] ppocr INFO: regularizer :
[2024/05/09 14:55:16] ppocr INFO: factor : 5e-05
[2024/05/09 14:55:16] ppocr INFO: name : L2
[2024/05/09 14:55:16] ppocr INFO: PostProcess :
[2024/05/09 14:55:16] ppocr INFO: box_thresh : 0.6
[2024/05/09 14:55:16] ppocr INFO: key : head_out
[2024/05/09 14:55:16] ppocr INFO: max_candidates : 1000
[2024/05/09 14:55:16] ppocr INFO: model_name : ['Student']
[2024/05/09 14:55:16] ppocr INFO: name : DistillationDBPostProcess
[2024/05/09 14:55:16] ppocr INFO: thresh : 0.3
[2024/05/09 14:55:16] ppocr INFO: unclip_ratio : 1.5
[2024/05/09 14:55:16] ppocr INFO: Train :
[2024/05/09 14:55:16] ppocr INFO: dataset :
[2024/05/09 14:55:16] ppocr INFO: data_dir : ./train_data/
[2024/05/09 14:55:16] ppocr INFO: label_file_list : ['./train_data/det/train.txt']
[2024/05/09 14:55:16] ppocr INFO: name : SimpleDataSet
[2024/05/09 14:55:16] ppocr INFO: ratio_list : [1.0]
[2024/05/09 14:55:16] ppocr INFO: transforms :
[2024/05/09 14:55:16] ppocr INFO: DecodeImage :
[2024/05/09 14:55:16] ppocr INFO: channel_first : False
[2024/05/09 14:55:16] ppocr INFO: img_mode : BGR
[2024/05/09 14:55:16] ppocr INFO: DetLabelEncode : None
[2024/05/09 14:55:16] ppocr INFO: CopyPaste : None
[2024/05/09 14:55:16] ppocr INFO: IaaAugment :
[2024/05/09 14:55:16] ppocr INFO: augmenter_args :
[2024/05/09 14:55:16] ppocr INFO: args :
[2024/05/09 14:55:16] ppocr INFO: p : 0.5
[2024/05/09 14:55:16] ppocr INFO: type : Fliplr
[2024/05/09 14:55:16] ppocr INFO: args :
[2024/05/09 14:55:16] ppocr INFO: rotate : [-10, 10]
[2024/05/09 14:55:16] ppocr INFO: type : Affine
[2024/05/09 14:55:16] ppocr INFO: args :
[2024/05/09 14:55:16] ppocr INFO: size : [0.5, 3]
[2024/05/09 14:55:16] ppocr INFO: type : Resize
[2024/05/09 14:55:16] ppocr INFO: EastRandomCropData :
[2024/05/09 14:55:16] ppocr INFO: keep_ratio : True
[2024/05/09 14:55:16] ppocr INFO: max_tries : 50
[2024/05/09 14:55:16] ppocr INFO: size : [960, 960]
[2024/05/09 14:55:16] ppocr INFO: MakeBorderMap :
[2024/05/09 14:55:16] ppocr INFO: shrink_ratio : 0.4
[2024/05/09 14:55:16] ppocr INFO: thresh_max : 0.7
[2024/05/09 14:55:16] ppocr INFO: thresh_min : 0.3
[2024/05/09 14:55:16] ppocr INFO: MakeShrinkMap :
[2024/05/09 14:55:16] ppocr INFO: min_text_size : 8
[2024/05/09 14:55:16] ppocr INFO: shrink_ratio : 0.4
[2024/05/09 14:55:16] ppocr INFO: NormalizeImage :
[2024/05/09 14:55:16] ppocr INFO: mean : [0.485, 0.456, 0.406]
[2024/05/09 14:55:16] ppocr INFO: order : hwc
[2024/05/09 14:55:16] ppocr INFO: scale : 1./255.
[2024/05/09 14:55:16] ppocr INFO: std : [0.229, 0.224, 0.225]
[2024/05/09 14:55:16] ppocr INFO: ToCHWImage : None
[2024/05/09 14:55:16] ppocr INFO: KeepKeys :
[2024/05/09 14:55:16] ppocr INFO: keep_keys : ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask']
[2024/05/09 14:55:16] ppocr INFO: loader :
[2024/05/09 14:55:16] ppocr INFO: batch_size_per_card : 8
[2024/05/09 14:55:16] ppocr INFO: drop_last : False
[2024/05/09 14:55:16] ppocr INFO: num_workers : 4
[2024/05/09 14:55:16] ppocr INFO: shuffle : True
[2024/05/09 14:55:16] ppocr INFO: profiler_options : None
[2024/05/09 14:55:16] ppocr INFO: train with paddle 2.6.1 and device Place(cpu)
[2024/05/09 14:55:16] ppocr INFO: Initialize indexs of datasets:['./train_data/det/train.txt']
[2024/05/09 14:55:16] ppocr INFO: Initialize indexs of datasets:['./train_data/det/val.txt']
[2024/05/09 14:55:18] ppocr INFO: train dataloader has 3 iters
[2024/05/09 14:55:18] ppocr INFO: valid dataloader has 6 iters
[2024/05/09 14:55:18] ppocr INFO: load pretrain successful from ./pretrain_models/ch_PP-OCRv3_det_distill_train/ch_PP-OCRv3_det_distill_train/best_accuracy
[2024/05/09 14:55:18] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 400 iterations

UserWangZz commented 5 months ago

Hello, are you using the CPU version or the GPU version of paddle?

Lijian500 commented 4 months ago

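"Hello, are you using the CPU version or the GPU version of paddle?"

Hello, it is the CPU version.

Later I force-closed the training program, but when I checked the output directory I found that the corresponding model files had been generated. Perhaps my dataset was so small at the time (30 images) that training finished almost instantly, and that is why no logs were produced?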

UserWangZz commented 4 months ago

With batch size = 8 and 30 images, one epoch is 4 iterations. log_smooth_window is 20, so a log line should only be printed after 5 epochs. The missing output may be related to this.
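
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not PaddleOCR code: the helper names `iters_per_epoch` and `epochs_until_step` are hypothetical, and whether `log_smooth_window` is the setting that delays the first log line is the assumption made in the comment above rather than something the log confirms. The posted log reports "train dataloader has 3 iters", so the 3-iteration case is shown as well.

```python
import math


def iters_per_epoch(num_images: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches in one pass over the dataset."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_images // batch_size
    # An incomplete final batch still counts as one iteration.
    return math.ceil(num_images / batch_size)


def epochs_until_step(target_step: int, steps_per_epoch: int) -> int:
    """Number of full epochs needed before the global step reaches target_step."""
    return math.ceil(target_step / steps_per_epoch)


# Figures quoted in the thread: 30 images, batch_size_per_card = 8, drop_last = False.
steps = iters_per_epoch(30, 8)
print(steps)                            # iterations per epoch under that assumption
print(epochs_until_step(20, steps))     # epochs to reach 20 iterations

# Figure reported by the log itself: 3 iterations per epoch.
print(epochs_until_step(20, 3))         # epochs to reach 20 iterations
print(epochs_until_step(400, 3))        # epochs before the first evaluation at step 400
```

On a CPU-only run with 960x960 crops and a distillation model, each of those epochs can take a long time, so a quiet console does not necessarily mean the process is stuck.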

PaddlePaddle / PaddleOCR #12083