PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

Error during text detection training #1180

Closed simplew2011 closed 3 years ago

simplew2011 commented 3 years ago

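With the parameters below, the error still occurs during the validation phase and the training process is terminated automatically. 'num_workers': 1 'test_batch_size_per_card': 16 paddle.reader.multiprocess_reader(readers, False, queue_size=200)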

train dataset split (with 0.1) in train_num: 90000, val_num: 10000 2020-11-16 14:31:46,544-INFO: {'Global': {'algorithm': 'DB', 'use_gpu': True, 'epoch_num': 100, 'log_smooth_window': 20, 'print_batch_step': 10, 'save_model_dir': './outputs/det', 'save_epoch_step': 1000, 'eval_batch_step': [1000, 5000], 'train_batch_size_per_card': 16, 'test_batch_size_per_card': 16, 'image_shape': [3, 640, 640], 'reader_yml': './configs/det/det_db_icdar15_reader.yml', 'pretrain_weights': './weights/det/ch_ppocr_server_v1.1_det_train/best_accuracy', 'checkpoints': None, 'save_res_path': './outputs/det_db/predicts_db.txt', 'save_inference_dir': None, 'character_type': 'ch'}, 'Architecture': {'function': 'ppocr.modeling.architectures.det_model,DetModel'}, 'Backbone': {'function': 'ppocr.modeling.backbones.det_mobilenet_v3,MobileNetV3', 'scale': 0.5, 'model_name': 'large', 'disable_se': True}, 'Head': {'function': 'ppocr.modeling.heads.det_db_head,DBHead', 'model_name': 'large', 'k': 50, 'inner_channels': 96, 'out_channels': 2}, 'Loss': {'function': 'ppocr.modeling.losses.det_db_loss,DBLoss', 'balance_loss': True, 'main_loss_type': 'DiceLoss', 'alpha': 5, 'beta': 10, 'ohem_ratio': 3}, 'Optimizer': {'function': 'ppocr.optimizer,AdamDecay', 'base_lr': 0.001, 'beta1': 0.9, 'beta2': 0.999, 'decay': {'function': 'cosine_decay_warmup', 'step_each_epoch': 16, 'total_epoch': 1200}}, 'PostProcess': {'function': 'ppocr.postprocess.db_postprocess,DBPostProcess', 'thresh': 0.3, 'box_thresh': 0.6, 'max_candidates': 1000, 'unclip_ratio': 1.5}, 'TrainReader': {'reader_function': 'ppocr.data.det.dataset_traversal,TrainReader', 'process_function': 'ppocr.data.det.db_process,DBProcessTrain', 'num_workers': 1, 'img_set_dir': '/media/simplew/7eacdff0-e8e3-4de7-accb-39586bc4c9a7/ocr_dataset/plate/CCPD2019/ccpd_base/', 'label_file_path': '/media/simplew/7eacdff0-e8e3-4de7-accb-39586bc4c9a7/ocr_dataset/plate/CCPD2019/det_gt_train.txt'}, 'EvalReader': {'reader_function': 
'ppocr.data.det.dataset_traversal,EvalTestReader', 'process_function': 'ppocr.data.det.db_process,DBProcessTest', 'img_set_dir': '/media/simplew/7eacdff0-e8e3-4de7-accb-39586bc4c9a7/ocr_dataset/plate/CCPD2019/ccpd_base/', 'label_file_path': '/media/simplew/7eacdff0-e8e3-4de7-accb-39586bc4c9a7/ocr_dataset/plate/CCPD2019/det_gt_val.txt', 'test_image_shape': [736, 1280]}, 'TestReader': {'reader_function': 'ppocr.data.det.dataset_traversal,EvalTestReader', 'process_function': 'ppocr.data.det.db_process,DBProcessTest', 'infer_img': None, 'img_set_dir': './train_data/icdar2015/text_localization/', 'label_file_path': './train_data/icdar2015/text_localization/test_icdar2015_label.txt', 'do_eval': True}}

2020-11-16 16:21:20,357-INFO: epoch: 1, iter: 5990, 'lr': 0.000843, 'total_loss': 0.417855, 'loss_shrink_maps': 0.182974, 'loss_threshold_maps': 0.199652, 'loss_binary_maps': 0.036522, time: 0.935
2020-11-16 16:21:29,055-INFO: epoch: 1, iter: 6000, 'lr': 0.000842, 'total_loss': 0.426692, 'loss_shrink_maps': 0.188256, 'loss_threshold_maps': 0.199239, 'loss_binary_maps': 0.036843, time: 0.869
2020-11-16 16:21:29,700-INFO: test tackling num:16
2020-11-16 16:21:31,622-INFO: test tackling num:32
...
2020-11-16 16:22:32,935-INFO: test tackling num:688
2020-11-16 16:22:35,169-INFO: test tackling num:704

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
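For reference, exit code 137 follows the common shell convention of 128 plus the signal number, so 137 - 128 = 9, which is SIGKILL. When a process is killed this way with no Python traceback, the kernel's out-of-memory killer is a frequent (though not the only) cause. A minimal sketch of that arithmetic, assuming a POSIX system; `describe_exit_code` is a hypothetical helper name and not part of PaddleOCR:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a shell-style exit code to a signal name when it is above 128."""
    if code > 128:
        try:
            # Shells report "killed by signal N" as exit status 128 + N.
            return signal.Signals(code - 128).name
        except ValueError:
            return f"unknown signal {code - 128}"
    return f"normal exit with status {code}"


print(describe_exit_code(137))
```

On Linux, `dmesg` output after the crash usually shows whether the out-of-memory killer was involved.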

BeyondYourself commented 3 years ago

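Judging by the termination message "interrupted by signal 9: SIGKILL", this usually means the machine ran out of memory or CPU/GPU resources. Monitor these metrics while training.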
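One way to watch host memory alongside training is to sample `/proc/meminfo` periodically. The sketch below assumes a Linux host; the function names are hypothetical helpers, not PaddleOCR APIs, and GPU memory would need a separate tool such as `nvidia-smi`:

```python
import os


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_fraction(info: dict) -> float:
    """Fraction of total RAM that the kernel reports as available."""
    return info["MemAvailable"] / info["MemTotal"]


def log_host_memory(path: str = "/proc/meminfo") -> None:
    """Print the share of host RAM that is still available."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    print(f"host RAM available: {available_fraction(info):.1%}")


# Only attempt a reading where /proc/meminfo exists (Linux).
if os.path.exists("/proc/meminfo"):
    log_host_memory()
```

Calling `log_host_memory()` every few hundred iterations, and again at the start of evaluation, would show whether available memory drops sharply just before the process is killed.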

LDOUBLEV commented 3 years ago

Try setting 'test_batch_size_per_card' to 1.

simplew2011 commented 3 years ago

Try setting 'test_batch_size_per_card' to 1.

With 'train_batch_size_per_card': 16 and 'test_batch_size_per_card': 1, the same error still occurs during the validation phase.

simplew2011 commented 3 years ago

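Judging by the termination message "interrupted by signal 9: SIGKILL", this usually means the machine ran out of memory or CPU/GPU resources. Monitor these metrics while training.

How can I fix this? The training dataset has 100K samples.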

littletomatodonkey commented 3 years ago

You can refer to this FAQ.

image

simplew2011 commented 3 years ago

You can refer to this FAQ.

image

'num_workers': 1 'test_batch_size_per_card': 1 paddle.reader.multiprocess_reader(readers, False, queue_size=200) # also tried 32/64/128; the process is still killed

simplew2011 commented 3 years ago

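Reducing train_batch_size_per_card to 8 lets the validation phase pass without the error. This is still a problem, though: a smaller batch_size slows training down, so it would be good to see whether this can be fixed properly.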
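As a rough back-of-envelope illustration of how the training batch size scales host memory, the sketch below estimates the upper bound held by a full reader queue. It rests on assumptions not confirmed in this thread: each queued item is one complete training batch, each image is a float32 array of 'image_shape' [3, 640, 640], and label maps are ignored. It is not a diagnosis of the actual cause, and the helper names are hypothetical:

```python
def image_bytes(channels: int, height: int, width: int, bytes_per_value: int = 4) -> int:
    """Size of one dense image tensor in bytes (float32 by default)."""
    return channels * height * width * bytes_per_value


def queue_bytes(queue_size: int, batch_size: int, per_image: int) -> int:
    """Upper bound on bytes held when every queue slot contains one batch."""
    return queue_size * batch_size * per_image


per_image = image_bytes(3, 640, 640)  # 'image_shape' from the logged config
for batch_size in (16, 8):
    total = queue_bytes(200, batch_size, per_image)  # queue_size=200 as above
    print(f"train batch {batch_size}: up to {total / 2**30:.1f} GiB queued")
```

Under these assumptions the bound halves when the batch size goes from 16 to 8, which is at least consistent with the observation above, although smaller queue sizes alone did not prevent the kill.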

LDOUBLEV commented 3 years ago

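Reducing train_batch_size_per_card to 8 lets the validation phase pass without the error. This is still a problem, though: a smaller batch_size slows training down, so it would be good to see whether this can be fixed properly.

Please wait for the dynamic-graph version of PaddleOCR, which is coming soon. It should resolve this problem.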

simplew2011 commented 3 years ago

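Reducing train_batch_size_per_card to 8 lets the validation phase pass without the error. This is still a problem, though: a smaller batch_size slows training down, so it would be good to see whether this can be fixed properly.

Please wait for the dynamic-graph version of PaddleOCR, which is coming soon. It should resolve this problem.

Roughly when will it be released?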