PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkit based on PaddlePaddle (a practical ultra-lightweight OCR system that supports recognition for 80+ languages, provides data annotation and synthesis tools, and supports training and deployment across server, mobile, embedded, and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

Why isn't PGNet training succeeding? Can someone help take a look? The details are all written below. #6385

Closed · qqqqq127 closed this issue 1 year ago

qqqqq127 commented 2 years ago

```
[2022/05/23 17:01:47] ppocr INFO: save best model is to ./output/pgnet_r50_vd_totaltext/best_accuracy
[2022/05/23 17:01:47] ppocr INFO: best metric, f_score_e2e: 0, total_num_gt: 2798, total_num_det: 0, global_accumulative_recall: 0, hit_str_count: 0, recall: 0.0, precision: 0, f_score: 0, seqerr: 1, recall_e2e: 0.0, precision_e2e: 0, fps: 8.604433279775579, best_epoch: 21
[2022/05/23 17:01:50] ppocr ERROR: When parsing line 576, error happened with msg: list index out of range
[2022/05/23 17:01:57] ppocr INFO: epoch: [21/200], global_step: 2010, lr: 0.001000, loss: 1.278212, score_loss: 0.999998, border_loss: 0.140085, direction_loss: 0.111812, ctc_loss: 0.000000, avg_reader_cost: 0.53334 s, avg_batch_cost: 0.98982 s, avg_samples: 16.0, ips: 16.16456 samples/s, eta: 8:47:38
```

_At evaluation time, all the metrics are 0._

```
eval model::   0%| | 0/246 [00:00<?, ?it/s]Traceback (most recent call last):
  File "tools/train.py", line 188, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 161, in main
    program.train(config, train_dataloader, valid_dataloader, device, model,
  File "/data/ocr/PaddleOCR-release-2.5/tools/program.py", line 339, in train
    cur_metric = eval(
  File "/data/ocr/PaddleOCR-release-2.5/tools/program.py", line 465, in eval
    post_result = post_process_class(preds, batch_numpy[1])
  File "/data/ocr/PaddleOCR-release-2.5/ppocr/postprocess/pg_postprocess.py", line 49, in __call__
    data = post.pg_postprocess_fast()
  File "/data/ocr/PaddleOCR-release-2.5/ppocr/utils/e2e_utils/pgnet_pp_utils.py", line 56, in pg_postprocess_fast
    instance_yxs_list, seq_strs = generate_pivot_list_fast(
  File "/data/ocr/PaddleOCR-release-2.5/ppocr/utils/e2e_utils/extract_textpoint_fast.py", line 381, in generate_pivot_list_fast
    pos_list_sorted = sort_and_expand_with_direction_v2(
  File "/data/ocr/PaddleOCR-release-2.5/ppocr/utils/e2e_utils/extract_textpoint_fast.py", line 241, in sort_and_expand_with_direction_v2
    int((left_average_len + right_average_len) / 2.0 * 0.15), 1)
OverflowError: cannot convert float infinity to integer
eval model::   0%| | 0/246 [00:01<?, ?it/s]terminate called without an active exception

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
No stack trace in paddle, may be caused by external reasons.

----------------------
Error Message Summary:
----------------------
FatalError: Process abort signal is detected by the operating system.
  [TimeInfo: Aborted at 1653299690 (unix time) try "date -d @1653299690" if you are using GNU date ]
  [SignalInfo: SIGABRT (@0xb806) received by PID 47110 (TID 0x7fb2e1008700) from PID 47110 ]
```
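The crash itself is plain Python behavior: `int()` raises `OverflowError` when handed an infinite float, so one of the two average lengths must have gone to infinity (typically via a float division by zero upstream). Below is a minimal sketch of the failing expression with a guard added; the guard and the fallback value are illustrative additions, not PaddleOCR code:

```python
import math

def expand_len(left_average_len, right_average_len):
    """Sketch of the expression from extract_textpoint_fast.py line 241,
    with a finiteness guard added (my addition, not PaddleOCR code):
    int() on an infinite float raises OverflowError."""
    avg = (left_average_len + right_average_len) / 2.0 * 0.15
    if not math.isfinite(avg):
        return 1  # fall back to the minimum expansion length
    return max(int(avg), 1)

print(expand_len(10.0, 12.0))          # normal case -> 1
print(expand_len(float("inf"), 12.0))  # would have crashed; now returns 1
```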

**The second evaluation also threw an error, and ctc_loss is always 0.**

**The dataset was annotated by me; the labels look like this:**

SXB_train/img_001.jpg [{"transcription": "料:100%聚酯纤维", "points": [[157, 618], [498, 583], [502, 623], [161, 658]], "difficult": false}, {"transcription": "填充料:100%聚酯纤维", "points": [[165, 665], [499, 633], [503, 674], [169, 705]], "difficult": false}, {"transcription": "填充量:1000g", "points": [[171, 715], [398, 698], [401, 735], [174, 753]], "difficult": false}, {"transcription": "执行标准:GB/T22796-2009", "points": [[175, 769], [549, 725], [554, 762], [179, 805]], "difficult": false}, {"transcription": "安全类别:GB18401-2010C类", "points": [[177, 820], [579, 769], [585, 809], [182, 860]], "difficult": false}, {"transcription": "品名:被芯", "points": [[153, 506], [366, 491], [371, 535], [155, 555]], "difficult": false}, {"transcription": "规格:150x200cm", "points": [[154, 567], [466, 538], [471, 576], [159, 607]], "difficult": false}]
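For reference, each line in this format is an image path and a JSON list of boxes, tab-separated (PPOCRLabel's output format). A small validation sketch can surface malformed lines like the one behind the `When parsing line 576 ... list index out of range` error in the log above; the file name and the specific checks are illustrative:

```python
import json

def check_label_file(path):
    """Scan a PPOCRLabel-style file: image path and JSON box list, tab-separated."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            try:
                img_path, anno = line.rstrip("\n").split("\t", 1)
                boxes = json.loads(anno)
            except ValueError as err:  # missing tab or broken JSON
                print(f"line {lineno}: unparsable ({err})")
                continue
            for box in boxes:
                # Flag boxes with fewer than 4 points or empty transcriptions.
                if len(box.get("points", [])) < 4 or not box.get("transcription"):
                    print(f"line {lineno} ({img_path}): suspicious box {box}")

check_label_file("./train_data/train.txt")
```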

The config file is as follows:

```yaml
Global:
  use_gpu: True
  epoch_num: 200
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/pgnet_r50_vd_totaltext/
  save_epoch_step: 500
  # evaluation is run every 2000 iterations after the 0th iteration
  eval_batch_step: [ 0, 2000 ]
  cal_metric_during_train: False
  pretrained_model:
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img:
  valid_set: partvgg # two modes: totaltext validates curved words, partvgg validates non-curved words
  save_res_path: ./output/pgnet_r50_vd_totaltext/predicts_pgnet.txt
  character_dict_path: ppocr/utils/ppocr_keys_v1.txt
  character_type: CH
  max_text_length: 50 # the max length in seq
  max_text_nums: 30 # the max seq nums in a pic
  tcl_len: 64

Architecture:
  model_type: e2e
  algorithm: PGNet
  Transform:
  Backbone:
    name: ResNet
    layers: 50
  Neck:
    name: PGFPN
  Head:
    name: PGHead

Loss:
  name: PGLoss
  tcl_bs: 64
  max_text_length: 50 # the same as Global: max_text_length
  max_text_nums: 30 # the same as Global: max_text_nums
  pad_num: 36 # the length of dict for pad

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    learning_rate: 0.001
  regularizer:
    name: 'L2'
    factor: 0

Train:
  dataset:
    name: PGDataSet
    data_dir: ./train_data/
    label_file_list: [./train_data/train.txt]
    ratio_list: [1.0]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - E2ELabelEncodeTrain:
      - PGProcessTrain:
          batch_size: 14 # same as loader: batch_size_per_card
          min_crop_size: 24
          min_text_size: 4
          max_text_size: 512
      - KeepKeys:
          keep_keys: [ 'images', 'tcl_maps', 'tcl_label_maps', 'border_maps', 'direction_maps', 'training_masks', 'label_list', 'pos_list', 'pos_mask' ] # dataloader will return list in this order
  loader:
    shuffle: True
    drop_last: True
    batch_size_per_card: 16
    num_workers: 0

Eval:
  dataset:
    name: PGDataSet
    data_dir: ./train_data/
    label_file_list: [./train_data/test.txt]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - E2ELabelEncodeTest:
      - E2EResizeForTest:
          max_side_len: 768
      - NormalizeImage:
          scale: 1./255.
          mean: [ 0.485, 0.456, 0.406 ]
          std: [ 0.229, 0.224, 0.225 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'image', 'shape', 'polys', 'texts', 'ignore_tags', 'img_id' ]
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 1 # must be 1
    num_workers: 0
```
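One detail worth double-checking in this config: the comment on `pad_num: 36` says it is "the length of dict for pad", yet `character_dict_path` points to the large Chinese dictionary `ppocr/utils/ppocr_keys_v1.txt`. A quick, purely diagnostic sketch to compare the two (this is my own check, not PaddleOCR API):

```python
# Count the entries in the character dictionary and compare with pad_num.
dict_path = "ppocr/utils/ppocr_keys_v1.txt"
with open(dict_path, encoding="utf-8") as f:
    num_chars = sum(1 for _ in f)

pad_num = 36  # value from the Loss section of the config above
print(f"dict size: {num_chars}, pad_num: {pad_num}")
if num_chars != pad_num:
    print("pad_num does not match the dictionary size; "
          "check the PGNet docs for the expected pairing.")
```

If the two numbers disagree, the recognition labels may be encoded against the wrong padding index, which is one plausible (hedged) explanation for a ctc_loss that never moves off 0.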

LDOUBLEV commented 2 years ago

> ```
>     int((left_average_len + right_average_len) / 2.0 * 0.15), 1)
> OverflowError: cannot convert float infinity to integer
> ```

Check your data at this point:

https://github.com/PaddlePaddle/PaddleOCR/blob/d8a8ca81e1b17a9bf04618baa42c56c7c86931d3/ppocr/utils/e2e_utils/extract_textpoint_fast.py#L240

and see whether infinity values are showing up.
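A lightweight way to follow that advice is a debug helper that flags non-finite averages before the `int()` conversion; a sketch, with the two variable names taken from the traceback above (the helper itself is hypothetical, not PaddleOCR code):

```python
import math

def check_average_lengths(left_average_len, right_average_len):
    """Debug helper: flag the non-finite averages that would crash the
    int() conversion in sort_and_expand_with_direction_v2."""
    if not (math.isfinite(left_average_len) and math.isfinite(right_average_len)):
        # inf/nan here suggests an upstream float division by zero,
        # e.g. from an empty or degenerate point list.
        print("non-finite average lengths:", left_average_len, right_average_len)
        return False
    return True

assert check_average_lengths(3.2, 4.1)
assert not check_average_lengths(float("inf"), 4.1)
```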

qqqqq127 commented 2 years ago

> ```
>     int((left_average_len + right_average_len) / 2.0 * 0.15), 1)
> OverflowError: cannot convert float infinity to integer
> ```
>
> Check your data at this point:
>
> https://github.com/PaddlePaddle/PaddleOCR/blob/d8a8ca81e1b17a9bf04618baa42c56c7c86931d3/ppocr/utils/e2e_utils/extract_textpoint_fast.py#L240
>
> and see whether infinity values are showing up.

```
[2022/05/25 15:50:39] ppocr INFO: best metric, f_score_e2e: 0, total_num_gt: 2798, total_num_det: 2244, global_accumulative_recall: 1277.3999999999976, hit_str_count: 0, recall: 0.45654038598999197, precision: 0.5870766488413544, f_score: 0.5136447392525687, seqerr: 1.0, recall_e2e: 0.0, precision_e2e: 0.0, fps: 8.88808927959238, best_epoch: 84
[2022/05/25 15:50:43] ppocr INFO: epoch: [84/200], global_step: 2010, lr: 0.001000, loss: 0.183176, score_loss: 0.125891, border_loss: 0.032234, direction_loss: 0.028453, ctc_loss: 0.000000, avg_reader_cost: 0.00045 s, avg_batch_cost: 0.40427 s, avg_samples: 4.0, ips: 9.89444 samples/s, eta: 0:22:12
```

Why are all the e2e metrics 0? ctc_loss is also 0 during training.
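Given that the detection-side metrics are now non-zero while everything recognition-related (hit_str_count, recall_e2e, ctc_loss) stays at 0, one thing worth ruling out is transcription characters missing from `character_dict_path`, since unencodable characters can leave the encoded text labels effectively empty. A diagnostic sketch, with file paths taken from the config above:

```python
import json

# Character dictionary referenced by character_dict_path in the config.
with open("ppocr/utils/ppocr_keys_v1.txt", encoding="utf-8") as f:
    charset = {line.rstrip("\n") for line in f}

# Collect every character used in the training transcriptions
# that is absent from the dictionary.
missing = set()
with open("./train_data/train.txt", encoding="utf-8") as f:
    for line in f:
        _, anno = line.rstrip("\n").split("\t", 1)
        for box in json.loads(anno):
            missing |= {c for c in box.get("transcription", "") if c not in charset}

print("characters missing from the dict:", sorted(missing) or "none")
```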

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.