Training on a custom rec dataset fails with IndexError: list index out of range

alanxinn commented 1 year ago

Please provide the following information to quickly locate the problem

System Environment: Windows 10 Pro
Version: Paddle: paddlepaddle-gpu 2.4.2, PaddleOCR: 2.7
Related components:
Command Code: python tools\train.py -c datasets\en_PP-OCRv4_rec_test.yml
Complete Error Message:
[2023/09/11 19:49:04] ppocr INFO: Architecture :
[2023/09/11 19:49:04] ppocr INFO: Backbone :
[2023/09/11 19:49:04] ppocr INFO: name : PPLCNetV3
[2023/09/11 19:49:04] ppocr INFO: scale : 0.95
[2023/09/11 19:49:04] ppocr INFO: Head :
[2023/09/11 19:49:04] ppocr INFO: head_list :
[2023/09/11 19:49:04] ppocr INFO: CTCHead :
[2023/09/11 19:49:04] ppocr INFO: Head :
[2023/09/11 19:49:04] ppocr INFO: fc_decay : 1e-05
[2023/09/11 19:49:04] ppocr INFO: Neck :
[2023/09/11 19:49:04] ppocr INFO: depth : 2
[2023/09/11 19:49:04] ppocr INFO: dims : 120
[2023/09/11 19:49:04] ppocr INFO: hidden_dims : 120
[2023/09/11 19:49:04] ppocr INFO: kernel_size : [1, 3]
[2023/09/11 19:49:04] ppocr INFO: name : svtr
[2023/09/11 19:49:04] ppocr INFO: use_guide : True
[2023/09/11 19:49:04] ppocr INFO: NRTRHead :
[2023/09/11 19:49:04] ppocr INFO: max_text_length : 25
[2023/09/11 19:49:04] ppocr INFO: nrtr_dim : 384
[2023/09/11 19:49:04] ppocr INFO: name : MultiHead
[2023/09/11 19:49:04] ppocr INFO: Transform : None
[2023/09/11 19:49:04] ppocr INFO: algorithm : SVTR_LCNet
[2023/09/11 19:49:04] ppocr INFO: model_type : rec
[2023/09/11 19:49:04] ppocr INFO: Eval :
[2023/09/11 19:49:04] ppocr INFO: dataset :
[2023/09/11 19:49:04] ppocr INFO: data_dir : datasets\
[2023/09/11 19:49:04] ppocr INFO: label_file_list : ['datasets\rec_gt_test_change.txt']
[2023/09/11 19:49:04] ppocr INFO: name : SimpleDataSet
[2023/09/11 19:49:04] ppocr INFO: transforms :
[2023/09/11 19:49:04] ppocr INFO: DecodeImage :
[2023/09/11 19:49:04] ppocr INFO: channel_first : False
[2023/09/11 19:49:04] ppocr INFO: img_mode : BGR
[2023/09/11 19:49:04] ppocr INFO: MultiLabelEncode :
[2023/09/11 19:49:04] ppocr INFO: gtc_encode : NRTRLabelEncode
[2023/09/11 19:49:04] ppocr INFO: RecResizeImg :
[2023/09/11 19:49:04] ppocr INFO: image_shape : [3, 48, 320]
[2023/09/11 19:49:04] ppocr INFO: KeepKeys :
[2023/09/11 19:49:04] ppocr INFO: keep_keys : ['image', 'label_ctc', 'label_gtc', 'length', 'valid_ratio']
[2023/09/11 19:49:04] ppocr INFO: loader :
[2023/09/11 19:49:04] ppocr INFO: batch_size_per_card : 64
[2023/09/11 19:49:04] ppocr INFO: drop_last : False
[2023/09/11 19:49:04] ppocr INFO: num_workers : 8
[2023/09/11 19:49:04] ppocr INFO: shuffle : False
[2023/09/11 19:49:04] ppocr INFO: Global :
[2023/09/11 19:49:04] ppocr INFO: cal_metric_during_train : True
[2023/09/11 19:49:04] ppocr INFO: character_dict_path : ppocr\utils\en_dict.txt
[2023/09/11 19:49:04] ppocr INFO: checkpoints : None
[2023/09/11 19:49:04] ppocr INFO: debug : False
[2023/09/11 19:49:04] ppocr INFO: distributed : False
[2023/09/11 19:49:04] ppocr INFO: epoch_num : 100
[2023/09/11 19:49:04] ppocr INFO: eval_batch_step : [0, 1500]
[2023/09/11 19:49:04] ppocr INFO: infer_img : doc/imgs_words/ch/word_1.jpg
[2023/09/11 19:49:04] ppocr INFO: infer_mode : False
[2023/09/11 19:49:04] ppocr INFO: log_smooth_window : 20
[2023/09/11 19:49:04] ppocr INFO: max_text_length : 25
[2023/09/11 19:49:04] ppocr INFO: pretrained_model : datasets\en_PP-OCRv4_rec_train\best_accuracy
[2023/09/11 19:49:04] ppocr INFO: print_batch_step : 10
[2023/09/11 19:49:04] ppocr INFO: save_epoch_step : 5
[2023/09/11 19:49:04] ppocr INFO: save_inference_dir : None
[2023/09/11 19:49:04] ppocr INFO: save_model_dir : ./output/rec_ppocr_v4_zifu_en_epoch100
[2023/09/11 19:49:04] ppocr INFO: save_res_path : ./output/rec/predicts_ppocrv4.txt
[2023/09/11 19:49:04] ppocr INFO: use_gpu : True
[2023/09/11 19:49:04] ppocr INFO: use_space_char : True
[2023/09/11 19:49:04] ppocr INFO: use_visualdl : False
[2023/09/11 19:49:04] ppocr INFO: Loss :
[2023/09/11 19:49:04] ppocr INFO: loss_config_list :
[2023/09/11 19:49:04] ppocr INFO: CTCLoss : None
[2023/09/11 19:49:04] ppocr INFO: NRTRLoss : None
[2023/09/11 19:49:04] ppocr INFO: name : MultiLoss
[2023/09/11 19:49:04] ppocr INFO: Metric :
[2023/09/11 19:49:04] ppocr INFO: ignore_space : False
[2023/09/11 19:49:04] ppocr INFO: main_indicator : acc
[2023/09/11 19:49:04] ppocr INFO: name : RecMetric
[2023/09/11 19:49:04] ppocr INFO: Optimizer :
[2023/09/11 19:49:04] ppocr INFO: beta1 : 0.9
[2023/09/11 19:49:04] ppocr INFO: beta2 : 0.999
[2023/09/11 19:49:04] ppocr INFO: lr :
[2023/09/11 19:49:04] ppocr INFO: learning_rate : 0.0005
[2023/09/11 19:49:04] ppocr INFO: name : Cosine
[2023/09/11 19:49:04] ppocr INFO: warmup_epoch : 5
[2023/09/11 19:49:04] ppocr INFO: name : Adam
[2023/09/11 19:49:04] ppocr INFO: regularizer :
[2023/09/11 19:49:04] ppocr INFO: factor : 3e-05
[2023/09/11 19:49:04] ppocr INFO: name : L2
[2023/09/11 19:49:04] ppocr INFO: PostProcess :
[2023/09/11 19:49:04] ppocr INFO: name : CTCLabelDecode
[2023/09/11 19:49:04] ppocr INFO: Train :
[2023/09/11 19:49:04] ppocr INFO: dataset :
[2023/09/11 19:49:04] ppocr INFO: data_dir : datasets\
[2023/09/11 19:49:04] ppocr INFO: ds_width : False
[2023/09/11 19:49:04] ppocr INFO: ext_op_transform_idx : 1
[2023/09/11 19:49:04] ppocr INFO: label_file_list : ['datasets\rec_gt_train_change.txt']
[2023/09/11 19:49:04] ppocr INFO: name : MultiScaleDataSet
[2023/09/11 19:49:04] ppocr INFO: transforms :
[2023/09/11 19:49:04] ppocr INFO: DecodeImage :
[2023/09/11 19:49:04] ppocr INFO: channel_first : False
[2023/09/11 19:49:04] ppocr INFO: img_mode : BGR
[2023/09/11 19:49:04] ppocr INFO: RecConAug :
[2023/09/11 19:49:04] ppocr INFO: ext_data_num : 2
[2023/09/11 19:49:04] ppocr INFO: image_shape : [48, 320, 3]
[2023/09/11 19:49:04] ppocr INFO: max_text_length : 25
[2023/09/11 19:49:04] ppocr INFO: prob : 0.5
[2023/09/11 19:49:04] ppocr INFO: RecAug : None
[2023/09/11 19:49:04] ppocr INFO: MultiLabelEncode :
[2023/09/11 19:49:04] ppocr INFO: gtc_encode : NRTRLabelEncode
[2023/09/11 19:49:04] ppocr INFO: KeepKeys :
[2023/09/11 19:49:04] ppocr INFO: keep_keys : ['image', 'label_ctc', 'label_gtc', 'length', 'valid_ratio']
[2023/09/11 19:49:04] ppocr INFO: loader :
[2023/09/11 19:49:04] ppocr INFO: batch_size_per_card : 64
[2023/09/11 19:49:04] ppocr INFO: drop_last : True
[2023/09/11 19:49:04] ppocr INFO: num_workers : 8
[2023/09/11 19:49:04] ppocr INFO: shuffle : True
[2023/09/11 19:49:04] ppocr INFO: sampler :
[2023/09/11 19:49:04] ppocr INFO: divided_factor : [8, 16]
[2023/09/11 19:49:04] ppocr INFO: first_bs : 96
[2023/09/11 19:49:04] ppocr INFO: fix_bs : False
[2023/09/11 19:49:04] ppocr INFO: is_training : True
[2023/09/11 19:49:04] ppocr INFO: name : MultiScaleSampler
[2023/09/11 19:49:04] ppocr INFO: scales : [[320, 32], [320, 48], [320, 64]]
[2023/09/11 19:49:04] ppocr INFO: profiler_options : None
[2023/09/11 19:49:04] ppocr INFO: train with paddle 2.4.2 and device Place(gpu:0)
[2023/09/11 19:49:04] ppocr INFO: Initialize indexs of datasets:['datasets\rec_gt_train_change.txt']
[2023/09/11 19:49:04] ppocr INFO: Initialize indexs of datasets:['datasets\rec_gt_test_change.txt']
[2023/09/11 19:49:05] ppocr INFO: train dataloader has 69 iters
[2023/09/11 19:49:05] ppocr INFO: valid dataloader has 34 iters
[2023/09/11 19:49:05] ppocr INFO: load pretrain successful from datasets\en_PP-OCRv4_rec_train\best_accuracy
[2023/09/11 19:49:05] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 1500 iterations
[2023/09/12 10:54:30] ppocr ERROR: When parsing line train/word_228.png marina , error happened with msg: Traceback (most recent call last):
  File "E:\desktop\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 252, in __getitem__
    data['ext_data'] = self.get_ext_data()
  File "E:\desktop\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 124, in get_ext_data
    label = substr[1]
IndexError: list index out of range

We provide AceIssueSolver to solve issues, do you want it? (Please write yes/no):

The label file already separates the image path and the text content with \t, but the error still occurs.
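The traceback ends at `label = substr[1]`, which suggests the loader splits each label line on the delimiter and indexes the second field. The following is a minimal sketch of that pattern, not PaddleOCR's actual code: the function name `parse_label_line` and the sample lines are hypothetical. It shows that a single blank or space-separated line is enough to raise this error even when every other line is tab-separated.

```python
def parse_label_line(line, delimiter="\t"):
    """Split one label line into (file_name, label).

    Hypothetical helper that imitates the split-then-index pattern
    visible in the traceback; it is not the library's implementation.
    """
    substr = line.strip("\n").split(delimiter)
    file_name = substr[0]
    label = substr[1]  # IndexError if the line contains no delimiter
    return file_name, label


# A well-formed line parses normally (sample content is made up).
good = parse_label_line("train/word_1.png\thello\n")

# A blank line and a space-separated line both fail the same way.
failures = []
for bad_line in ["\n", "train/word_2.png hello\n"]:
    try:
        parse_label_line(bad_line)
    except IndexError as exc:
        failures.append((bad_line, str(exc)))
```

So a label file can look correct on inspection and still fail if only a few lines are malformed.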

DingHsun commented 11 months ago

I ran into the same error:

data['ext_data'] = self.get_ext_data()
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 124, in get_ext_data
label = substr[1]
IndexError: list index out of range

How did you end up solving it?

alanxinn commented 11 months ago

I've forgotten, but I think it was caused by a problem with the dataset.

DingHsun commented 11 months ago

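@alanxinn I found the fix. After annotating with PPOCRLabel, I used gen_ocr_train_val_test.py to generate train.txt, val.txt, and test.txt. It turned out that these three txt files contained extra line breaks, which caused the read error, as shown here: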

D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_76.jpg  192.168.122.255

D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_90.jpg  overruns

D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_31.jpg  flags=73<UP,LOOPBACK,RUNNING>

After I removed the blank lines, training ran normally. The file now looks like this:

D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_76.jpg  192.168.122.255
D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_90.jpg  overruns
D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_31.jpg  flags=73<UP,LOOPBACK,RUNNING>

Oddly enough, only train.txt raised the error, and it is the only file I modified. The other two were left unchanged, yet training runs fine. I don't know why.
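The manual fix described above (deleting the blank lines) can be scripted. This is a minimal sketch under the assumption that the label file is UTF-8 text with one entry per line. The function name `drop_blank_lines` and the demo file content are hypothetical, and the demo runs on a throwaway temporary file rather than on a real dataset.

```python
import os
import tempfile


def drop_blank_lines(label_path):
    """Rewrite a label file without empty or whitespace-only lines.

    Returns the number of lines that were removed.
    """
    with open(label_path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    kept = [line for line in lines if line.strip()]
    with open(label_path, "w", encoding="utf-8") as f:
        f.writelines(kept)
    return len(lines) - len(kept)


# Demo on a temporary file with a blank line after each entry.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("a.jpg\toverruns\n\nb.jpg\tflags\n\n")
    removed = drop_blank_lines(demo_path)
    with open(demo_path, "r", encoding="utf-8") as f:
        cleaned = f.read()
```

Because the script overwrites the file in place, it is worth keeping a backup copy of the label file before running it on real data.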

xiaozhou0311 commented 10 months ago

[2023/12/05 15:58:17] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 3 iterations
[2023/12/05 15:58:33] ppocr ERROR: When parsing line

, error happened with msg: Traceback (most recent call last):
  File "E:\AI_Code\PaddleOCR-2.7.1\ppocr\data\simple_dataset.py", line 150, in __getitem__
    label = substr[1]
IndexError: list index out of range

[2023/12/05 15:58:37] ppocr INFO: cur metric, precision: 0.8888888888888888, recall: 1.0, hmean: 0.9411764705882353, fps: 1.2720649282331695
[2023/12/05 15:58:37] ppocr INFO: save best model is to ./output/ch_PP-OCR_V3_det/best_accuracy
[2023/12/05 15:58:37] ppocr INFO: best metric, hmean: 0.9411764705882353, is_float16: False, precision: 0.8888888888888888, recall: 1.0, fps: 1.2720649282331695, best_epoch: 1
[2023/12/05 15:58:53] ppocr ERROR: When parsing line

, error happened with msg: Traceback (most recent call last):
  File "E:\AI_Code\PaddleOCR-2.7.1\ppocr\data\simple_dataset.py", line 150, in __getitem__
    label = substr[1]
IndexError: list index out of range

[2023/12/05 15:58:57] ppocr INFO: cur metric, precision: 0.8888888888888888, recall: 1.0, hmean: 0.9411764705882353, fps: 1.2869010204678446

Why does the label = substr[1] error not appear when the data is loaded, but only once training starts?
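One plausible reading of the log above: the traceback points at `__getitem__`, not at dataset construction, so the label lines are probably only split when a sample is actually fetched. The line being parsed is also printed as empty, which is consistent with a blank line in the label file. The toy class below illustrates that lazy-parsing behaviour. `LazyLabelDataset` and its sample lines are hypothetical and are not PaddleOCR's implementation.

```python
class LazyLabelDataset:
    """Toy dataset that stores raw lines eagerly and parses them lazily."""

    def __init__(self, lines, delimiter="\t"):
        # Construction only stores the raw lines. Nothing is split here,
        # so a malformed line cannot raise at this point.
        self.lines = list(lines)
        self.delimiter = delimiter

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        substr = self.lines[idx].strip("\n").split(self.delimiter)
        # The IndexError can only surface here, when a sample is requested.
        return substr[0], substr[1]


# Building the dataset succeeds even though the second line is blank.
dataset = LazyLabelDataset(["img_1.png\tabc\n", "\n"])
first = dataset[0]

try:
    dataset[1]
    failed_on_blank = False
except IndexError:
    failed_on_blank = True
```

Under this reading, the "Initialize indexs" and "dataloader has N iters" messages only confirm that the file was read and counted, not that every line is well formed.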

alanxinn commented 10 months ago

I haven't run into that one. Sorry, I can't help.

z3lz commented 8 months ago

You can use this script to check whether the file is \t-separated:

def check_and_fix_tab_separation(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    new_lines = []
    for line in lines:
        if '\t' not in line:
            # No tab found: turn only the first space into a tab.
            # Replacing every space would also split labels that contain spaces.
            line = line.replace(' ', '\t', 1)
        new_lines.append(line)

    with open(file_path, 'w', encoding='utf-8') as file:
        file.writelines(new_lines)
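A non-destructive alternative is to report the suspicious lines first and decide how to fix them afterwards. The sketch below assumes the expected format is `path<TAB>label`, one entry per line. The function name `find_bad_label_lines`, the reason strings, and the sample lines are all hypothetical.

```python
def find_bad_label_lines(lines, delimiter="\t"):
    """Return (line_number, reason) pairs for lines that look malformed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        content = line.rstrip("\r\n")
        if not content.strip():
            # Blank lines are the cause identified earlier in this thread.
            problems.append((number, "blank line"))
        elif delimiter not in content:
            problems.append((number, "no tab separator"))
        elif content.split(delimiter)[1] == "":
            # Would not raise IndexError, but is probably unwanted.
            problems.append((number, "empty label"))
    return problems


sample = ["a.jpg\toverruns\n", "\n", "b.jpg flags\n", "c.jpg\t\n"]
report = find_bad_label_lines(sample)
```

To check a real file, pass the result of `open(path, encoding="utf-8").readlines()` as `lines`. An empty report means every line has at least a path, a tab, and a non-empty label.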

Unlicensed-driver-ljx commented 4 months ago

I ran into this too. How did you all solve it? Please share.

PaddlePaddle / PaddleOCR

Training on a custom rec dataset fails with IndexError: list index out of range #10878