PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0
43.54k stars 7.77k forks

Training on a custom rec dataset fails with IndexError: list index out of range #10878

Closed alanxinn closed 1 year ago

alanxinn commented 1 year ago

Please provide the following information to quickly locate the problem

We provide AceIssueSolver to solve issues, do you want it? (Please write yes/no):

The label file already separates the image path and the image text with \t, but the error still occurs.

DingHsun commented 11 months ago

I am running into the same error:

data['ext_data'] = self.get_ext_data()
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 124, in get_ext_data
label = substr[1]
IndexError: list index out of range

How did you end up solving it?

alanxinn commented 11 months ago

I don't remember exactly. I think it was caused by a problem with the dataset.

DingHsun commented 11 months ago

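@alanxinn I found the fix. After labeling with PPOCRLabel, I used gen_ocr_train_val_test.py to generate train.txt, val.txt and test.txt. It turned out that these three txt files contained extra blank lines between entries, which caused the read error, as shown here: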

D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_76.jpg  192.168.122.255

D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_90.jpg  overruns

D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_31.jpg  flags=73<UP,LOOPBACK,RUNNING>

After I removed the blank lines, training ran normally. The file now looks like this:

D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_76.jpg  192.168.122.255
D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_90.jpg  overruns
D:\PaddleOCR\train_data\rec\train\FAB01_Terminal_RedHat7.9_crop_31.jpg  flags=73<UP,LOOPBACK,RUNNING>

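Strangely, though, only train.txt raised the error, and train.txt is the only file I modified. The other two files were left untouched, yet training works fine. I am not sure why.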
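A minimal sketch of the cleanup described above, assuming the label file is UTF-8 and that every valid entry is a non-empty line. The function name `drop_blank_lines` is hypothetical and not part of PaddleOCR; it only removes lines that are empty or whitespace-only and leaves everything else unchanged.

```python
def drop_blank_lines(label_path):
    """Rewrite a label file without empty or whitespace-only lines.

    Returns the number of lines removed.
    Hypothetical helper for illustration, not a PaddleOCR API.
    """
    # Read every line, dropping only the trailing newline character.
    with open(label_path, "r", encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

    # Keep only lines that contain something other than whitespace.
    kept = [line for line in lines if line.strip()]

    # Write the file back with plain LF line endings.
    with open(label_path, "w", encoding="utf-8", newline="\n") as f:
        f.writelines(line + "\n" for line in kept)

    return len(lines) - len(kept)
```

Calling it on each of train.txt, val.txt and test.txt rewrites the files in place, so keeping a backup copy first is a sensible precaution.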

xiaozhou0311 commented 10 months ago

[2023/12/05 15:58:17] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 3 iterations
[2023/12/05 15:58:33] ppocr ERROR: When parsing line

, error happened with msg: Traceback (most recent call last):
  File "E:\AI_Code\PaddleOCR-2.7.1\ppocr\data\simple_dataset.py", line 150, in __getitem__
    label = substr[1]
IndexError: list index out of range

[2023/12/05 15:58:37] ppocr INFO: cur metric, precision: 0.8888888888888888, recall: 1.0, hmean: 0.9411764705882353, fps: 1.2720649282331695
[2023/12/05 15:58:37] ppocr INFO: save best model is to ./output/ch_PP-OCR_V3_det/best_accuracy
[2023/12/05 15:58:37] ppocr INFO: best metric, hmean: 0.9411764705882353, is_float16: False, precision: 0.8888888888888888, recall: 1.0, fps: 1.2720649282331695, best_epoch: 1
[2023/12/05 15:58:53] ppocr ERROR: When parsing line

, error happened with msg: Traceback (most recent call last):
  File "E:\AI_Code\PaddleOCR-2.7.1\ppocr\data\simple_dataset.py", line 150, in __getitem__
    label = substr[1]
IndexError: list index out of range

[2023/12/05 15:58:57] ppocr INFO: cur metric, precision: 0.8888888888888888, recall: 1.0, hmean: 0.9411764705882353, fps: 1.2869010204678446

Why did the label = substr[1] error not show up when the data was loaded, but only once training started?
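In the log above, nothing is printed between "When parsing line" and ", error happened", which suggests that the line being parsed was itself blank. The traceback also places the failure in the dataset's item-fetch method (`__getitem__`), which would be consistent with the error surfacing only once samples are actually iterated. A rough sketch for locating such lines before training, assuming tab-delimited entries of the form `image_path<TAB>label`; `find_bad_label_lines` is a hypothetical helper, not a PaddleOCR function. It mimics the `substr[1]` access from the traceback by splitting on the delimiter and flagging any line with fewer than two fields.

```python
def find_bad_label_lines(label_path, delimiter="\t"):
    """Return (line_number, raw_line) pairs for lines with fewer than two fields.

    Hypothetical helper for illustration, not a PaddleOCR API.
    """
    bad = []
    with open(label_path, "r", encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            # Strip the line ending, then split the same way the traceback implies.
            fields = line.rstrip("\r\n").split(delimiter)
            if len(fields) < 2:
                # Indexing fields[1] here would raise IndexError.
                bad.append((number, line))
    return bad
```

Printing `repr(line)` for each returned pair makes blank lines and space-separated entries easy to tell apart.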

alanxinn commented 10 months ago

I haven't run into that one. Sorry, I can't help.

z3lz commented 8 months ago

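Use this script to check whether the file is separated by \t, and fix it if not:

def check_and_fix_tab_separation(file_path):
    # Read all label lines into memory
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    new_lines = []
    for line in lines:
        if '\t' not in line:
            # No tab found: replace only the first space with a tab, so that
            # spaces inside the label text are preserved
            # (assumes the image path itself contains no spaces)
            line = line.replace(' ', '\t', 1)
        new_lines.append(line)

    # Write the corrected lines back to the same file
    with open(file_path, 'w', encoding='utf-8') as file:
        file.writelines(new_lines)

Unlicensed-driver-ljx commented 4 months ago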

I ran into this too. Could you share how you solved it?