PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
Apache License 2.0
40.32k stars 7.46k forks

table_recognition: missing "PubTabNet_2.0.0_train.jsonl" #8728

Closed davidzhr closed 1 year ago

davidzhr commented 1 year ago

Please provide the following complete information to quickly locate the problem

I downloaded the PubTabNet dataset (https://github.com/ibm-aur-nlp/PubTabNet). The files in the dataset are as follows:

bml@jupyter-81f6137aa990a1ed-0:~/storage/pubtabnet$ ls -al
total 4003998
drwxr-xr-x 5 bml bml       4096 Jul 21  2020 .
drwxr-xr-x 7 bml bml       4096 Dec 28 10:24 ..
-rw-r--r-- 1 bml bml 4076381813 Jul  8  2020 PubTabNet_2.0.0.jsonl
drwxr-xr-x 3 bml bml     430080 Dec 28 09:53 test
drwxr-xr-x 2 bml bml   22851584 Jul 21  2020 train
drwxr-xr-x 2 bml bml     421888 Jul 21  2020 val

As you can see, the dataset really does not contain the file named in the error. How should I handle this?

davidzhr commented 1 year ago

@tink2123

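The config file I am using is SLANet.yml. It expects two datasets, train and val. How should I split the original dataset and clean the data into the format the config file requires?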


andyjiang1116 commented 1 year ago

You can process the data according to the split field in PubTabNet_2.0.0.jsonl and divide it into three parts: train, val and test.
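The split suggested above could be sketched roughly as follows. This is a minimal sketch, not an official PaddleOCR tool: the function name `split_jsonl` and the output naming pattern `PubTabNet_2.0.0_<split>.jsonl` are assumptions chosen to match the file name in the issue title. It reads the file line by line, because the combined file is about 4 GB according to the listing above.

```python
import json
from pathlib import Path


def split_jsonl(src, out_dir, stem="PubTabNet_2.0.0"):
    """Write one JSONL file per distinct value of the "split" field.

    `stem` is a hypothetical naming choice; output files are named
    <stem>_<split>.jsonl, e.g. PubTabNet_2.0.0_train.jsonl.
    Returns a dict mapping each split name to its record count.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}  # split name -> open output file
    counts = {}   # split name -> number of records written
    try:
        with open(src, "r", encoding="utf-8") as fin:
            for line in fin:  # stream; never load the whole file
                if not line.strip():
                    continue  # skip blank lines
                split = json.loads(line)["split"]
                if split not in handles:
                    path = out_dir / f"{stem}_{split}.jsonl"
                    handles[split] = open(path, "w", encoding="utf-8")
                    counts[split] = 0
                # Copy the record unchanged, one record per line.
                handles[split].write(line if line.endswith("\n") else line + "\n")
                counts[split] += 1
    finally:
        for fh in handles.values():
            fh.close()
    return counts
```

Calling `split_jsonl("PubTabNet_2.0.0.jsonl", ".")` from the dataset directory would then produce one label file per split value found in the data, and those files can be referenced from `label_file_list` in the config.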

nissansz commented 8 months ago

C:
cd C:\F\PaddleOCR-release-2.6
py -3 tools/train.py -c C:/F/SLANet_ch_border.yml -o Global.epoch_num=1 Global.pretrained_model="C:/Users/Administrator/Desktop/tableBorder/best_accuracy" Train.dataset.name='PubTabDataSet' Eval.dataset.name='PubTabDataSet' Train.dataset.data_dir='C:/F/wtw/pubtabnet/val/' Train.dataset.label_file_list=[C:/F/WTW/PubTabNet_2.0.0_val.jsonl] Eval.dataset.data_dir='C:/F/wtw/pubtabnet/val/' Eval.dataset.label_file_list=[C:/F/WTW/PubTabNet_2.0.0_val.jsonl] Train.loader.num_workers=0 Global.use_gpu=True Global.save_epoch_step=2000 Global.character_dict_path='C:/Users/Administrator/Desktop/tableBorder/table_structure_dict_ch_99span.txt' Global.eval_batch_step=[0,2000] Global.print_batch_step=100 Global.save_model_dir="C:/Users/Administrator/Desktop/tableBorder" Train.loader.batch_size_per_card=8 Train.loader.num_workers=0 Eval.loader.batch_size_per_card=8 Eval.loader.num_workers=0 Optimizer.lr.name=Const Optimizer.lr.learning_rate=0.0005

The cell coordinates in the jsonl do not match what Paddle expects, which causes the error above. How can I fix this?

The JSON coordinates are 4 numbers, while the Paddle label uses four points, i.e. 8 numbers.

{"imgid": 548625, "html": {"cells": [{"tokens": []}, {"tokens": ["", "W", "e", "a", "n", "i", "n", "g", ""], "bbox": [66, 4, 96, 13]}, {"tokens": ["", "W", "e", "e", "k", " ", "1", "5", ""], "bbox": [131, 4, 160, 13]}, {"tokens": ["", "O", "f", "f", "-", "t", "e", "s", "t", ""], "bbox": [201, 4, 226, 13]}, {"tokens": ["W", "e", "a", "n", "i", "n", "g"], "bbox": [1, 17, 31, 26]}, {"tokens": ["–"], "bbox": [66, 21, 72, 25]}, {"tokens": ["–"], "bbox": [131, 21, 137, 25]}, {"tokens": ["–"], "bbox": [201, 21, 207, 25]}, {"tokens": ["W", "e", "e", "k", " ", "1", "5"], "bbox": [1, 31, 30, 40]}, {"tokens": ["–"], "bbox": [66, 35, 72, 39]}, {"tokens": ["0", ".", "1", "7", " ", "±", " ", "0", ".", "0", "8"], "bbox": [131, 31, 166, 40]}, {"tokens": ["0", ".", "1", "6", " ", "±", " ", "0", ".", "0", "3"], "bbox": [201, 31, 236, 40]}, {"tokens": ["O", "f", "f", "-", "t", "e", "s", "t"], "bbox": [1, 45, 26, 54]}, {"tokens": ["–"], "bbox": [66, 49, 72, 53]}, {"tokens": ["0", ".", "8", "0", " ", "±", " ", "0", ".", "2", "4"], "bbox": [131, 45, 166, 54]}, {"tokens": ["0", ".", "1", "9", " ", "±", " ", "0", ".", "0", "9"], "bbox": [201, 45, 236, 54]}], "structure": {"tokens": ["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]}}, "split": "val", "filename": "PMC5755158_010_01.png"}