PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0
43.96k stars 7.8k forks source link

Cannot reproduce the reported TEDS metric for table structure recognition #7038

Closed wangyihi closed 2 years ago

wangyihi commented 2 years ago

Please provide the following information to quickly locate the problem

Steps taken:

  1. Convert the val split of the PubTabNet dataset into a ground-truth file. The converted file is the test.json passed on the command line.
  2. Using the v2.0 detection and recognition models and running the command as above, I get a TEDS of 73.46, while the documentation reports 93.23. Along the way I fixed a bug. The original function in PaddleOCR-release-2.5\ppstructure\table/eval_table.py:

         def get_gt_html(gt_structures, gt_contents):
             end_html = []
             td_index = 0
             for tag in gt_structures:
                 if '</td>' in tag:
                     if gt_contents[td_index] != []:
                         end_html.extend(gt_contents[td_index])
                     end_html.append(tag)
                     td_index += 1
                 else:
                     end_html.append(tag)
             return ''.join(end_html), end_html

     The modified function:

         def get_gt_html(gt_structures, gt_contents):
             print("gt_contents--------------", len(gt_contents))
             end_html = []
             td_index = 0
             for tag in gt_structures:
                 if '</td>' in tag:
                     # Only read a content entry if one exists for this cell.
                     if td_index < len(gt_contents):
                         if gt_contents[td_index] != []:
                             end_html.extend(gt_contents[td_index])
                     end_html.append(tag)
                     td_index += 1
                 else:
                     end_html.append(tag)
             return ''.join(end_html), end_html

  3. The v3.0 models raise an error. Command:

         CUDA_VISIBLE_DEVICES=0 python table/eval_table.py \
             --det_model_dir=./inference/ch_PP-OCRv3_det_infer \
             --rec_model_dir=./inference/ch_PP-OCRv3_rec_infer \
             --table_model_dir=./inference/ch_ppocr_mobile_v2.0_cls_infer \
             --image_dir='' \
             --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt \
             --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \
             --det_limit_side_len=736 \
             --det_limit_type=min \
             --gt_path=./test.json \
             --use_gpu True

     Error message:

         Traceback (most recent call last):
           File "table/eval_table.py", line 78, in <module>
             main(args.gt_path,args.image_dir, args)
           File "table/eval_table.py", line 47, in main
             pred_html = text_sys(img)
           File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppstructure/table/predict_table.py", line 73, in __call__
             structure_res, elapse = self.table_structurer(copy.deepcopy(img))
           File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppstructure/table/predict_structure.py", line 89, in __call__
             preds['structure_probs'] = outputs[1]
         IndexError: list index out of range

     The value of outputs is: [[2.0312889e-04 9.9979693e-01]]

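The effect of the index guard from step 2 can be seen on a toy input. This is a minimal sketch: the name `get_gt_html_guarded`, the `'</td>'` cell marker, and the toy tokens are assumptions for illustration, not PubTabNet data or the repository's code.

```python
def get_gt_html_guarded(gt_structures, gt_contents):
    """Interleave cell contents into structure tokens, tolerating short contents."""
    end_html = []
    td_index = 0
    for tag in gt_structures:
        if '</td>' in tag:  # assumed cell-closing marker
            # Guard: only read a content entry if one exists for this cell.
            if td_index < len(gt_contents) and gt_contents[td_index] != []:
                end_html.extend(gt_contents[td_index])
            end_html.append(tag)
            td_index += 1
        else:
            end_html.append(tag)
    return ''.join(end_html), end_html


# Toy table: one row with three cells, but only two content entries.
structures = ['<tr>', '<td>', '</td>', '<td>', '</td>', '<td>', '</td>', '</tr>']
contents = [['a'], ['b']]
html, _ = get_gt_html_guarded(structures, contents)
print(html)
```

The guard removes the IndexError, but it only fills cells in order until the contents run out. Whether each content entry ends up in the cell it came from is a separate question.
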
andyjiang1116 commented 2 years ago

For the v3.0 run, look at your command: table_model_dir=./inference/ch_ppocr_mobile_v2.0_cls_infer looks like the wrong model. You passed the direction classifier as the table model.
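
The traceback is consistent with this: predict_structure.py reads `outputs[1]`, while the run produced a single output holding two probabilities. A pre-flight check along the following lines could surface such a mix-up earlier. The helper `check_structure_outputs` and the minimum of two outputs are assumptions inferred from the traceback, not part of PaddleOCR.

```python
def check_structure_outputs(outputs, min_outputs=2):
    """Raise a descriptive error if a model returns too few output tensors.

    Hypothetical helper: min_outputs=2 is inferred from the traceback, where
    the table structure predictor indexes outputs[1].
    """
    if len(outputs) < min_outputs:
        raise ValueError(
            f"expected at least {min_outputs} outputs from a table structure "
            f"model, got {len(outputs)}; check --table_model_dir"
        )
    return outputs


# The single output reported in the issue: one row of two probabilities.
cls_like_outputs = [[[2.0312889e-04, 9.9979693e-01]]]

message = ""
try:
    check_structure_outputs(cls_like_outputs)
except ValueError as err:
    message = str(err)
print(message)
```
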

wangyihi commented 2 years ago

Right, the model in the v3.0 command was indeed wrong. After correcting it:

    python table/eval_table.py \
        --det_model_dir=./inference/ch_PP-OCRv3_det_infer \
        --rec_model_dir=./inference/ch_PP-OCRv3_rec_infer \
        --table_model_dir=./inference/en_ppocr_mobile_v2.0_table_structure_infer \
        --image_dir='' \
        --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt \
        --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \
        --det_limit_side_len=736 \
        --det_limit_type=min \
        --gt_path=./test.json \
        --use_gpu True

It still fails, with the following error:

    Traceback (most recent call last):
      File "table/eval_table.py", line 78, in <module>
        main(args.gt_path,args.image_dir, args)
      File "table/eval_table.py", line 47, in main
        pred_html = text_sys(img)
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppstructure/table/predict_table.py", line 105, in __call__
        rec_res, elapse = self.text_recognizer(img_crop_list)
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/tools/infer/predict_rec.py", line 394, in __call__
        rec_result = self.postprocess_op(preds)
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppocr/postprocess/rec_postprocess.py", line 104, in __call__
        text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppocr/postprocess/rec_postprocess.py", line 71, in decode
        for text_id in text_index[batch_idx][selection]
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppocr/postprocess/rec_postprocess.py", line 71, in <listcomp>
        for text_id in text_index[batch_idx][selection]
    IndexError: list index out of range

WenmuZhou commented 2 years ago

With the v3 model, use the default recognition dictionary.
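
The IndexError in rec_postprocess.py fits a dictionary mismatch: the decoder looks up each predicted class index in the character list built from --rec_char_dict_path, so a model trained with a larger dictionary can emit indices that a smaller dictionary does not contain. Here is a toy sketch of that failure mode. `decode_indices` and the tiny character lists are illustrative assumptions, not PaddleOCR's decoder.

```python
def decode_indices(indices, characters):
    """Map predicted class indices to characters (toy stand-in for a decoder)."""
    return ''.join(characters[i] for i in indices)


# Hypothetical dictionaries: the model was trained against the larger one.
large_dict = ['blank', 'a', 'b', 'c', 'd', 'e']
small_dict = ['blank', 'a', 'b']

predicted = [1, 4, 5]  # indices that are valid only for the larger dictionary

# With the matching dictionary every index resolves to a character.
print(decode_indices(predicted, large_dict))

# With the smaller dictionary the lookup runs past the end of the list.
mismatch_error = None
try:
    decode_indices(predicted, small_dict)
except IndexError as err:
    mismatch_error = err
print("IndexError:", mismatch_error)
```
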

wangyihi commented 2 years ago

> With the v3 model, use the default recognition dictionary.

  1. I used the default recognition dictionary, with the following command:

         CUDA_VISIBLE_DEVICES=0,1,2 python table/eval_table.py \
             --det_model_dir=./inference/ch_PP-OCRv3_det_infer \
             --rec_model_dir=./inference/ch_PP-OCRv3_rec_infer \
             --table_model_dir=./inference/en_ppocr_mobile_v2.0_table_structure_infer \
             --image_dir='' \
             --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \
             --det_limit_side_len=736 \
             --det_limit_type=min \
             --gt_path=./test.json \
             --use_gpu True \
             --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt

     This runs out of GPU memory. I then set max_batch_size to 1 and it still runs out of GPU memory. I then resized the images to at most (500, 600); after running for a while it runs out of GPU memory again, so on inspection the problem does not appear to be related to image size. I have three GPUs and ran run_check(), with the following result:

         paddle.utils.run_check()
         Running verify PaddlePaddle program ...
         W0803 03:00:07.675694 5727 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.1, Runtime API Version: 11.1
         W0803 03:00:07.683493 5727 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
         PaddlePaddle works well on 1 GPU.
         W0803 03:00:09.219238 5727 parallel_executor.cc:642] Cannot enable P2P access from 0 to 2
         W0803 03:00:09.876346 5727 parallel_executor.cc:642] Cannot enable P2P access from 1 to 2
         W0803 03:00:09.876381 5727 parallel_executor.cc:642] Cannot enable P2P access from 2 to 0
         W0803 03:00:09.876385 5727 parallel_executor.cc:642] Cannot enable P2P access from 2 to 1
         W0803 03:00:11.250126 5727 fuse_all_reduce_op_pass.cc:76] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 2.
         PaddlePaddle works well on 3 GPUs.
         PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

     With the v2.0 models, loading the default recognition dictionary also runs out of GPU memory, even after changing max_batch_size, resizing the images, and so on. With v2.0 and the table_dict.txt dictionary it runs normally, but the TEDS is only 74%. Any pointers would be appreciated, thanks.

  2. It does run on CPU, with this command:

         python table/eval_table.py \
             --det_model_dir=./inference/ch_PP-OCRv3_det_infer \
             --rec_model_dir=./inference/ch_PP-OCRv3_rec_infer \
             --table_model_dir=./inference/en_ppocr_mobile_v2.0_table_structure_infer \
             --image_dir='' \
             --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \
             --det_limit_side_len=736 \
             --det_limit_type=min \
             --gt_path=./test.json \
             --use_gpu False \
             --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt

     But the TEDS is even lower, only 71.46.
  3. Here is my code for converting the JSON annotations to the ground-truth file. Could someone take a look? Thanks. The code:

         import json
         import os

         import jsonlines


         def data_process(data):
             img_name = data["filename"]
             img_path = os.path.join("D:/pubtabnet/val/val", img_name)

             # Wrap the structure tokens in the outer HTML tags.
             html = data['html']["structure"]['tokens']
             html = ["<html>", "<body>", "<table>"] + html + ["</table>", "</body>", "</html>"]
             tokens = []
             bboxes = []
             for cell in data['html']["cells"]:
                 # Cells with no tokens or no bbox are skipped.
                 if len(cell['tokens']) == 0 or "bbox" not in cell.keys():
                     continue
                 tokens.append(cell['tokens'])
                 bboxes.append(cell['bbox'])

             label = [html, bboxes, tokens]
             return img_path, label


         if __name__ == "__main__":
             datas = {}
             idx = 0
             with jsonlines.open("D:/pubtabnet/PubTabNet_2.0.0.jsonl", "r") as f:
                 for data in f:
                     if data['split'] == 'val':
                         img_path, label = data_process(data)
                         datas[img_path] = label
             json.dump(datas, open("test.json", "w"), indent=2, ensure_ascii=True)
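
One thing that may be worth checking in the converted file, offered as an assumption rather than a confirmed diagnosis: get_gt_html advances td_index once per cell-closing tag, while data_process skips cells that have no tokens, so the structure and the contents list can differ in length. The sketch counts both on a toy label in the [html, bboxes, tokens] layout. `count_cells_and_contents`, the `'</td>'` marker, and the toy values are illustrative assumptions.

```python
def count_cells_and_contents(label, cell_marker='</td>'):
    """Return (number of cell-closing tags, number of content entries)."""
    html_tokens, _bboxes, contents = label
    n_cells = sum(1 for tok in html_tokens if cell_marker in tok)
    return n_cells, len(contents)


# Toy label: three cells in the structure; the middle cell was empty and
# was therefore skipped when the contents list was built.
toy_label = [
    ['<table>', '<tr>', '<td>', '</td>', '<td>', '</td>',
     '<td>', '</td>', '</tr>', '</table>'],
    [[0, 0, 10, 10], [20, 0, 30, 10]],
    [['a'], ['c']],
]

n_cells, n_contents = count_cells_and_contents(toy_label)
print(n_cells, n_contents)
if n_cells != n_contents:
    print("structure and contents are not aligned one-to-one")
```

Running the same count over every entry of the real test.json would show how many tables, if any, are affected.
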

WenmuZhou commented 2 years ago

Try the 2.6 code. The documentation provides detailed commands: https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/ppstructure/table/README_ch.md