Cannot reproduce the reported TEDS metric for table structure recognition

wangyihi commented 2 years ago

Please provide the following information to quickly locate the problem

System Environment: Ubuntu 18.04.6 LTS
Version: Paddle: 1.0.2
PaddleOCR: 2.5.0.3
Related components: ppstructure
Command Code:

    CUDA_VISIBLE_DEVICES=0 python table/eval_table.py \
        --det_model_dir=./inference/en_ppocr_mobile_v2.0_table_det_infer \
        --rec_model_dir=./inference/en_ppocr_mobile_v2.0_table_rec_infer \
        --table_model_dir=./inference/en_ppocr_mobile_v2.0_table_structure_infer \
        --image_dir='' \
        --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt \
        --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \
        --det_limit_side_len=736 \
        --det_limit_type=min \
        --gt_path=./test.json \
        --use_gpu True

Complete Error Message: I cannot reproduce the TEDS of 93.3 reported in the documentation.
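
For context, TEDS (tree-edit-distance-based similarity), as defined alongside the PubTabNet dataset, compares the predicted and ground-truth HTML as trees. A sketch of the definition, where $T_a$ and $T_b$ are the two trees, $|T|$ is the number of nodes in a tree, and $\mathrm{EditDist}$ is the tree edit distance:

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

The percentages quoted in this thread are presumably this score averaged over the evaluated images and scaled by 100. Because cell text is part of the tree, a ground-truth file whose contents are attached to the wrong cells can lower the score even when the predictions are good.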

Steps taken:

I converted the val split of the PubTabNet dataset into a ground-truth file. The converted file is the test.json passed on the command line.
With the v2.0 detection and recognition models and the command above, I get a TEDS of 73.46, while the documentation reports 93.23. Along the way I patched one bug. The original function in PaddleOCR-release-2.5\ppstructure\table/eval_table.py:

    def get_gt_html(gt_structures, gt_contents):
        end_html = []
        td_index = 0
        for tag in gt_structures:
            if '</td>' in tag:
                if gt_contents[td_index] != []:
                    end_html.extend(gt_contents[td_index])
                end_html.append(tag)
                td_index += 1
            else:
                end_html.append(tag)
        return ''.join(end_html), end_html

The modified function:

    def get_gt_html(gt_structures, gt_contents):
        print("gt_contents--------------", len(gt_contents))
        end_html = []
        td_index = 0
        for tag in gt_structures:
            if '</td>' in tag:
                # Added guard: skip the lookup when td_index runs past gt_contents.
                if td_index < len(gt_contents):
                    if gt_contents[td_index] != []:
                        end_html.extend(gt_contents[td_index])
                end_html.append(tag)
                td_index += 1
            else:
                end_html.append(tag)
        return ''.join(end_html), end_html
The v3.0 models raise an error. Command:

    CUDA_VISIBLE_DEVICES=0 python table/eval_table.py \
        --det_model_dir=./inference/ch_PP-OCRv3_det_infer \
        --rec_model_dir=./inference/ch_PP-OCRv3_rec_infer \
        --table_model_dir=./inference/ch_ppocr_mobile_v2.0_cls_infer \
        --image_dir='' \
        --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt \
        --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \
        --det_limit_side_len=736 \
        --det_limit_type=min \
        --gt_path=./test.json \
        --use_gpu True

Error message:

    Traceback (most recent call last):
      File "table/eval_table.py", line 78, in <module>
        main(args.gt_path,args.image_dir, args)
      File "table/eval_table.py", line 47, in main
        pred_html = text_sys(img)
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppstructure/table/predict_table.py", line 73, in __call__
        structure_res, elapse = self.table_structurer(copy.deepcopy(img))
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppstructure/table/predict_structure.py", line 89, in __call__
        preds['structure_probs'] = outputs[1]
    IndexError: list index out of range

The printed value of outputs is: [[2.0312889e-04 9.9979693e-01]]

andyjiang1116 commented 2 years ago

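For the v3.0 run, look at your command: table_model_dir=./inference/ch_ppocr_mobile_v2.0_cls_infer points at the wrong model. You passed the direction classifier there instead of the table structure model.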
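
The mismatch described here could be caught before the IndexError with a small guard. This is only a sketch: `check_structure_outputs` is a hypothetical helper, not part of PaddleOCR, and it assumes only what the traceback shows, namely that the table structure predictor reads `outputs[1]` and therefore needs at least two output arrays.

```python
def check_structure_outputs(outputs):
    """Fail with a readable message when the loaded model is not a table structure model.

    Assumption: a table structure model yields at least two output arrays
    (the traceback indexes outputs[1]); a direction classifier yields one.
    """
    if len(outputs) < 2:
        raise ValueError(
            "expected at least 2 output arrays from the table structure model, "
            f"got {len(outputs)}; check --table_model_dir"
        )
    return outputs[0], outputs[1]


# The value printed in the report: one array holding two class probabilities.
classifier_like = [[[2.0312889e-04, 9.9979693e-01]]]
try:
    check_structure_outputs(classifier_like)
except ValueError as err:
    print(err)
```

A single array of two probabilities is what a two-class classifier would produce, which is consistent with the diagnosis that the direction classifier was loaded.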

wangyihi commented 2 years ago

Right, the model path for v3.0 was indeed wrong. After correcting it:

    python table/eval_table.py \
        --det_model_dir=./inference/ch_PP-OCRv3_det_infer \
        --rec_model_dir=./inference/ch_PP-OCRv3_rec_infer \
        --table_model_dir=./inference/en_ppocr_mobile_v2.0_table_structure_infer \
        --image_dir='' \
        --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt \
        --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \
        --det_limit_side_len=736 \
        --det_limit_type=min \
        --gt_path=./test.json \
        --use_gpu True

it still fails, with the following error:

    Traceback (most recent call last):
      File "table/eval_table.py", line 78, in <module>
        main(args.gt_path,args.image_dir, args)
      File "table/eval_table.py", line 47, in main
        pred_html = text_sys(img)
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppstructure/table/predict_table.py", line 105, in __call__
        rec_res, elapse = self.text_recognizer(img_crop_list)
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/tools/infer/predict_rec.py", line 394, in __call__
        rec_result = self.postprocess_op(preds)
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppocr/postprocess/rec_postprocess.py", line 104, in __call__
        text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppocr/postprocess/rec_postprocess.py", line 71, in decode
        for text_id in text_index[batch_idx][selection]
      File "/home/ubuntu/xf/tabular_recongnition/PaddleOCR-release-2.5/ppocr/postprocess/rec_postprocess.py", line 71, in <listcomp>
        for text_id in text_index[batch_idx][selection]
    IndexError: list index out of range

WenmuZhou commented 2 years ago

With the v3 models, use the default recognition dictionary.
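
One plausible reading of this advice is that the IndexError in rec_postprocess.py comes from the recognition model predicting class indices that the supplied dictionary does not cover. The toy greedy CTC decoder below illustrates that failure mode. It is not PaddleOCR's implementation: `build_charset` and `ctc_greedy_decode` are hypothetical names, and the blank-at-index-0 layout is an assumption.

```python
def build_charset(dict_chars, use_space_char=True):
    """Build the index-to-character table used for decoding.

    Assumption: index 0 is the CTC blank, followed by the dictionary
    characters and, optionally, a trailing space character.
    """
    chars = list(dict_chars)
    if use_space_char:
        chars.append(" ")
    return ["blank"] + chars


def ctc_greedy_decode(indices, charset):
    """Collapse repeated indices, drop blanks, and map the rest to characters."""
    decoded = []
    previous = None
    for idx in indices:
        if idx != previous and idx != 0:
            if idx >= len(charset):
                raise IndexError(
                    f"class index {idx} is outside the dictionary (size {len(charset)})"
                )
            decoded.append(charset[idx])
        previous = idx
    return "".join(decoded)


small_charset = build_charset("ab", use_space_char=False)
print(ctc_greedy_decode([1, 1, 0, 2], small_charset))
```

If a model trained on a large dictionary is decoded with a much smaller one, any predicted index beyond the smaller table triggers this kind of error, so pairing each recognition model with the dictionary it was trained on avoids it.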

wangyihi commented 2 years ago

> With the v3 models, use the default recognition dictionary.

I used the default recognition dictionary, with this command:

    CUDA_VISIBLE_DEVICES=0,1,2 python table/eval_table.py \
        --det_model_dir=./inference/ch_PP-OCRv3_det_infer \
        --rec_model_dir=./inference/ch_PP-OCRv3_rec_infer \
        --table_model_dir=./inference/en_ppocr_mobile_v2.0_table_structure_infer \
        --image_dir='' \
        --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \
        --det_limit_side_len=736 \
        --det_limit_type=min \
        --gt_path=./test.json \
        --use_gpu True \
        --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt

It runs out of GPU memory. I then set max_batch_size to 1 and it still ran out of memory. I then resized the images to at most (500, 600), and after running for a while it ran out of memory again, so on inspection the problem does not seem to be related to image size. I am using three GPUs and ran run_check(), with this result:

    paddle.utils.run_check()
    Running verify PaddlePaddle program ...
    W0803 03:00:07.675694 5727 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.1, Runtime API Version: 11.1
    W0803 03:00:07.683493 5727 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
    PaddlePaddle works well on 1 GPU.
    W0803 03:00:09.219238 5727 parallel_executor.cc:642] Cannot enable P2P access from 0 to 2
    W0803 03:00:09.876346 5727 parallel_executor.cc:642] Cannot enable P2P access from 1 to 2
    W0803 03:00:09.876381 5727 parallel_executor.cc:642] Cannot enable P2P access from 2 to 0
    W0803 03:00:09.876385 5727 parallel_executor.cc:642] Cannot enable P2P access from 2 to 1
    W0803 03:00:11.250126 5727 fuse_all_reduce_op_pass.cc:76] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 2.
    PaddlePaddle works well on 3 GPUs.
    PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

With the v2.0 models, loading the default recognition dictionary also runs out of GPU memory, even after changing max_batch_size, resizing the images, and so on. However, v2.0 with the table_dict.txt dictionary runs normally, except that the TEDS is only 74%. I would appreciate any pointers, thanks.

It does run on CPU. Command:

    python table/eval_table.py \
        --det_model_dir=./inference/ch_PP-OCRv3_det_infer \
        --rec_model_dir=./inference/ch_PP-OCRv3_rec_infer \
        --table_model_dir=./inference/en_ppocr_mobile_v2.0_table_structure_infer \
        --image_dir='' \
        --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \
        --det_limit_side_len=736 \
        --det_limit_type=min \
        --gt_path=./test.json \
        --use_gpu False \
        --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt

But the TEDS is even lower, only 71.46.
I am posting my code for converting the JSON to the gt file. Could someone please take a look? Thanks. The code is as follows:

    import json
    import os

    import jsonlines


    def data_process(data):
        img_name = data["filename"]
        img_path = os.path.join("D:/pubtabnet/val/val", img_name)

        html = data['html']["structure"]['tokens']
        html = ["<html>", "<body>", "<table>"] + html + ["</table>", "</body>", "</html>"]
        tokens = []
        bboxes = []
        for cell in data['html']["cells"]:
            # Cells with no tokens or no bbox are skipped entirely.
            if len(cell['tokens']) == 0 or "bbox" not in cell.keys():
                continue
            tokens.append(cell['tokens'])
            bboxes.append(cell['bbox'])

        label = [html, bboxes, tokens]
        return img_path, label


    if __name__ == "__main__":
        datas = {}
        idx = 0
        with jsonlines.open("D:/pubtabnet/PubTabNet_2.0.0.jsonl", "r") as f:
            for data in f:
                # Keep only the validation split.
                if data['split'] == 'val':
                    img_path, label = data_process(data)
                    datas[img_path] = label
        json.dump(datas, open("test.json", "w"), indent=2, ensure_ascii=True)
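
One thing worth checking in the conversion above is that it drops empty cells, while `get_gt_html` walks the structure tokens and advances its content index once per `</td>`. If the evaluator expects one content entry per cell (its `!= []` check suggests so, but this is an assumption), dropping entries would shift later cell text into earlier cells and could explain both the out-of-range index and the low TEDS. The sketch below keeps an empty placeholder for every cell. `build_label` is a hypothetical name, the sample record is made up for illustration, and `get_gt_html` is a simplified stand-in for the evaluator's function.

```python
def build_label(record):
    """Return [structure_tokens, bboxes, contents] with one entry per cell."""
    structure = record["html"]["structure"]["tokens"]
    structure = ["<html>", "<body>", "<table>"] + structure + ["</table>", "</body>", "</html>"]
    bboxes, contents = [], []
    for cell in record["html"]["cells"]:
        # Keep empty cells as placeholders so index i still means "the i-th </td>".
        contents.append(cell.get("tokens", []))
        bboxes.append(cell.get("bbox", []))
    return [structure, bboxes, contents]


def get_gt_html(gt_structures, gt_contents):
    """Insert each cell's tokens just before its closing </td> tag."""
    html = []
    td_index = 0
    for tag in gt_structures:
        if "</td>" in tag:
            html.extend(gt_contents[td_index])
            td_index += 1
        html.append(tag)
    return "".join(html)


# Made-up record: the first cell is empty, the second holds the text "42".
record = {
    "html": {
        "structure": {"tokens": ["<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]},
        "cells": [{"tokens": []}, {"tokens": ["4", "2"], "bbox": [0, 0, 10, 10]}],
    }
}
label = build_label(record)
print(get_gt_html(label[0], label[2]))
# prints <html><body><table><tr><td></td><td>42</td></tr></table></body></html>
```

With the empty cell skipped instead, "42" would be written into the first cell and the second lookup would run past the end of the content list, which matches the situation the added bounds check was hiding.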

WenmuZhou commented 2 years ago

Try the release 2.6 code. The documentation there gives detailed commands: https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/ppstructure/table/README_ch.md

PaddlePaddle / PaddleOCR

Cannot reproduce the reported TEDS metric for table structure recognition #7038