PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0
44.25k stars 7.82k forks

Chinese table training error: File "/home/aistudio/PaddleOCR/ppocr/data/pubtab_dataset.py", line 117, in __getitem__ outs = transform(data, self.ops) IndexError: list index out of range #13796

Closed freezehe closed 2 months ago

freezehe commented 2 months ago

🐛 Bug (Problem description)

This error is raised when training on Chinese tables. The log starts with a fragment of the dumped label data (18], [1250, 18], [) and then prints the traceback:

  File "/home/aistudio/PaddleOCR/ppocr/data/pubtab_dataset.py", line 117, in __getitem__
    outs = transform(data, self.ops)
  File "/home/aistudio/PaddleOCR/ppocr/data/imaug/__init__.py", line 72, in transform
    data = op(data)
  File "/home/aistudio/PaddleOCR/ppocr/data/imaug/label_ops.py", line 731, in __call__
    if "bbox" in cells[bbox_idx] and len(cells[bbox_idx]["tokens"]) > 0:
IndexError: list index out of range

🏃‍♂️ Environment (Runtime environment)

aiofiles==23.2.1 aiohttp==3.9.5 aiosignal==1.3.1 aistudio-sdk @ file:///home/aistudio/aistudio_sdk-0.2.4-py3-none-any.whl#sha256=d93411cc8764e465860cbf2f97f787dddd1548595d4776c97ddf0ea787dedd81 albucore==0.0.14 albumentations==1.4.10 altair==4.2.2 annotated-types==0.6.0 anyio==4.3.0 astor==0.8.1 asttokens==2.4.1 async-timeout==4.0.3 attrs==23.2.0 Babel==2.14.0 bce-python-sdk==0.9.6 beautifulsoup4==4.12.3 blinker==1.7.0 cachetools==5.3.3 certifi==2024.2.2 charset-normalizer==3.3.2 click==8.1.7 colorama==0.4.6 coloredlogs==15.0.1 colorlog==6.8.2 comm==0.2.2 contourpy==1.2.1 cycler==0.12.1 Cython==3.0.11 datasets==2.19.0 debugpy==1.8.1 decorator==5.1.1 dill==0.3.4 easydict==1.13 entrypoints==0.4 exceptiongroup==1.2.1 executing==2.0.1 fastapi==0.110.2 ffmpy==0.3.2 filelock==3.13.4 fire==0.6.0 Flask==3.0.3 flask-babel==4.0.0 flatbuffers==24.3.25 fonttools==4.51.0 frozenlist==1.4.1 fsspec==2024.3.1 future==1.0.0 gitdb==4.0.11 GitPython==3.1.43 gradio==3.40.0 gradio_client==0.15.1 gunicorn==22.0.0 h11==0.14.0 httpcore==1.0.5 httpx==0.27.0 huggingface-hub==0.22.2 humanfriendly==10.0 idna==3.7 imageio==2.35.1 imgaug==0.4.0 importlib_metadata==7.1.0 importlib_resources==6.4.0 ipykernel==6.29.4 ipython==8.23.0 itsdangerous==2.2.0 jedi==0.19.1 jieba==0.42.1 Jinja2==3.1.3 joblib==1.4.0 jsonschema==4.21.1 jsonschema-specifications==2023.12.1 jupyter_client==8.6.1 jupyter_core==5.7.2 kiwisolver==1.4.5 lazy_loader==0.4 linkify-it-py==2.0.3 lmdb==1.5.1 lxml==5.3.0 markdown-it-py==2.2.0 MarkupSafe==2.1.5 matplotlib==3.8.4 matplotlib-inline==0.1.7 mdit-py-plugins==0.3.3 mdurl==0.1.1 mpmath==1.3.0 multidict==6.0.5 multiprocess==0.70.12.2 nest-asyncio==1.6.0 networkx==3.3 numpy==1.26.4 onnx==1.16.0 onnxruntime==1.17.3 opencv-contrib-python==4.10.0.84 opencv-python==4.9.0.80 opencv-python-headless==4.10.0.84 opt-einsum==3.3.0 orjson==3.10.1 packaging==24.0 paddle2onnx==1.2.1 paddlefsl==1.1.0 paddlehub==2.4.0 paddlenlp==2.6.1.post0 paddleocr==2.8.1 paddlepaddle-gpu @ 
file:///tmp/paddlepaddle_gpu-2.5.2-cp310-cp310-linux_x86_64.whl#sha256=2b4a84c853c7c88ddf4984c667bfcb824cc8a28a674448099452f50c686cc1bb pandas==2.2.2 parso==0.8.4 pexpect==4.9.0 pickleshare==0.7.5 pillow==10.3.0 platformdirs==4.2.0 prettytable==3.10.0 prompt-toolkit==3.0.43 protobuf==3.20.3 psutil==5.9.8 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==16.0.0 pyarrow-hotfix==0.6 pybind11==2.12.0 pyclipper==1.3.0.post5 pycryptodome==3.20.0 pydantic==2.7.0 pydantic_core==2.18.1 pydeck==0.9.1 pydub==0.25.1 Pygments==2.17.2 Pympler==1.0.1 pyparsing==3.1.2 python-dateutil==2.9.0.post0 python-docx==1.1.2 python-multipart==0.0.9 pytz==2024.1 PyYAML==6.0.1 pyzmq==26.0.2 rapidfuzz==3.9.6 rarfile==4.2 referencing==0.34.0 requests==2.31.0 rich==13.7.1 rpds-py==0.18.0 ruff==0.4.1 safetensors==0.4.3 scikit-image==0.24.0 scikit-learn==1.4.2 scipy==1.13.0 semantic-version==2.10.0 semver==3.0.2 sentencepiece==0.2.0 seqeval==1.2.2 shapely==2.0.6 shellingham==1.5.4 six==1.16.0 smmap==5.0.1 sniffio==1.3.1 soupsieve==2.6 stack-data==0.6.3 starlette==0.37.2 streamlit==1.13.0 streamlit-image-comparison==0.0.4 sympy==1.12 termcolor==2.4.0 threadpoolctl==3.4.0 tifffile==2024.8.28 toml==0.10.2 tomlkit==0.12.0 tool-helpers==0.1.1 toolz==0.12.1 tornado==6.4 tqdm==4.66.2 traitlets==5.14.3 typer==0.12.3 typing_extensions==4.11.0 tzdata==2024.1 tzlocal==5.2 uc-micro-py==1.0.3 urllib3==2.2.1 uvicorn==0.29.0 validators==0.28.3 visualdl==2.5.3 watchdog==4.0.1 wcwidth==0.2.13 websockets==11.0.3 Werkzeug==3.0.2 xxhash==3.4.1 yarl==1.9.4 zipp==3.19.2

🌰 Minimal Reproducible Example

The training log interleaves the dumped label line of the failing sample with the traceback, and the label line is truncated in this capture. What survives of it is:

  me": "280.jpg", "html": {"structure": {"tokens": [

followed by a long token list in which most entries show up as empty strings. The only non-empty entries are four spanning cells, each split into three tokens: "<td", " rowspan=\"2\"", ">", then "<td", " colspan=\"2\"", ">", then "<td", " rowspan=\"2\"", ">", and finally "<td", " rowspan=\"9\"", ">".

The log message itself reads:

, error happened with msg: Traceback (most recent call last):
  File "/home/aistudio/PaddleOCR/ppocr/data/pubtab_dataset.py", line 117, in __getitem__
    outs = transform(data, self.ops)
  File "/home/aistudio/PaddleOCR/ppocr/data/imaug/__init__.py", line 72, in transform
    data = op(data)
  File "/home/aistudio/PaddleOCR/ppocr/data/imaug/label_ops.py", line 731, in __call__
    if "bbox" in cells[bbox_idx] and len(cells[bbox_idx]["tokens"]) > 0:
IndexError: list index out of range

BotAndyGao commented 2 months ago

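Some of your training images presumably train fine, and only certain images trigger this problem, right? Which version of PPOCRLabel were you using at the time? The most likely cause is in the gt.txt file generated when the table annotations were exported: for the affected images, the cells in the image's structure do not match the data in cells. You can try two things to fix it:

1. For each problematic image, open its Excel file, select some extra rows and columns outside the table structure, and delete them. When the model recognized the table, it detected more text or more rows and columns than the table really has. After you shrank the table, the rows and columns you did not delete are still read from the Excel file at export time, so the structure ends up with more rows and columns than there are entries in cells.
2. Upgrade PPOCRLabel to the current main branch and re-export the table annotations for all annotated images.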
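To find which images are affected before re-exporting, a small check over gt.txt can help. The sketch below is not part of PaddleOCR or PPOCRLabel. It assumes each line of gt.txt is one JSON object in the PubTabNet-style layout (a "filename" key, and "html" containing "structure"/"tokens" and "cells"), and that a cell is opened by one of the tokens "<td>", "<td" or "<td></td>". Both are assumptions, and the function names are hypothetical, so adjust them to your export. It reports every sample whose number of cell-opening tokens differs from the number of entries in cells, which is the kind of mismatch that would make cells[bbox_idx] run out of range.

```python
import json

# Assumed set of structure tokens that open a table cell; adjust if your labels differ.
CELL_OPEN_TOKENS = {"<td>", "<td", "<td></td>"}


def count_cell_tokens(structure_tokens):
    """Count the structure tokens that start a new cell."""
    return sum(1 for tok in structure_tokens if tok in CELL_OPEN_TOKENS)


def find_mismatched_samples(lines):
    """Return (line_no, filename, n_cell_tokens, n_cells) for inconsistent samples.

    `lines` is any iterable of JSON strings, e.g. an open gt.txt file.
    """
    mismatched = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample = json.loads(line)
        html = sample.get("html", {})
        tokens = html.get("structure", {}).get("tokens", [])
        cells = html.get("cells", [])
        n_tokens = count_cell_tokens(tokens)
        if n_tokens != len(cells):
            mismatched.append((line_no, sample.get("filename"), n_tokens, len(cells)))
    return mismatched


# Tiny in-memory demo with made-up samples (not taken from the real dataset).
good = json.dumps({
    "filename": "ok.jpg",
    "html": {
        "structure": {"tokens": ["<tr>", "<td>", "</td>", "</tr>"]},
        "cells": [{"tokens": ["a"], "bbox": [0, 0, 1, 1]}],
    },
})
bad = json.dumps({
    "filename": "bad.jpg",
    "html": {
        "structure": {"tokens": ["<tr>", "<td", " rowspan=\"2\"", ">", "</td>",
                                 "<td>", "</td>", "</tr>"]},
        "cells": [{"tokens": ["a"], "bbox": [0, 0, 1, 1]}],
    },
})
demo_result = find_mismatched_samples([good, bad])
print(demo_result)
```

On a real export you would pass the open file instead of the demo list, for example `with open("gt.txt", encoding="utf-8") as f: print(find_mismatched_samples(f))`, and then fix or re-export only the images it lists.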