PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Bug]: Taskflow("document_intelligence"): Illegal instruction (core dumped) #8466

Closed wencan closed 6 months ago

wencan commented 6 months ago

Software environment

paddle2onnx            1.2.2
paddlefsl              1.1.0
paddlenlp              2.8.0
paddleocr              2.7.3
paddlepaddle           2.6.1

OS: Fedora Linux 40 (Workstation Edition) x86_64 
Host: 21D0 ThinkBook 14 G4+ ARA 
Kernel: 6.8.9-300.fc40.x86_64 
Uptime: 8 hours, 18 mins 
Packages: 2796 (rpm), 36 (flatpak) 
Shell: bash 5.2.26 
Resolution: 2880x1800 
DE: GNOME 46.1 
WM: Mutter 
WM Theme: Adwaita 
Theme: Adwaita [GTK2/3] 
Icons: Adwaita [GTK2/3] 
Terminal: gnome-terminal 
CPU: AMD Ryzen 5 6600H with Radeon Graphics (12) @ 4.564GHz 
GPU: AMD ATI Radeon 680M 
Memory: 4609MiB / 13649MiB 

Python 3.10.14

Duplicate issues

Error description

See "Steps to reproduce & code" below.

Steps to reproduce & code

from paddlenlp import Taskflow
docprompt = Taskflow("document_intelligence")

Output:

Illegal instruction (core dumped)

The Python process then terminates on its own.
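"Illegal instruction" generally means the process executed a CPU instruction that the processor does not implement. The thread never establishes which instruction is involved, so the following is only a diagnostic sketch: it reads the flags line from `/proc/cpuinfo` on Linux and reports which of a few SIMD extensions are absent. The list of flags to check (`avx`, `avx2`, `avx512f`) and both function names are assumptions chosen for illustration, not something the maintainers identified.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flags listed for the first CPU in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(flags: set[str], wanted=("avx", "avx2", "avx512f")) -> list[str]:
    """List the wanted flags (an assumed, illustrative set) that the CPU lacks."""
    return [name for name in wanted if name not in flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux only
    if cpuinfo.exists():
        flags = parse_cpu_flags(cpuinfo.read_text())
        print("missing:", missing_flags(flags))
```

A missing flag does not prove it is the cause of the crash. It only narrows down which prebuilt wheels are worth suspecting.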

w5688414 commented 6 months ago

You could try downgrading paddle and paddlenlp, for example paddle 2.5.2 and paddlenlp 2.6.

wencan commented 6 months ago

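@w5688414 After switching to paddle 2.5.2 and paddlenlp 2.6, it got as far as downloading the model files, then used CPU for a while, and then hit this issue: https://github.com/PaddlePaddle/PaddleNLP/issues/3451 (a version mismatch between paddlenlp and paddleocr).

I suggest managing the version dependencies between the Paddle projects more strictly. I run into this kind of problem often.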

w5688414 commented 6 months ago

OK, thanks for the feedback. I tested the environment below and it works, so you can use it as a reference:

paddle-pipelines               0.6.2
paddle2onnx                    1.1.0
paddlefsl                      1.1.0
paddlenlp                      2.6.1
paddleocr                      2.6.1.3
paddlepaddle                   2.5.2

We will provide stable reproduction environments for legacy models later on.
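A minimal sketch for comparing a local environment against the combination reported as working above. The package names and versions are copied from that list; the helper names `installed_versions` and `diff_versions` are hypothetical, and the check only compares version strings, without verifying that the packages actually work together.

```python
from importlib import metadata

# Versions reported as working in this thread.
TESTED = {
    "paddle-pipelines": "0.6.2",
    "paddle2onnx": "1.1.0",
    "paddlefsl": "1.1.0",
    "paddlenlp": "2.6.1",
    "paddleocr": "2.6.1.3",
    "paddlepaddle": "2.5.2",
}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def diff_versions(installed, expected=TESTED):
    """Return {name: (installed, expected)} for every package that differs."""
    return {
        name: (installed.get(name), want)
        for name, want in expected.items()
        if installed.get(name) != want
    }


if __name__ == "__main__":
    for name, (have, want) in diff_versions(installed_versions(TESTED)).items():
        print(f"{name}: installed {have}, tested {want}")
```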

wencan commented 6 months ago

@w5688414 I tried paddlepaddle==2.5.2, paddleocr==2.6.1.3 and paddlenlp==2.6.1. Images work, but PDF documents do not. The PyMuPDF API used at paddleocr/ppocr/utils/utility.py:96 in paddleocr==2.6.1.3 is outdated. Only upgrading paddleocr to 2.7.3 fully resolved the PyMuPDF problem. After that I ran into https://github.com/PaddlePaddle/PaddleNLP/issues/3451, namely:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wencan/Projects/venv/paddlenlp/lib64/python3.10/site-packages/paddlenlp/taskflow/taskflow.py", line 822, in __call__
    results = self.task_instance(inputs, **kwargs)
  File "/home/wencan/Projects/venv/paddlenlp/lib64/python3.10/site-packages/paddlenlp/taskflow/task.py", line 527, in __call__
    outputs = self._run_model(inputs, **kwargs)
  File "/home/wencan/Projects/venv/paddlenlp/lib64/python3.10/site-packages/paddlenlp/taskflow/document_intelligence.py", line 119, in _run_model
    for data in data_loader:
  File "/home/wencan/Projects/venv/paddlenlp/lib64/python3.10/site-packages/paddlenlp/taskflow/utils.py", line 2152, in data_generator
    self.examples[phase] = self.ppocr2example(ocr_res, img_path, querys)
  File "/home/wencan/Projects/venv/paddlenlp/lib64/python3.10/site-packages/paddlenlp/taskflow/utils.py", line 1772, in ppocr2example
    left = min(rst[0][0][0], rst[0][3][0])
IndexError: list index out of range

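Even after trying the paddlenlp versions one by one, all the way up to 2.8.0 (the latest version at the moment), the problem is still not resolved.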
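One plausible reading of the traceback, not confirmed in this thread, is that the OCR result reaching `ppocr2example` has one more level of nesting than the code expects (for example one list per page, each holding `[box, (text, score)]` entries), so that `rst[0][3]` indexes into a two-element entry instead of a four-point box. Under that assumption, a defensive normalization step could look like the sketch below. `looks_like_line` and `flatten_ocr_result` are hypothetical helpers and are not part of PaddleNLP or PaddleOCR.

```python
def looks_like_line(item) -> bool:
    """True if item has the shape [box, (text, score)] with a 4-point box."""
    try:
        box, rec = item
        return (
            len(box) == 4
            and all(len(point) == 2 for point in box)
            and isinstance(rec[0], str)
        )
    except (TypeError, ValueError, IndexError):
        return False


def flatten_ocr_result(ocr_res):
    """Return a flat list of [box, (text, score)] entries.

    Accepts either an already flat list of lines or a list of pages,
    where each page is a list of lines (or None when nothing was detected).
    """
    if not ocr_res:
        return []
    if looks_like_line(ocr_res[0]):
        return list(ocr_res)
    lines = []
    for page in ocr_res:
        if page:  # skip None or empty pages
            lines.extend(page)
    return lines


if __name__ == "__main__":
    line = [[[0, 0], [10, 0], [10, 5], [0, 5]], ("hello", 0.99)]
    print(len(flatten_ocr_result([[line, line]])))
```

Flattening discards page boundaries, which may matter for multi-page PDFs. Treat this as a way to test the hypothesis rather than as a fix.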

w5688414 commented 6 months ago

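Thanks for the feedback. The document intelligence technology is no longer being updated, and we encourage you to use an LLM-based approach for your business problem. As for the problems caused by paddleocr version updates, contributions from developers are welcome.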

wencan commented 6 months ago

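@w5688414 As for LLMs, some online LLMs handle this task fairly well, including ERNIE Bot (文心一言). But I am processing private user data with sensitive content, so I cannot use online LLM APIs. According to evaluations, some open-source LLMs can also parse simple documents, such as a novel in txt format. But the structure of PDF is "anti-LLM". Can you recommend a good offline/local open-source LLM solution for parsing PDFs?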

w5688414 commented 6 months ago

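Extract the text from the PDF with a PDF library or paddleocr, then process it with an LLM, using prompt engineering for tasks such as summarization and entity extraction. If the results are not good enough, you can annotate some data and do LoRA fine-tuning to improve them.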

https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm
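The extract-then-prompt approach described above can be sketched roughly as follows. This assumes PyMuPDF (`pip install pymupdf`) for text extraction, which only works for PDFs with a text layer; scanned documents would need OCR instead. The model call is deliberately left out: `generate` stands for whatever local LLM interface is in use and is hypothetical. The chunk size and prompt wording are illustrative assumptions as well.

```python
def extract_pdf_text(path: str) -> str:
    """Extract plain text from every page of a PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers below work without it

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if not para.strip():
            continue
        # Hard-split any paragraph that is longer than a single chunk.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def build_prompt(chunk: str, task: str = "Summarize the excerpt in three sentences.") -> str:
    """Wrap one chunk of document text in a simple instruction prompt."""
    return f"{task}\n\nDocument excerpt:\n---\n{chunk}\n---\n\nAnswer:"


def run_pipeline(path: str, generate, task: str) -> list[str]:
    """Apply a caller-supplied local LLM callable (hypothetical) to each chunk."""
    return [generate(build_prompt(c, task)) for c in chunk_text(extract_pdf_text(path))]
```

Because everything runs locally and `generate` is supplied by the caller, no document text leaves the machine, which matches the privacy constraint raised earlier in the thread.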