PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Bug]: Document extraction fails when the PDF URL carries a signature #8611

Open 564142183 opened 2 weeks ago

564142183 commented 2 weeks ago

Software environment

- paddlepaddle:
- paddlepaddle-gpu: 2.5.2.post120
- paddlenlp: 2.8.0
- paddleocr: 2.6.1.3

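Duplicate issues

Error description

Extraction works when the file URL has no signature, but fails once the signature parameters are appended: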
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x779c0c463c90>

Steps to reproduce & code

import paddlenlp, paddleocr
print("paddlenlp:"+paddlenlp.__version__)
print("paddleocr:"+paddleocr.__version__)

from pprint import pprint
from paddlenlp import Taskflow

schema = ["开票金额是多少?", "销方开户银行是什么?", "发票号码是什么?", "开票日期是哪天?"]
ie = Taskflow("information_extraction", schema=schema, model="uie-x-base")
pprint(ie({"doc": "https://xfhs-zongdui-dev.oss-cn-beijing.aliyuncs.com/2.pdf?Expires=1718704376&OSSAccessKeyId=TMP.3KhVx59XrNtt8WjorPeXMiPnHbQYGSs1WW4no7qEUnnjeEuZcYv5RbS1sYGCxr1gELgXYrNa4d76JBhWwemPj28MUovcxu&Signature=oXCHTXFoS4LD2lDtZ4Tu7lTFTAU%3D"}))

Error message

λ 969010514d8d /PaddleNLP/test python app.py 
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
paddlenlp:2.8.0.post
paddleocr:2.6.1.3
[2024-06-18 00:53:28,019] [    INFO] - We are using <class 'paddlenlp.transformers.ernie_layout.tokenizer.ErnieLayoutTokenizer'> to load '/root/.paddlenlp/taskflow/information_extraction/uie-x-base'.
Traceback (most recent call last):
  File "/PaddleNLP/test/app.py", line 30, in <module>
    pprint(ie({"doc": "https://xfhs-zongdui-dev.oss-cn-beijing.aliyuncs.com/2.pdf?Expires=1718704376&OSSAccessKeyId=TMP.3KhVx59XrNtt8WjorPeXMiPnHbQYGSs1WW4no7qEUnnjeEuZcYv5RbS1sYGCxr1gELgXYrNa4d76JBhWwemPj28MUovcxu&Signature=oXCHTXFoS4LD2lDtZ4Tu7lTFTAU%3D"}))
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/taskflow/taskflow.py", line 822, in __call__
    results = self.task_instance(inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/taskflow/task.py", line 526, in __call__
    inputs = self._preprocess(*args)
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/taskflow/information_extraction.py", line 605, in _preprocess
    inputs = self._check_input_text(inputs)
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/taskflow/information_extraction.py", line 634, in _check_input_text
    data = self._parser_map[self._ocr_lang_choice].parse(
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/utils/doc_parser.py", line 51, in parse
    image = self.read_image(doc["doc"])
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/utils/doc_parser.py", line 203, in read_image
    _image = np.array(ImageOps.exif_transpose(Image.open(BytesIO(image_buff)).convert("RGB")))
  File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 3305, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x779c0c463c90>
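
The traceback above shows the downloaded bytes going straight to `Image.open`, so the document appears to have been treated as an image rather than a PDF. One plausible cause, which is an assumption and not confirmed in this thread, is that the file type is inferred from the end of the URL string, which stops matching `.pdf` once `?Expires=...&Signature=...` is appended. A minimal workaround sketch under that assumption: download the signed URL to a local temporary file that keeps the real extension, then pass the local path to `Taskflow`. The helper names `url_suffix` and `download_to_temp` and the `example.com` URL are hypothetical.

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlsplit


def url_suffix(url: str) -> str:
    """Return the extension of the URL path, ignoring query string and fragment."""
    return os.path.splitext(urlsplit(url).path)[1].lower()


def download_to_temp(url: str) -> str:
    """Download `url` into a temporary file that keeps the real extension.

    The caller is responsible for deleting the file afterwards.
    """
    suffix = url_suffix(url)
    with urllib.request.urlopen(url) as resp, \
            tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as out:
        shutil.copyfileobj(resp, out)
    return out.name


if __name__ == "__main__":
    # Hypothetical signed URL; replace it with a real, unexpired one.
    signed_url = "https://example.com/2.pdf?Expires=0&Signature=abc"
    print(url_suffix(signed_url))  # .pdf
    # local_path = download_to_temp(signed_url)
    # pprint(ie({"doc": local_path}))  # `ie` is the Taskflow from the report
```

Passing a local path avoids any dependence on how the URL string ends, at the cost of one extra copy on disk.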

564142183 commented 1 week ago

Steps to reproduce & code

import paddlenlp, paddleocr
print("paddlenlp:"+paddlenlp.__version__)
print("paddleocr:"+paddleocr.__version__)

from pprint import pprint
from paddlenlp import Taskflow

schema = ["开票金额是多少?", "销方开户银行是什么?", "发票号码是什么?", "开票日期是哪天?"]
docprompt = Taskflow("document_intelligence")
pprint(docprompt([{"doc": "./2.pdf", "prompt": schema}]))

Error message

λ 969010514d8d /PaddleNLP/test python app.py 
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
paddlenlp:2.8.0.post
paddleocr:2.6.1.3
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
[2024-06-18 02:58:35,695] [    INFO] - We are using (<class 'paddlenlp.transformers.ernie_layout.tokenizer.ErnieLayoutTokenizer'>, False) to load 'ernie-layoutx-base-uncased'.
[2024-06-18 02:58:36,221] [    INFO] - tokenizer config file saved in /root/.paddlenlp/models/ernie-layoutx-base-uncased/tokenizer_config.json
[2024-06-18 02:58:36,221] [    INFO] - Special tokens file saved in /root/.paddlenlp/models/ernie-layoutx-base-uncased/special_tokens_map.json
Traceback (most recent call last):
  File "/PaddleNLP/test/app.py", line 10, in <module>
    pprint(docprompt([{"doc": "./2.pdf", "prompt": schema}]))
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/taskflow/taskflow.py", line 822, in __call__
    results = self.task_instance(inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/taskflow/task.py", line 526, in __call__
    inputs = self._preprocess(*args)
  File "/usr/local/lib/python3.10/dist-packages/paddlenlp/taskflow/document_intelligence.py", line 90, in _preprocess
    ocr_result = self._ocr.ocr(example["doc"], cls=True)
  File "/usr/local/lib/python3.10/dist-packages/paddleocr/paddleocr.py", line 544, in ocr
    img = check_img(img)
  File "/usr/local/lib/python3.10/dist-packages/paddleocr/paddleocr.py", line 434, in check_img
    img, flag_gif, flag_pdf = check_and_read(image_file)
  File "/usr/local/lib/python3.10/dist-packages/paddleocr/ppocr/utils/utility.py", line 96, in check_and_read
    for pg in range(0, pdf.pageCount):
AttributeError: 'Document' object has no attribute 'pageCount'. Did you mean: 'page_count'?
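
This second error is separate from the signed-URL problem. The message itself points at the fix: the PyMuPDF `Document` object exposes `page_count`, while the installed paddleocr code still reads the older camelCase `pageCount`. It likely means the installed PyMuPDF is newer than paddleocr 2.6.1.3 expects, though the PyMuPDF version is not listed in the report, so that is an assumption. Besides aligning the two package versions, the lookup can be made tolerant of both spellings. A minimal sketch, with `get_page_count` as a hypothetical helper name:

```python
def get_page_count(pdf) -> int:
    """Return the page count of a PDF document object under either attribute name."""
    # Newer PyMuPDF releases expose `page_count`; older ones used `pageCount`.
    for name in ("page_count", "pageCount"):
        if hasattr(pdf, name):
            return getattr(pdf, name)
    # Fall back to len(), which PyMuPDF documents also support.
    return len(pdf)
```

With this helper, the failing loop in the traceback would read `for pg in range(0, get_page_count(pdf)):`. Editing an installed package in place is fragile, so treat this as a stopgap until the versions match.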

Test file

2.pdf