PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Bug]: In the pipelines semantic search system, scanned PDF files uploaded after startup cannot be parsed #8418

Closed morego123 closed 4 months ago

morego123 commented 6 months ago

Software environment

paddle-pipelines               0.6.2
paddle2onnx                    1.2.1
paddlefsl                      1.1.0
paddlenlp                      2.8.0
paddleocr                      2.7.3
paddlepaddle-gpu               2.6.0.post117

Duplicate issues

Error description

INFO:     127.0.0.1:43132 - "POST /file-upload HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/pipelines/base.py", line 446, in run
    node_output, stream_id = self.graph.nodes[node_id]["component"]._dispatch_run(**node_input)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/base.py", line 120, in _dispatch_run
    return self._dispatch_run_general(self.run, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/base.py", line 164, in _dispatch_run_general
    output, stream = run_method(**run_inputs, **run_params)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/base.py", line 144, in run
    output, stream = run_indexing(documents=documents, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/base.py", line 110, in wrapper
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/base.py", line 229, in run_indexing
    embeddings = self.embed_documents(document_objects, **kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/dense.py", line 367, in embed_documents
    embeddings = self._get_predictions(passages, **kwargs)["passages"]
  File "/usr/local/lib/python3.10/dist-packages/paddle_pipelines-0.6.2-py3.10.egg/pipelines/nodes/retriever/dense.py", line 292, in _get_predictions
    if "passages" in dicts[0]:
IndexError: list index out of range
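
The last frame fails on `dicts[0]`, which raises `IndexError` whenever `dicts` is an empty list. A plausible reading, not confirmed in this thread, is that no text was extracted from the scanned PDF, so there were no passages to embed. The sketch below only illustrates that failure mode and one possible guard. The function names `unguarded_check` and `has_passages` are hypothetical and are not part of paddle-pipelines; only the expression `"passages" in dicts[0]` comes from the traceback.

```python
from typing import Dict, List


def unguarded_check(dicts: List[Dict[str, str]]) -> bool:
    # Same expression as the failing line in the traceback.
    # Raises IndexError when dicts is empty.
    return "passages" in dicts[0]


def has_passages(dicts: List[Dict[str, str]]) -> bool:
    # Hypothetical guarded variant: an empty list means there is
    # nothing to embed, so report False instead of indexing into it.
    if not dicts:
        return False
    return "passages" in dicts[0]


if __name__ == "__main__":
    print(has_passages([]))
    print(has_passages([{"passages": "some extracted text"}]))
    try:
        unguarded_check([])
    except IndexError as exc:
        print(f"unguarded check failed: {exc}")
```

A guard like this would turn the 500 response into a case the caller can handle explicitly, but it would not make scanned PDFs searchable.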

Steps to reproduce & code

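In the web UI, uploading a scanned PDF through the file upload module on the left fails: the file cannot be parsed. Uploading a non-scanned PDF works normally.
Is parsing scanned PDFs simply not supported by this repo, or am I missing a component?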

w5688414 commented 6 months ago

Hello, scanned PDFs are not currently supported. Contributions from developers are welcome.
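
Until scanned PDFs are supported, one workaround on the caller's side is to detect files with no usable text layer before they reach the indexing pipeline. The following is a minimal sketch under stated assumptions: the helper `looks_scanned` and its `min_chars` threshold are hypothetical, and `page_texts` is assumed to be the per-page text returned by whichever PDF text extractor is already in use. It is not part of PaddleNLP or paddle-pipelines.

```python
from typing import Iterable


def looks_scanned(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    # Hypothetical heuristic: count non-whitespace-trimmed characters
    # across all pages. Image-only (scanned) PDFs typically yield
    # little or no text from a plain text extractor.
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars


if __name__ == "__main__":
    # Pages with no text layer, as a scanned document would produce.
    print(looks_scanned(["", "  \n"]))
    # A page with a real text layer.
    print(looks_scanned(["This page has an embedded text layer."]))
```

Files flagged this way could be rejected with a clear message, or routed to an OCR step first, instead of failing later with `IndexError`. The threshold is arbitrary and would need tuning for real documents.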

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.