loader.load raises an error when loading data with UnstructuredWordDocumentLoader

Meteorslxy commented 5 months ago

Problem Description

In a conda env with python=3.9, after installing langchain with pip, running the following code raises an error:

  loader = UnstructuredWordDocumentLoader("***.docx", mode="elements")
  documents = loader.load()

[nltk_data] Error loading punkt: <urlopen error [Errno 99] Cannot
[nltk_data]     assign requested address>

In the same environment, however, the following code runs normally:

  loader = TextLoader('./***.txt')
  documents = loader.load()

Environment Information

langchain-ChatGLM version/commit number (e.g., v2.0.1 or commit 123456): langchain 0.1.12
Deployed with Docker (yes/no): no
Model used (ChatGLM2-6B / Qwen-7B, etc.): Qwen-14B
Embedding model used (moka-ai/m3e-base, etc.): moka-ai/m3e-base
Vector store type (faiss / milvus / pg_vector, etc.): milvus
Operating system and version: ubuntu 22.04 (run from docker)
Python version: python=3.9
Other relevant environment information:

dosubot[bot] commented 5 months ago

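Hello, @Meteorslxy! Good to see you again. I hope all has been well with you lately.

The problem you are seeing may come from the different operations and dependencies involved in handling a Word document compared with a plain text file. TextLoader probably loads the text data directly, with no complex parsing or external libraries.

In your RapidOCRDocLoader, the process involves several steps and external libraries, any of which could be the source of the error:

Parsing the document with python-docx: the loader uses python-docx to parse the Word document. The error may come from mishandling of document elements, or from a document structure that does not match what the loader expects.
Extracting images and running OCR with PIL and RapidOCR: extracting embedded images from the document and running OCR on them introduces possible failure points, especially if the OCR library meets an unsupported image format or fails to initialize.
Handling complex document structure: your loader tries to process both text and table elements in the document. The logic that walks these elements can fail, especially if the document contains complex nested structures.
Dependencies and environment: using external libraries such as PIL, numpy, and rapidocr_onnxruntime means the error may also come from those dependencies, for example installation problems, version incompatibilities, or runtime errors inside the libraries.

Without a specific error message or traceback, it is hard to pin down the exact cause. Common issues worth investigating include:

Dependency issues: make sure all required libraries are installed and compatible with one another.
Document format: verify that the Word document you are trying to load is not corrupted and is compatible with python-docx.
OCR limitations: check that the OCR step works as expected on the images extracted from your document. Some images may be unsuitable for OCR because of low resolution, complex layout, or an unsupported format.
Error handling: add more robust error handling and logging to your doc2text function to catch and diagnose any exception raised while the document is loading.

To diagnose and fix the problem, first isolate each part of the process (document parsing, image extraction, OCR) and confirm that each works on your document on its own. Then integrate them step by step, handling possible errors at each step.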
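
The isolate-then-integrate approach described above could look roughly like the sketch below. It is only an illustration: `run_stages` and the stage names are hypothetical helpers invented for this example, not part of langchain or Langchain-Chatchat, and each stage is assumed to be a plain function that takes the previous stage's output.

```python
import logging
from typing import Any, Callable, List, Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("loader_debug")


def run_stages(
    source: Any, stages: List[Tuple[str, Callable[[Any], Any]]]
) -> Tuple[Any, List[str]]:
    """Run each stage on the previous stage's output; stop at the first failure.

    Returns (last successful value, names of the stages that completed).
    """
    completed: List[str] = []
    value = source
    for name, stage in stages:
        try:
            value = stage(value)
        except Exception:
            # Log the full traceback so the failing stage is easy to identify.
            log.exception("stage %r failed", name)
            break
        completed.append(name)
    return value, completed
```

In practice the stages would be things like "parse document", "extract images", and "run OCR"; the list of completed names shows how far loading got before it failed.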

Source code reference:

document_loaders/mydocloader.py


Meteorslxy commented 5 months ago

Hello, after trying several approaches, I still get the error.

[nltk_data] Error loading punkt: <urlopen error [Errno 99] Cannot
[nltk_data]     assign requested address>

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

Trying to install punkt did not work either.
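
The LookupError means NLTK could not find `tokenizers/punkt` in any directory it searches. A small check like the one below can confirm where it is looking and whether an unpacked copy exists. This is a sketch under assumptions: `candidate_dirs` and `find_punkt` are hypothetical helper names, the fallback list (the `NLTK_DATA` environment variable plus `~/nltk_data`) is only a subset of what NLTK searches, and the check looks for an unpacked folder only.

```python
import os
from pathlib import Path
from typing import Iterable, List, Optional


def candidate_dirs() -> List[str]:
    """Directories to search: NLTK's own list if importable, else a small fallback."""
    try:
        import nltk  # optional; only used to read nltk.data.path
        return list(nltk.data.path)
    except ImportError:
        # Assumed fallback: NLTK_DATA entries plus the per-user default folder.
        dirs = [p for p in os.environ.get("NLTK_DATA", "").split(os.pathsep) if p]
        dirs.append(str(Path.home() / "nltk_data"))
        return dirs


def find_punkt(search_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first unpacked tokenizers/punkt directory found, or None."""
    for base in search_dirs:
        punkt_dir = Path(base) / "tokenizers" / "punkt"
        if punkt_dir.is_dir():
            return punkt_dir
    return None
```

Calling `find_punkt(candidate_dirs())` inside the container returns `None` when no unpacked punkt folder is present in those directories, which would match the error above.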

Meteorslxy commented 5 months ago

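It looks like a network problem. Download the [packages] folder from https://github.com/nltk/nltk_data, rename the [packages] folder to [nltk_data], and place it at [/root/nltk_data]. Finally, unzip [punkt.zip] inside the [tokenizers] folder, and it works.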
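
For anyone who wants to script these manual steps, a minimal sketch follows. It assumes the `packages` folder has already been downloaded by some other means (for example on a machine with network access and then copied over); `install_punkt` is a hypothetical helper name, and the archive is assumed to contain a top-level `punkt/` folder.

```python
import shutil
import zipfile
from pathlib import Path


def install_punkt(packages_dir: str, nltk_data_dir: str = "/root/nltk_data") -> Path:
    """Copy a downloaded `packages` folder to nltk_data_dir and unpack punkt.zip.

    packages_dir: the `packages` folder from a local copy of nltk/nltk_data.
    Returns the tokenizers directory that now holds the unpacked punkt folder.
    """
    target = Path(nltk_data_dir)
    # Copy rather than rename so the downloaded folder stays untouched.
    shutil.copytree(packages_dir, target, dirs_exist_ok=True)

    tokenizers = target / "tokenizers"
    with zipfile.ZipFile(tokenizers / "punkt.zip") as archive:
        # Extract beside the zip, assuming it contains a top-level punkt/ folder.
        archive.extractall(tokenizers)
    return tokenizers
```

A call such as `install_punkt("/path/to/packages")` (path is a placeholder) writes to `/root/nltk_data` by default, the location used above; pass a different second argument when not running as root.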

chatchat-space / Langchain-Chatchat

loader.load raises an error when loading data with UnstructuredWordDocumentLoader #3482
