[BUG] ERROR: OSError: error loading document from file samples/XXX: cannot find loader for this WMF file

lhtpluto commented 6 months ago

Problem Description

The error occurs when a file uploaded to the knowledge base is added to the vector store.

The file is V12.2.docx, about 36 MB.

dosubot's answer attributes the error to a single image in the document, but in practice the knowledge-base Q&A returns no content from this document at all.

Actual Result

2024-02-16 00:58:46,804 - utils.py[line:295] - INFO: RapidOCRDocLoader used for /home/test/Langchain-Chatchat/knowledge_base/samples/content/V12.2.docx
RapidOCRDocLoader block index: 352: 46%|█████████████████████████████████████████████████████████████████████████████████████████████▊ | 352/765 [00:01<00:01, 363.24it/s]
2024-02-16 00:58:48,594 - utils.py[line:377] - ERROR: OSError: 从文件 samples/V12.2.docx 加载文档时出错：cannot find loader for this WMF file
RapidOCRDocLoader block index: 352: 46%|█████████████████████████████████████████████████████████████████████████████████████████████▊ | 352/765 [00:01<00:01, 254.00it/s]
2024-02-16 00:58:48,713 - faiss_cache.py[line:38] - INFO: 已将向量库 ('samples', 'bge-large-zh-v1.5') 保存到磁盘

Environment Information

langchain-ChatGLM version/commit: publish 0.2.10
Deployed with Docker (yes/no): no
Model used (ChatGLM2-6B / Qwen-7B, etc.): ChatGLM3-6B / Qwen-14B-Chat-int4
Embedding model used (moka-ai/m3e-base, etc.): bge-large-zh-v1.5
Vector store type (faiss / milvus / pg_vector, etc.): default
Operating system and version: Windows 11 with WSL2, and Ubuntu 22.04.3 LTS; the problem occurs on both Windows and Linux
Python version: 3.11
Other relevant environment information:

Additional Information

Operating system: Linux-6.5.0-17-generic-x86_64-with-glibc2.35
Python version: 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0]
Project version: v0.2.10
langchain version: 0.0.354
fastchat version: 0.2.35

Text splitter in use: ChineseRecursiveTextSplitter
Running LLM models: ['Qwen1.5-14B-Chat-GPTQ-Int4'] @ cuda
{'device': 'cuda', 'host': '0.0.0.0', 'infer_turbo': False, 'model_path': 'Qwen1.5-14B-Chat-GPTQ-Int4', 'model_path_exists': True, 'port': 20002}
Embeddings model: bge-large-zh-v1.5 @ cuda

Server runtime information:
OpenAI API Server: http://127.0.0.1:20000/v1
Chatchat API Server: http://127.0.0.1:7861
Chatchat WEBUI Server: http://0.0.0.0:80

dosubot[bot] commented 6 months ago

🤖

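Hey @lhtpluto! Good to see you again. I hope all is well!

The OSError "cannot find loader for this WMF file" may be raised because the Python Imaging Library (PIL) used in the RapidOCRDocLoader class does not support Windows Metafile Format (WMF).

In the code, the RapidOCRDocLoader class tries to open an image from the .docx file with PIL's Image.open() function. If the image is in WMF format, PIL cannot open it and raises an OSError.

Here is the relevant code: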

if isinstance(part, ImagePart):
    image = Image.open(BytesIO(part._blob))
    result, _ = ocr(np.array(image))
    if result:
        ocr_result = [line[1] for line in result]
        resp += "\n".join(ocr_result)

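To work around this, you can convert WMF images to a format PIL can handle, such as PNG or JPEG, before trying to open them with Image.open(). Alternatively, you can use a library that supports WMF, such as the Python library PyWMF.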
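
As a rough sketch of the "convert first" route: one option is to shell out to an external converter. The example below assumes LibreOffice is installed and exposes the `soffice` binary; the helper names `build_convert_command` and `convert_wmf_to_png` are hypothetical and are not part of Langchain-Chatchat. It has not been tried against the document from this issue.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_convert_command(src: Path, out_dir: Path, binary: str = "soffice") -> List[str]:
    """Build a headless LibreOffice command that converts `src` to PNG."""
    return [binary, "--headless", "--convert-to", "png",
            "--outdir", str(out_dir), str(src)]


def convert_wmf_to_png(src: Path, out_dir: Path) -> Optional[Path]:
    """Convert a WMF file to PNG; return the PNG path, or None if unavailable."""
    binary = shutil.which("soffice")  # assumption: LibreOffice is on PATH
    if binary is None:
        return None  # no converter installed, caller should skip the image
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(src, out_dir, binary),
                   check=True, timeout=120)
    target = out_dir / (src.stem + ".png")
    return target if target.exists() else None


command = build_convert_command(Path("image1.wmf"), Path("converted"))
print(command)
```

The resulting PNG could then be passed to Image.open() as usual. If no converter is available, skipping the image is the simpler fallback.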

Source code:

document_loaders/mydocloader.py

If you have any other questions or need further help, feel free to let me know.

#### About Dosu

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.

charleybin commented 6 months ago

The same error occurred.

SongJayJay commented 5 months ago

I also get "cannot find loader for this WMF file" when uploading pptx and docx files.

YiLin198 commented 5 months ago

same error for a docx file

MrL-Leandro commented 4 months ago


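Wrap the code dosubot quoted (the Image.open call in document_loaders/mydocloader.py) in a try/except and simply skip WMF images. That is enough to get it running.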
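
A narrower variant of the same idea, as a sketch only: instead of catching every exception, check the image bytes for a metafile signature and skip just those. The function name `sniff_metafile` is hypothetical. The signatures are the placeable-WMF key and the EMF header marker as I understand them; a WMF without the placeable header would not be detected, so this complements the try/except rather than replacing it.

```python
from typing import Optional

PLACEABLE_WMF_KEY = b"\xd7\xcd\xc6\x9a"  # first 4 bytes of a placeable WMF
EMF_RECORD_TYPE = b"\x01\x00\x00\x00"    # first record type in an EMF file
EMF_SIGNATURE = b" EMF"                  # expected at byte offset 40


def sniff_metafile(blob: bytes) -> Optional[str]:
    """Return "wmf" or "emf" for Windows metafiles, None for anything else."""
    if blob[:4] == PLACEABLE_WMF_KEY:
        return "wmf"
    if blob[:4] == EMF_RECORD_TYPE and blob[40:44] == EMF_SIGNATURE:
        return "emf"
    return None


# Synthetic headers for illustration; these are not real image files.
png_header = b"\x89PNG\r\n\x1a\n" + b"\x00" * 40
wmf_header = PLACEABLE_WMF_KEY + b"\x00" * 40
emf_header = EMF_RECORD_TYPE + b"\x00" * 36 + EMF_SIGNATURE

for name, blob in [("png", png_header), ("wmf", wmf_header), ("emf", emf_header)]:
    print(name, "->", sniff_metafile(blob))
```

In the loader, a check like this would run on `part._blob` before Image.open(), and the image would be skipped when the result is not None.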

tehcia000 commented 3 months ago

Could someone post the code that fixes this? The code I added has no effect.

MrL-Leandro commented 3 months ago

if isinstance(part, ImagePart):
    try:
        image = Image.open(BytesIO(part._blob))
        result, _ = ocr(np.array(image))
        if result:
            ocr_result = [line[1] for line in result]
            resp += "\n".join(ocr_result)
    except Exception as e:
        # Skip images PIL cannot open (e.g. WMF) instead of failing the whole document
        print(f"Image could not be recognized: {e}")

document_loaders\mydocloader.py, line 49
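
To show why the try/except matters, here is a small self-contained sketch of the pattern. Judging by the log in the original report, a single unreadable image appears to abort the whole document; handling each image separately keeps the text from the others. `ocr_all_images` and `fake_recognize` are hypothetical names, and `fake_recognize` only stands in for the Image.open() + ocr() pair so the example runs without Pillow or RapidOCR.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def ocr_all_images(
    blobs: Iterable[bytes],
    recognize: Callable[[bytes], List[str]],
) -> Tuple[str, int]:
    """OCR each image independently; return (joined text, number skipped)."""
    lines: List[str] = []
    skipped = 0
    for index, blob in enumerate(blobs):
        try:
            lines.extend(recognize(blob))
        except OSError as exc:  # the error type shown in the log above
            skipped += 1
            logger.warning("skipping image %d: %s", index, exc)
    return "\n".join(lines), skipped


def fake_recognize(blob: bytes) -> List[str]:
    """Stand-in recognizer: fails on a WMF-like blob, echoes anything else."""
    if blob.startswith(b"\xd7\xcd\xc6\x9a"):
        raise OSError("cannot find loader for this WMF file")
    return [blob.decode("ascii")]


text, skipped = ocr_all_images(
    [b"first", b"\xd7\xcd\xc6\x9a", b"second"], fake_recognize
)
print(text)
print(skipped)
```

Catching OSError rather than every Exception is a design choice here: it still covers the WMF failure while letting unrelated bugs surface. Counting the skipped images also makes it visible when part of a document was dropped.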


tehcia000 commented 3 months ago


Got it, thanks!

billfelix commented 3 months ago

Received, thank you for your message. I will reply to you promptly! Best wishes!

chatchat-space / Langchain-Chatchat

[BUG] ERROR: OSError: error loading document from file samples/XXX: cannot find loader for this WMF file #2981
