chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly langchain-ChatGLM): a local-knowledge-based RAG and Agent application built with Langchain and LLMs such as ChatGLM, Qwen, and Llama
Apache License 2.0
31.25k stars, 5.45k forks

[BUG] Two kinds of errors when importing TXT files into the knowledge base #1136

Closed GPTlei closed 11 months ago

GPTlei commented 1 year ago

Problem Description: The LLM is up and running, and the knowledge base works. When importing new TXT files, the following two errors occur.

Error 1: Steps to Reproduce

  1. Click '...' to add a file to the knowledge base
  2. Problem occurs
    page_content='《求是》杂志发表习近平总书记重要文章《全面从严治党探索出依靠党的自我革命跳出历史周期率的成功路径》\n\n来源:新华网 发布时间:2023-01-31\n\n新华社北京1月31日电 2月1日出版的第3期《求是》杂志将发表中共中央总书记、国家主 席、中央军委主席习近平的重要文章《全面从严治党探索出依靠党的自我革命跳出历史周期率的成功路径》。' metadata={'source': 'E:\ai\langchain2\langchain-ChatGLM\knowledge_base\xijinping\content\2023-01-31-《求是》杂志发表习近平总书记重要文 章《全面从严治党探索出依靠党的自我革命跳出历史周期率的成功路径》.txt'}
    Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 35.71it/s]
    Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 38.46it/s]
    Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 38.46it/s]
    Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 33.80it/s]
    Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 33.80it/s]
    Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.04it/s]
    Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 38.46it/s]
    Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.03it/s]
    Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 41.67it/s]
    INFO: 127.0.0.1:65362 - "POST /knowledge_base/upload_doc HTTP/1.1" 200 OK
    UnstructuredFileLoader [E050] Can't find model 'zh_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

This error appears for every TXT file, but the import continues with the next file.

Error 2:

    INFO: 127.0.0.1:49392 - "POST /knowledge_base/upload_doc HTTP/1.1" 500 Internal Server Error
    ERROR: Exception in ASGI application
    Traceback (most recent call last):
      File "E:\ai\langchain2\env\lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 408, in run_asgi
        result = await app(  # type: ignore[func-returns-value]
      File "E:\ai\langchain2\env\lib\site-packages\uvicorn\middleware\proxy_headers.py", line 84, in __call__
        return await self.app(scope, receive, send)
      File "E:\ai\langchain2\env\lib\site-packages\fastapi\applications.py", line 290, in __call__
        await super().__call__(scope, receive, send)
      File "E:\ai\langchain2\env\lib\site-packages\starlette\applications.py", line 122, in __call__
        await self.middleware_stack(scope, receive, send)
      File "E:\ai\langchain2\env\lib\site-packages\starlette\middleware\errors.py", line 184, in __call__
        raise exc
      File "E:\ai\langchain2\env\lib\site-packages\starlette\middleware\errors.py", line 162, in __call__
        await self.app(scope, receive, _send)
      File "E:\ai\langchain2\env\lib\site-packages\starlette\middleware\exceptions.py", line 79, in __call__
        raise exc
      File "E:\ai\langchain2\env\lib\site-packages\starlette\middleware\exceptions.py", line 68, in __call__
        await self.app(scope, receive, sender)
      File "E:\ai\langchain2\env\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 20, in __call__
        raise e
      File "E:\ai\langchain2\env\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 17, in __call__
        await self.app(scope, receive, send)
      File "E:\ai\langchain2\env\lib\site-packages\starlette\routing.py", line 718, in __call__
        await route.handle(scope, receive, send)
      File "E:\ai\langchain2\env\lib\site-packages\starlette\routing.py", line 276, in handle
        await self.app(scope, receive, send)
      File "E:\ai\langchain2\env\lib\site-packages\starlette\routing.py", line 66, in app
        response = await func(request)
      File "E:\ai\langchain2\env\lib\site-packages\fastapi\routing.py", line 241, in app
        raw_response = await run_endpoint_function(
      File "E:\ai\langchain2\env\lib\site-packages\fastapi\routing.py", line 167, in run_endpoint_function
        return await dependant.call(**values)
      File "E:\ai\langchain2\langchain-ChatGLM\server\knowledge_base\kb_doc_api.py", line 58, in upload_doc
        kb.add_doc(kb_file)
      File "E:\ai\langchain2\langchain-ChatGLM\server\knowledge_base\kb_service\base.py", line 78, in add_doc
        docs = kb_file.file2text()
      File "E:\ai\langchain2\langchain-ChatGLM\server\knowledge_base\utils.py", line 120, in file2text
        docs = loader.load_and_split(text_splitter)
      File "E:\ai\langchain2\env\lib\site-packages\langchain\document_loaders\base.py", line 43, in load_and_split
        docs = self.load()
      File "E:\ai\langchain2\env\lib\site-packages\langchain\document_loaders\unstructured.py", line 86, in load
        elements = self._get_elements()
      File "E:\ai\langchain2\env\lib\site-packages\langchain\document_loaders\unstructured.py", line 171, in _get_elements
        return partition(filename=self.file_path, **self.unstructured_kwargs)
      File "E:\ai\langchain2\env\lib\site-packages\unstructured\partition\auto.py", line 249, in partition
        elements = partition_text(
      File "E:\ai\langchain2\env\lib\site-packages\unstructured\documents\elements.py", line 237, in wrapper
        elements = func(*args, **kwargs)
      File "E:\ai\langchain2\env\lib\site-packages\unstructured\file_utils\filetype.py", line 630, in wrapper
        elements = func(*args, **kwargs)
      File "E:\ai\langchain2\env\lib\site-packages\unstructured\partition\text.py", line 212, in partition_text
        encoding, file_text = read_txt_file(filename=filename, encoding=encoding)
      File "E:\ai\langchain2\env\lib\site-packages\unstructured\file_utils\encoding.py", line 123, in read_txt_file
        formatted_encoding, file_text = detect_file_encoding(filename)
      File "E:\ai\langchain2\env\lib\site-packages\unstructured\file_utils\encoding.py", line 101, in detect_file_encoding
        file_text = byte_data.decode(encoding)
    UnicodeDecodeError: 'gb2312' codec can't decode byte 0xb6 in position 10697: illegal multibyte sequence

The import is interrupted. This happens only with some TXT files. It looks like a gb2312 encoding problem. How should I resolve it?
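One possible workaround, not confirmed anywhere in this thread, is to convert the affected TXT files to UTF-8 before uploading them. The sketch below assumes the files are really UTF-8 or GBK/GB18030 (GB18030 is a superset of GB2312, so characters outside GB2312 no longer fail to decode). The names `decode_text` and `convert_to_utf8` are hypothetical helpers, not part of Langchain-Chatchat or unstructured.

```python
from pathlib import Path
from typing import Sequence, Tuple

# Assumption: the files are either UTF-8 (with or without BOM) or some
# GB-family encoding. GB18030 is a superset of both GB2312 and GBK.
CANDIDATE_ENCODINGS = ("utf-8-sig", "gb18030")


def decode_text(data: bytes,
                encodings: Sequence[str] = CANDIDATE_ENCODINGS) -> Tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes strictly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def convert_to_utf8(src: Path, dst: Path) -> str:
    """Rewrite src as UTF-8 at dst and return the encoding that was detected."""
    text, enc = decode_text(src.read_bytes())
    dst.write_text(text, encoding="utf-8")
    return enc
```

Calling `convert_to_utf8(Path("in.txt"), Path("out.txt"))` and uploading `out.txt` sidesteps encoding detection for files that fit these assumptions. UTF-8 is tried first because strict UTF-8 decoding rarely succeeds by accident on GB-encoded text.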

Environment Information

Additional Information: Add any other information related to the issue.

zuoxiang95 commented 1 year ago

Error 1: zh_core_web_sm is an NLP word-segmentation model. You can install that package along with the spacy library.
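A quick way to check whether both packages are importable in the active environment is sketched below, using only the standard library. The helper name `missing_packages` is hypothetical. The printed command, `python -m spacy download zh_core_web_sm`, is spaCy's standard model download command.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # spaCy models such as zh_core_web_sm are installed as regular packages.
    missing = missing_packages(["spacy", "zh_core_web_sm"])
    if "spacy" in missing:
        print("Install spaCy first: pip install spacy")
    if "zh_core_web_sm" in missing:
        print("Then install the model: python -m spacy download zh_core_web_sm")
    if not missing:
        print("spacy and zh_core_web_sm are both importable")
```

Run it with the same Python interpreter that starts the Langchain-Chatchat server. A model installed into a different virtual environment would still trigger the E050 error.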

imClumsyPanda commented 1 year ago
GPTlei commented 1 year ago

How can I contact you? I can provide the files that trigger the errors.

imClumsyPanda commented 1 year ago

Email is fine: littlepanda0716@gmail.com

Also, the project was updated to v0.2.2 today. I suggest updating and checking whether the problem still occurs.

Banana-Basilisk commented 1 year ago

INFO: 127.0.0.1:46876 - "POST /knowledge_base/upload_doc HTTP/1.1" 500 Internal Server Error ERROR: Exception in ASGI application

pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Producer', b'PyPDF2', /'Encoding', /b'software', b'']

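I ran into a similar problem, but mine seems to be a PDF structure issue. With the same file, the error sometimes occurs and sometimes does not. I don't know how to solve it.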