Closed. GPTlei closed this issue 11 months ago.
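Error 1: zh_core_web_sm is a word-segmentation model used in NLP. You can install it together with the spacy library.
Error 1: This does not affect actual operation. The project uses spacy for text splitting by default, but if the related dependencies are not installed it automatically falls back to RecursiveCharacterTextSplitter. To use spacy, run: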
$ pip install spacy
$ python -m spacy download zh_core_web_sm
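The fallback described above can be sketched roughly as follows. This is not the project's actual code: the helper names `spacy_model_available` and `choose_splitter_name` are hypothetical, and the sketch only returns the name of the splitter that would be selected, so it runs without spacy or langchain installed. It relies on the fact that models installed with `python -m spacy download` are ordinary Python packages.

```python
import importlib.util


def spacy_model_available(model_name: str = "zh_core_web_sm") -> bool:
    """Return True if both spacy and the model package can be imported.

    Hypothetical helper: it only checks that the packages are present,
    it does not load the model.
    """
    return (
        importlib.util.find_spec("spacy") is not None
        and importlib.util.find_spec(model_name) is not None
    )


def choose_splitter_name(model_name: str = "zh_core_web_sm") -> str:
    """Pick the spacy-based splitter when possible, else the fallback."""
    if spacy_model_available(model_name):
        return "SpacyTextSplitter"
    return "RecursiveCharacterTextSplitter"


if __name__ == "__main__":
    # Prints which splitter this environment would end up with.
    print(choose_splitter_name())
```

If this prints RecursiveCharacterTextSplitter after running the two commands above, the model was probably installed into a different Python environment than the one running the server.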
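Error 2: This is a file encoding detection problem. Would you be able to provide the file as a test case?
How can I contact you? I can provide the file that triggers the error.
Email is fine: littlepanda0716@gmail.com
Also, the project was updated to v0.2.2 today. I suggest updating and then checking whether the problem still occurs.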
INFO: 127.0.0.1:46876 - "POST /knowledge_base/upload_doc HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Producer', b'PyPDF2', /'Encoding', /b'software', b'']
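I ran into a similar problem, although mine seems to be caused by the PDF structure. With the same file, the error sometimes appears and sometimes does not, and I don't know how to resolve it.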
Problem Description: The LLM is running and the knowledge base works. When importing new TXT files, the following two errors appear.
Error 1: Steps to Reproduce
page_content='《求是》杂志发表习近平总书记重要文章《全面从严治党探索出依靠党的自我革命跳出历史周期率的成功路径》\n\n来源:新华网 发布时间:2023-01-31\n\n新华社北京1月31日电 2月1日出版的第3期《求是》杂志将发表中共中央总书记、国家主 席、中央军委主席习近平的重要文章《全面从严治党探索出依靠党的自我革命跳出历史周期率的成功路径》。' metadata={'source': 'E:\ai\langchain2\langchain-ChatGLM\knowledge_base\xijinping\content\2023-01-31-《求是》杂志发表习近平总书记重要文 章《全面从严治党探索出依靠党的自我革命跳出历史周期率的成功路径》.txt'}
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 35.71it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 38.46it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 38.46it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 33.80it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 33.80it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.04it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 38.46it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.03it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 41.67it/s]
INFO: 127.0.0.1:65362 - "POST /knowledge_base/upload_doc HTTP/1.1" 200 OK
UnstructuredFileLoader [E050] Can't find model 'zh_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
This error appears for every txt file, but the import still continues with the next file.
Error 2:
INFO: 127.0.0.1:49392 - "POST /knowledge_base/upload_doc HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "E:\ai\langchain2\env\lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "E:\ai\langchain2\env\lib\site-packages\uvicorn\middleware\proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "E:\ai\langchain2\env\lib\site-packages\fastapi\applications.py", line 290, in __call__
    await super().__call__(scope, receive, send)
  File "E:\ai\langchain2\env\lib\site-packages\starlette\applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "E:\ai\langchain2\env\lib\site-packages\starlette\middleware\errors.py", line 184, in __call__
    raise exc
  File "E:\ai\langchain2\env\lib\site-packages\starlette\middleware\errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "E:\ai\langchain2\env\lib\site-packages\starlette\middleware\exceptions.py", line 79, in __call__
    raise exc
  File "E:\ai\langchain2\env\lib\site-packages\starlette\middleware\exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "E:\ai\langchain2\env\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 20, in __call__
    raise e
  File "E:\ai\langchain2\env\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "E:\ai\langchain2\env\lib\site-packages\starlette\routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "E:\ai\langchain2\env\lib\site-packages\starlette\routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "E:\ai\langchain2\env\lib\site-packages\starlette\routing.py", line 66, in app
    response = await func(request)
  File "E:\ai\langchain2\env\lib\site-packages\fastapi\routing.py", line 241, in app
    raw_response = await run_endpoint_function(
  File "E:\ai\langchain2\env\lib\site-packages\fastapi\routing.py", line 167, in run_endpoint_function
    return await dependant.call(**values)
  File "E:\ai\langchain2\langchain-ChatGLM\server\knowledge_base\kb_doc_api.py", line 58, in upload_doc
    kb.add_doc(kb_file)
  File "E:\ai\langchain2\langchain-ChatGLM\server\knowledge_base\kb_service\base.py", line 78, in add_doc
    docs = kb_file.file2text()
  File "E:\ai\langchain2\langchain-ChatGLM\server\knowledge_base\utils.py", line 120, in file2text
    docs = loader.load_and_split(text_splitter)
  File "E:\ai\langchain2\env\lib\site-packages\langchain\document_loaders\base.py", line 43, in load_and_split
    docs = self.load()
  File "E:\ai\langchain2\env\lib\site-packages\langchain\document_loaders\unstructured.py", line 86, in load
    elements = self._get_elements()
  File "E:\ai\langchain2\env\lib\site-packages\langchain\document_loaders\unstructured.py", line 171, in _get_elements
    return partition(filename=self.file_path, **self.unstructured_kwargs)
  File "E:\ai\langchain2\env\lib\site-packages\unstructured\partition\auto.py", line 249, in partition
    elements = partition_text(
  File "E:\ai\langchain2\env\lib\site-packages\unstructured\documents\elements.py", line 237, in wrapper
    elements = func(*args, **kwargs)
  File "E:\ai\langchain2\env\lib\site-packages\unstructured\file_utils\filetype.py", line 630, in wrapper
    elements = func(*args, **kwargs)
  File "E:\ai\langchain2\env\lib\site-packages\unstructured\partition\text.py", line 212, in partition_text
    encoding, file_text = read_txt_file(filename=filename, encoding=encoding)
  File "E:\ai\langchain2\env\lib\site-packages\unstructured\file_utils\encoding.py", line 123, in read_txt_file
    formatted_encoding, file_text = detect_file_encoding(filename)
  File "E:\ai\langchain2\env\lib\site-packages\unstructured\file_utils\encoding.py", line 101, in detect_file_encoding
    file_text = byte_data.decode(encoding)
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xb6 in position 10697: illegal multibyte sequence
The import is aborted. This happens with some TXT files. It looks like a gb2312 encoding problem. How should I resolve it?
Environment Information
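One possible workaround, as a minimal sketch rather than a confirmed fix: read the file's bytes yourself, try a short list of encodings, and re-save the text as UTF-8 before uploading. gb18030 is a superset of both gb2312 and GBK, so bytes that the gb2312 codec rejects often decode under it. The function name `decode_with_fallback` and the candidate order are assumptions for illustration; they are not part of unstructured or of this project.

```python
def decode_with_fallback(data: bytes, candidates=("utf-8", "gb18030")):
    """Decode bytes by trying each candidate encoding in order.

    Returns (text, encoding_used). If every candidate fails, the bytes
    are decoded as UTF-8 with undecodable bytes replaced, so the caller
    always gets a string back.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


if __name__ == "__main__":
    # "镕" is used here as an example of a character that GBK can encode;
    # it is commonly cited as missing from the smaller GB2312 set.
    sample = "镕".encode("gbk")
    text, used = decode_with_fallback(sample)
    print(used, text)
```

UTF-8 is tried first because strict UTF-8 decoding rarely succeeds by accident on GBK-encoded Chinese text, whereas gb18030 accepts a much wider range of byte sequences. After decoding, writing the text back with `encoding="utf-8"` should let the loader skip encoding detection problems for that file.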