dataelement / bisheng

BISHENG is an open LLM devops platform for next generation Enterprise AI applications. Powerful and comprehensive features include: GenAI workflow, RAG, Agent, Unified model management, Evaluation, SFT, Dataset Management, Enterprise-level System Management, Observability and more.
https://bisheng.dataelem.com/
Apache License 2.0
8.58k stars 1.6k forks

Parsing a knowledge base file fails; the bisheng-backend service reports an error. #732

Open SVENsuzhou opened 2 months ago

SVENsuzhou commented 2 months ago

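Version: 0.3.2.1. Deployment: docker-compose up -d. Parsing plain-text files succeeds. However, when I upload a Word file with the .doc extension, parsing fails after a few seconds, and the bisheng-backend service logs the following error: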

{"log":"[2024-07-07 21:38:10.081091] [2024-07-07 21:38:10.080849] [INFO process-9-140170971585408 bisheng.utils.http_middleware:18] - trace=88fcab02f86e4e38b4636146705db831 GET /api/v1/knowledge/file_list/4                                                                                                      \n","stream":"stdout","time":"2024-07-07T13:38:10.083205427Z"}
{"log":"[2024-07-07 21:38:10.092078] [2024-07-07 21:38:10.091910] [INFO process-9-140170971585408 bisheng.utils.http_middleware:21] - trace=88fcab02f86e4e38b4636146705db831 GET /api/v1/knowledge/file_list/4 200 timecost=11.098                                                                                  \n","stream":"stdout","time":"2024-07-07T13:38:10.093985907Z"}
{"log":"[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.\n","stream":"stderr","time":"2024-07-07T13:38:24.451004512Z"}
{"log":"[2024-07-07 21:38:24.507037] [2024-07-07 21:38:24.498674] [ERROR process-9-140169291212480 bisheng.api.services.knowledge_imp:337] - trace=89249bac13c24bc0bd86de25b459f9ee add_vectordb soffice command was not found. Please install libreoffice                                                          \n","stream":"stdout","time":"2024-07-07T13:38:24.540015902Z"}
{"log":"                             on your system and try again.                                                                                                                                                                                                                                                  \n","stream":"stdout","time":"2024-07-07T13:38:24.540047121Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.54005193Z"}
{"log":"                             - Install instructions: https://www.libreoffice.org/get-help/install-howto/                                                                                                                                                                                                    \n","stream":"stdout","time":"2024-07-07T13:38:24.54005605Z"}
{"log":"                             - Mac: https://formulae.brew.sh/cask/libreoffice                                                                                                                                                                                                                               \n","stream":"stdout","time":"2024-07-07T13:38:24.540060028Z"}
{"log":"                             - Debian: https://wiki.debian.org/LibreOfficeTraceback (most recent call last):                                                                                                                                                                                                \n","stream":"stdout","time":"2024-07-07T13:38:24.540064005Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540068914Z"}
{"log":"                               File \"/usr/local/lib/python3.10/site-packages/unstructured/partition/common.py\", line 141, in convert_office_doc                                                                                                                                                             \n","stream":"stdout","time":"2024-07-07T13:38:24.540073077Z"}
{"log":"                                 process = subprocess.Popen(                                                                                                                                                                                                                                                \n","stream":"stdout","time":"2024-07-07T13:38:24.54007712Z"}
{"log":"                                           |          -\u003e \u003cclass 'subprocess.Popen'\u003e                                                                                                                                                                                                                         \n","stream":"stdout","time":"2024-07-07T13:38:24.540081148Z"}
{"log":"                                           -\u003e \u003cmodule 'subprocess' from '/usr/local/lib/python3.10/subprocess.py'\u003e                                                                                                                                                                                          \n","stream":"stdout","time":"2024-07-07T13:38:24.540096976Z"}
{"log":"                               File \"/usr/local/lib/python3.10/subprocess.py\", line 971, in __init__                                                                                                                                                                                                        \n","stream":"stdout","time":"2024-07-07T13:38:24.540100855Z"}
{"log":"                                 self._execute_child(args, executable, preexec_fn, close_fds,                                                                                                                                                                                                               \n","stream":"stdout","time":"2024-07-07T13:38:24.540104465Z"}
{"log":"                                 |    |              |     |           |           -\u003e True                                                                                                                                                                                                                  \n","stream":"stdout","time":"2024-07-07T13:38:24.540108159Z"}
{"log":"                                 |    |              |     |           -\u003e None                                                                                                                                                                                                                              \n","stream":"stdout","time":"2024-07-07T13:38:24.54011158Z"}
{"log":"                                 |    |              |     -\u003e None                                                                                                                                                                                                                                          \n","stream":"stdout","time":"2024-07-07T13:38:24.540115094Z"}
{"log":"                                 |    |              -\u003e ['soffice', '--headless', '--convert-to', 'docx', '--outdir', '/tmp/tmpeiusnwjx', '/root/.cache/bisheng/bisheng/a8f0e30db0bfe...                                                                                                                    \n","stream":"stdout","time":"2024-07-07T13:38:24.540118647Z"}
{"log":"                                 |    -\u003e \u003cfunction Popen._execute_child at 0x7f7c18aa13f0\u003e                                                                                                                                                                                                                  \n","stream":"stdout","time":"2024-07-07T13:38:24.540122026Z"}
{"log":"                                 -\u003e \u003cPopen: returncode: 255 args: ['soffice', '--headless', '--convert-to', 'doc...\u003e                                                                                                                                                                                        \n","stream":"stdout","time":"2024-07-07T13:38:24.540125619Z"}
{"log":"                               File \"/usr/local/lib/python3.10/subprocess.py\", line 1863, in _execute_child                                                                                                                                                                                                 \n","stream":"stdout","time":"2024-07-07T13:38:24.540129103Z"}
{"log":"                                 raise child_exception_type(errno_num, err_msg, err_filename)                                                                                                                                                                                                               \n","stream":"stdout","time":"2024-07-07T13:38:24.540133047Z"}
{"log":"                                       |                    |          |        -\u003e 'soffice'                                                                                                                                                                                                                \n","stream":"stdout","time":"2024-07-07T13:38:24.540144679Z"}
{"log":"                                       |                    |          -\u003e 'No such file or directory'                                                                                                                                                                                                       \n","stream":"stdout","time":"2024-07-07T13:38:24.540148568Z"}
{"log":"                                       |                    -\u003e 2                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540152023Z"}
{"log":"                                       -\u003e \u003cclass 'OSError'\u003e                                                                                                                                                                                                                                                 \n","stream":"stdout","time":"2024-07-07T13:38:24.540155641Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540159172Z"}
{"log":"                             FileNotFoundError: [Errno 2] No such file or directory: 'soffice'                                                                                                                                                                                                              \n","stream":"stdout","time":"2024-07-07T13:38:24.540162682Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540166357Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540169781Z"}
{"log":"                             During handling of the above exception, another exception occurred:                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540173271Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540176688Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540180221Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540187371Z"}
{"log":"                             Traceback (most recent call last):                                                                                                                                                                                                                                             \n","stream":"stdout","time":"2024-07-07T13:38:24.540190958Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540194573Z"}
{"log":"                               File \"/usr/local/lib/python3.10/threading.py\", line 973, in _bootstrap                                                                                                                                                                                                       \n","stream":"stdout","time":"2024-07-07T13:38:24.540198037Z"}
{"log":"                                 self._bootstrap_inner()                                                                                                                                                                                                                                                    \n","stream":"stdout","time":"2024-07-07T13:38:24.540201737Z"}
{"log":"                                 |    -\u003e \u003cfunction Thread._bootstrap_inner at 0x7f7c18c3ce50\u003e                                                                                                                                                                                                               \n","stream":"stdout","time":"2024-07-07T13:38:24.540205165Z"}
{"log":"                                 -\u003e \u003cWorkerThread(AnyIO worker thread, started 140169291212480)\u003e                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540208675Z"}
{"log":"                               File \"/usr/local/lib/python3.10/threading.py\", line 1016, in _bootstrap_inner                                                                                                                                                                                                \n","stream":"stdout","time":"2024-07-07T13:38:24.540212145Z"}
{"log":"                                 self.run()                                                                                                                                                                                                                                                                 \n","stream":"stdout","time":"2024-07-07T13:38:24.540215775Z"}
{"log":"                                 |    -\u003e \u003cfunction WorkerThread.run at 0x7f7bb5695870\u003e                                                                                                                                                                                                                      \n","stream":"stdout","time":"2024-07-07T13:38:24.540219278Z"}
{"log":"                                 -\u003e \u003cWorkerThread(AnyIO worker thread, started 140169291212480)\u003e                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540222748Z"}
{"log":"                               File \"/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py\", line 859, in run                                                                                                                                                                                 \n","stream":"stdout","time":"2024-07-07T13:38:24.540229828Z"}
{"log":"                                 result = context.run(func, *args)                                                                                                                                                                                                                                          \n","stream":"stdout","time":"2024-07-07T13:38:24.540233518Z"}
{"log":"                                          |       |   |      -\u003e ()                                                                                                                                                                                                                                          \n","stream":"stdout","time":"2024-07-07T13:38:24.540236881Z"}
{"log":"                                          |       |   -\u003e functools.partial(\u003cfunction addEmbedding at 0x7f7bc164a950\u003e, collection_name='col_1720359330_81d47543', index_name='col_17203...                                                                                                                   \n","stream":"stdout","time":"2024-07-07T13:38:24.540240448Z"}
{"log":"                                          |       -\u003e \u003cmethod 'run' of '_contextvars.Context' objects\u003e                                                                                                                                                                                                       \n","stream":"stdout","time":"2024-07-07T13:38:24.540244275Z"}
{"log":"                                          -\u003e \u003c_contextvars.Context object at 0x7f7bbf99c200\u003e                                                                                                                                                                                                                \n","stream":"stdout","time":"2024-07-07T13:38:24.540248307Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540251878Z"}
{"log":"                             \u003e File \"/app/bisheng/api/services/knowledge_imp.py\", line 302, in addEmbedding                                                                                                                                                                                                 \n","stream":"stdout","time":"2024-07-07T13:38:24.540255429Z"}
{"log":"                                 texts, metadatas = read_chunk_text(path, knowledge_file.file_name, chunk_size,                                                                                                                                                                                             \n","stream":"stdout","time":"2024-07-07T13:38:24.540258939Z"}
{"log":"                                                    |               |     |              |          -\u003e 1000                                                                                                                                                                                                 \n","stream":"stdout","time":"2024-07-07T13:38:24.540262298Z"}
{"log":"                                                    |               |     |              -\u003e \u003csqlalchemy.orm.attributes.InstrumentedAttribute object at 0x7f7bc28d7420\u003e                                                                                                                                      \n","stream":"stdout","time":"2024-07-07T13:38:24.54026569Z"}
{"log":"                                                    |               |     -\u003e KnowledgeFile(knowledge_id=4, status=1, file_name='1.学生宿舍改造合同.doc', extra_meta=None, create_time=datetime.datetime(2024, 7, 7...                                                                                       \n","stream":"stdout","time":"2024-07-07T13:38:24.540272738Z"}
{"log":"                                                    |               -\u003e '/root/.cache/bisheng/bisheng/a8f0e30db0bfe2e5dfab4b5dba591b974648d3f4ea968b18e972d4ff727e9c21.doc'                                                                                                                                  \n","stream":"stdout","time":"2024-07-07T13:38:24.540277161Z"}
{"log":"                                                    -\u003e \u003cfunction read_chunk_text at 0x7f7bc164a9e0\u003e                                                                                                                                                                                                         \n","stream":"stdout","time":"2024-07-07T13:38:24.540280603Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540284108Z"}
{"log":"                               File \"/app/bisheng/api/services/knowledge_imp.py\", line 371, in read_chunk_text                                                                                                                                                                                              \n","stream":"stdout","time":"2024-07-07T13:38:24.54028752Z"}
{"log":"                                 documents = loader.load()                                                                                                                                                                                                                                                  \n","stream":"stdout","time":"2024-07-07T13:38:24.540290929Z"}
{"log":"                                             |      -\u003e \u003cfunction BaseLoader.load at 0x7f7bf0f68940\u003e                                                                                                                                                                                                         \n","stream":"stdout","time":"2024-07-07T13:38:24.5402943Z"}
{"log":"                                             -\u003e \u003clangchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader object at 0x7f7b90350520\u003e                                                                                                                                                \n","stream":"stdout","time":"2024-07-07T13:38:24.540298008Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540301525Z"}
{"log":"                               File \"/usr/local/lib/python3.10/site-packages/langchain_core/document_loaders/base.py\", line 29, in load                                                                                                                                                                     \n","stream":"stdout","time":"2024-07-07T13:38:24.540304897Z"}
{"log":"                                 return list(self.lazy_load())                                                                                                                                                                                                                                              \n","stream":"stdout","time":"2024-07-07T13:38:24.54030835Z"}
{"log":"                                             |    -\u003e \u003cfunction UnstructuredBaseLoader.lazy_load at 0x7f7bc5cfe5f0\u003e                                                                                                                                                                                          \n","stream":"stdout","time":"2024-07-07T13:38:24.540315451Z"}
{"log":"                                             -\u003e \u003clangchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader object at 0x7f7b90350520\u003e                                                                                                                                                \n","stream":"stdout","time":"2024-07-07T13:38:24.540319042Z"}
{"log":"                               File \"/usr/local/lib/python3.10/site-packages/langchain_community/document_loaders/unstructured.py\", line 87, in lazy_load                                                                                                                                                   \n","stream":"stdout","time":"2024-07-07T13:38:24.540322572Z"}
{"log":"                                 elements = self._get_elements()                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540326235Z"}
{"log":"                                            |    -\u003e \u003cfunction UnstructuredWordDocumentLoader._get_elements at 0x7f7bc38ec550\u003e                                                                                                                                                                               \n","stream":"stdout","time":"2024-07-07T13:38:24.540329742Z"}
{"log":"                                            -\u003e \u003clangchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader object at 0x7f7b90350520\u003e                                                                                                                                                 \n","stream":"stdout","time":"2024-07-07T13:38:24.540333241Z"}
{"log":"                               File \"/usr/local/lib/python3.10/site-packages/langchain_community/document_loaders/word_document.py\", line 120, in _get_elements                                                                                                                                             \n","stream":"stdout","time":"2024-07-07T13:38:24.540336801Z"}
{"log":"                                 return partition_doc(filename=self.file_path, **self.unstructured_kwargs)                                                                                                                                                                                                  \n","stream":"stdout","time":"2024-07-07T13:38:24.540340268Z"}
{"log":"                                        |                      |    |            |    -\u003e {}                                                                                                                                                                                                                 \n","stream":"stdout","time":"2024-07-07T13:38:24.540343603Z"}
{"log":"                                        |                      |    |            -\u003e \u003clangchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader object at 0x7f7b90350520\u003e                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540347241Z"}
{"log":"                                        |                      |    -\u003e '/root/.cache/bisheng/bisheng/a8f0e30db0bfe2e5dfab4b5dba591b974648d3f4ea968b18e972d4ff727e9c21.doc'                                                                                                                                  \n","stream":"stdout","time":"2024-07-07T13:38:24.540353935Z"}
{"log":"                                        |                      -\u003e \u003clangchain_community.document_loaders.word_document.UnstructuredWordDocumentLoader object at 0x7f7b90350520\u003e                                                                                                                              \n","stream":"stdout","time":"2024-07-07T13:38:24.540357513Z"}
{"log":"                                        -\u003e \u003cfunction partition_doc at 0x7f7b47cf1240\u003e                                                                                                                                                                                                                       \n","stream":"stdout","time":"2024-07-07T13:38:24.540361172Z"}
{"log":"                               File \"/usr/local/lib/python3.10/site-packages/unstructured/documents/elements.py\", line 138, in wrapper                                                                                                                                                                      \n","stream":"stdout","time":"2024-07-07T13:38:24.540364647Z"}
{"log":"                                 elements = func(*args, **kwargs)                                                                                                                                                                                                                                           \n","stream":"stdout","time":"2024-07-07T13:38:24.540368095Z"}
{"log":"                                            |     |       -\u003e {'filename': '/root/.cache/bisheng/bisheng/a8f0e30db0bfe2e5dfab4b5dba591b974648d3f4ea968b18e972d4ff727e9c21.doc'}                                                                                                                              \n","stream":"stdout","time":"2024-07-07T13:38:24.540380099Z"}
{"log":"                                            |     -\u003e ()                                                                                                                                                                                                                                                     \n","stream":"stdout","time":"2024-07-07T13:38:24.540383836Z"}
{"log":"                                            -\u003e \u003cfunction partition_doc at 0x7f7b47cf1cf0\u003e                                                                                                                                                                                                                   \n","stream":"stdout","time":"2024-07-07T13:38:24.540387493Z"}
{"log":"                               File \"/usr/local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py\", line 519, in wrapper                                                                                                                                                                     \n","stream":"stdout","time":"2024-07-07T13:38:24.540391095Z"}
{"log":"                                 elements = func(*args, **kwargs)                                                                                                                                                                                                                                           \n","stream":"stdout","time":"2024-07-07T13:38:24.540394658Z"}
{"log":"                                            |     |       -\u003e {'filename': '/root/.cache/bisheng/bisheng/a8f0e30db0bfe2e5dfab4b5dba591b974648d3f4ea968b18e972d4ff727e9c21.doc'}                                                                                                                              \n","stream":"stdout","time":"2024-07-07T13:38:24.540398163Z"}
{"log":"                                            |     -\u003e ()                                                                                                                                                                                                                                                     \n","stream":"stdout","time":"2024-07-07T13:38:24.540405163Z"}
{"log":"                                            -\u003e \u003cfunction partition_doc at 0x7f7b47cf12d0\u003e                                                                                                                                                                                                                   \n","stream":"stdout","time":"2024-07-07T13:38:24.540408668Z"}
{"log":"                               File \"/usr/local/lib/python3.10/site-packages/unstructured/partition/doc.py\", line 49, in partition_doc                                                                                                                                                                      \n","stream":"stdout","time":"2024-07-07T13:38:24.540412369Z"}
{"log":"                                 convert_office_doc(filename, tmpdir, target_format=\"docx\")                                                                                                                                                                                                                 \n","stream":"stdout","time":"2024-07-07T13:38:24.540415776Z"}
{"log":"                                 |                  |         -\u003e '/tmp/tmpeiusnwjx'                                                                                                                                                                                                                         \n","stream":"stdout","time":"2024-07-07T13:38:24.540419285Z"}
{"log":"                                 |                  -\u003e '/root/.cache/bisheng/bisheng/a8f0e30db0bfe2e5dfab4b5dba591b974648d3f4ea968b18e972d4ff727e9c21.doc'                                                                                                                                                  \n","stream":"stdout","time":"2024-07-07T13:38:24.540422666Z"}
{"log":"                                 -\u003e \u003cfunction convert_office_doc at 0x7f7b91f17640\u003e                                                                                                                                                                                                                         \n","stream":"stdout","time":"2024-07-07T13:38:24.540426061Z"}
{"log":"                               File \"/usr/local/lib/python3.10/site-packages/unstructured/partition/common.py\", line 148, in convert_office_doc                                                                                                                                                             \n","stream":"stdout","time":"2024-07-07T13:38:24.540429526Z"}
{"log":"                                 raise FileNotFoundError(                                                                                                                                                                                                                                                   \n","stream":"stdout","time":"2024-07-07T13:38:24.540432912Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.54043651Z"}
{"log":"                             FileNotFoundError: soffice command was not found. Please install libreoffice                                                                                                                                                                                                   \n","stream":"stdout","time":"2024-07-07T13:38:24.540439892Z"}
{"log":"                             on your system and try again.                                                                                                                                                                                                                                                  \n","stream":"stdout","time":"2024-07-07T13:38:24.540450735Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540454222Z"}
{"log":"                             - Install instructions: https://www.libreoffice.org/get-help/install-howto/                                                                                                                                                                                                    \n","stream":"stdout","time":"2024-07-07T13:38:24.540457583Z"}
{"log":"                             - Mac: https://formulae.brew.sh/cask/libreoffice                                                                                                                                                                                                                               \n","stream":"stdout","time":"2024-07-07T13:38:24.54046094Z"}
{"log":"                             - Debian: https://wiki.debian.org/LibreOffice                                                                                                                                                                                                                                  \n","stream":"stdout","time":"2024-07-07T13:38:24.540464322Z"}
{"log":"                                                                                                                                                                                                                                                                                                            \n","stream":"stdout","time":"2024-07-07T13:38:24.540467955Z"}
yaojin3616 commented 2 months ago

The backend image does not ship with LibreOffice, so you need to install it manually: apt-get install -y libreoffice
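
To confirm the fix took effect, a quick check like the following can be run with the Python interpreter inside the backend container. It is only a minimal sketch: `soffice_available` and `ensure_doc_support` are hypothetical helper names, not part of bisheng or unstructured, and the only library call relied on is the standard `shutil.which`. The assumption, based on the traceback above, is that `.doc` parsing fails only because the `soffice` executable is not on PATH.

```python
import shutil


def soffice_available(command: str = "soffice") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(command) is not None


def ensure_doc_support(filename: str, command: str = "soffice") -> None:
    """Raise early for .doc files when the converter binary is missing.

    Hypothetical helper for illustration; not part of bisheng or unstructured.
    """
    # .docx does not end with ".doc", so only legacy Word files are checked.
    if filename.lower().endswith(".doc") and not soffice_available(command):
        raise RuntimeError(
            f"{command!r} not found on PATH; install LibreOffice "
            "(e.g. apt-get install -y libreoffice) before parsing .doc files"
        )


if __name__ == "__main__":
    print("soffice on PATH:", soffice_available())
```

If this reports that `soffice` is missing after running the install command, the package was probably installed in a different container than the one serving the backend. Note also that packages installed with apt-get inside a running container are lost when the container is recreated, so the install may need repeating after a `docker-compose` rebuild.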