aymenfurter / smartrag

Elevating RAG with Multi-Agent Systems

The file upload failed. #14

Open zhouyujn opened 6 hours ago

zhouyujn commented 6 hours ago

@aymenfurter The file upload failed with an HTTP 500 error. Could it be that the file never passed through Document Intelligence? PDF files upload fine, but Word and other file formats trigger the error.

[Image: error500]

2024-10-21T02:43:41.099251851Z INFO:geventwebsocket.handler:100.100.0.115 - - [2024-10-21 02:43:41] "GET /indexes/espp/files?is_restricted=false HTTP/1.1" 200 171 0.007272
2024-10-21T02:43:41.376999492Z INFO:geventwebsocket.handler:100.100.0.115 - - [2024-10-21 02:43:41] "GET /indexes HTTP/1.1" 200 169 0.045113
2024-10-21T02:43:42.110639426Z INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'https://strwxmbueydoikkg.queue.core.windows.net/indexing/messages?numofmessages=REDACTED&visibilitytimeout=REDACTED'
2024-10-21T02:43:42.110684370Z Request method: 'GET'
2024-10-21T02:43:42.110695080Z Request headers:
2024-10-21T02:43:42.110703155Z 'x-ms-version': 'REDACTED'
2024-10-21T02:43:42.110710829Z 'Accept': 'application/xml'
2024-10-21T02:43:42.110718764Z 'User-Agent': 'azsdk-python-storage-queue/12.11.0 Python/3.11.10 (Linux-5.15.164.1-1.cm2-x86_64-with-glibc2.36)'
2024-10-21T02:43:42.110726770Z 'x-ms-date': 'REDACTED'
2024-10-21T02:43:42.110734664Z 'x-ms-client-request-id': '48ae17ca-8f56-11ef-8667-3e4e57cc0722'
2024-10-21T02:43:42.110741838Z 'Authorization': 'REDACTED'
2024-10-21T02:43:42.110749071Z No body was attached to the request
2024-10-21T02:43:42.115860276Z INFO:azure.core.pipeline.policies.http_logging_policy:Response status: 200
2024-10-21T02:43:42.115884081Z Response headers:
2024-10-21T02:43:42.115894380Z 'Cache-Control': 'no-cache'
2024-10-21T02:43:42.115903427Z 'Transfer-Encoding': 'chunked'
2024-10-21T02:43:42.115911342Z 'Content-Type': 'application/xml'
2024-10-21T02:43:42.115918606Z 'Server': 'Windows-Azure-Queue/1.0 Microsoft-HTTPAPI/2.0'
2024-10-21T02:43:42.115925689Z 'x-ms-request-id': 'dec4a70e-2003-0054-0263-23f559000000'
2024-10-21T02:43:42.115933443Z 'x-ms-client-request-id': '48ae17ca-8f56-11ef-8667-3e4e57cc0722'
2024-10-21T02:43:42.115940537Z 'x-ms-version': 'REDACTED'
2024-10-21T02:43:42.115947640Z 'Date': 'Mon, 21 Oct 2024 02:43:41 GMT'
2024-10-21T02:43:45.406998738Z ERROR:root:Error getting PDF page count: EOF marker not found
2024-10-21T02:43:45.407628523Z ERROR:main:Exception on /indexes/espp/upload [POST]
2024-10-21T02:43:45.407666334Z Traceback (most recent call last):
2024-10-21T02:43:45.407677155Z File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 1473, in wsgi_app
2024-10-21T02:43:45.407685611Z response = self.full_dispatch_request()
2024-10-21T02:43:45.407693956Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407702672Z File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 882, in full_dispatch_request
2024-10-21T02:43:45.407711248Z rv = self.handle_user_exception(e)
2024-10-21T02:43:45.407718973Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407726527Z File "/usr/local/lib/python3.11/site-packages/flask_cors/extension.py", line 178, in wrapped_function
2024-10-21T02:43:45.407734732Z return cors_after_request(app.make_response(f(*args, **kwargs)))
2024-10-21T02:43:45.407741976Z ^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407749510Z File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 880, in full_dispatch_request
2024-10-21T02:43:45.407757455Z rv = self.dispatch_request()
2024-10-21T02:43:45.407765060Z ^^^^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407773124Z File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 865, in dispatch_request
2024-10-21T02:43:45.407780558Z return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return]
2024-10-21T02:43:45.407802430Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407810214Z File "/app/app/api/routes.py", line 213, in _upload_file
2024-10-21T02:43:45.407817638Z num_pages = get_pdf_page_count(file_buffer)
2024-10-21T02:43:45.407825593Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407833478Z File "/app/app/ingestion/pdf_processing.py", line 24, in get_pdf_page_count
2024-10-21T02:43:45.407841393Z reader = PdfReader(pdf_bytes)
2024-10-21T02:43:45.407849067Z ^^^^^^^^^^^^^^^^^^^^
2024-10-21T02:43:45.407857022Z File "/usr/local/lib/python3.11/site-packages/PyPDF2/_reader.py", line 319, in __init__
2024-10-21T02:43:45.407864606Z self.read(stream)
2024-10-21T02:43:45.407872281Z File "/usr/local/lib/python3.11/site-packages/PyPDF2/_reader.py", line 1415, in read
2024-10-21T02:43:45.407879564Z self._find_eof_marker(stream)
2024-10-21T02:43:45.407887118Z File "/usr/local/lib/python3.11/site-packages/PyPDF2/_reader.py", line 1471, in _find_eof_marker
2024-10-21T02:43:45.407894743Z raise PdfReadError("EOF marker not found")
2024-10-21T02:43:45.407902166Z PyPDF2.errors.PdfReadError: EOF marker not found

aymenfurter commented 3 hours ago

Thank you for raising this issue. Currently, SmartRAG can only handle PDF files, as I am post-processing them under the assumption that they are PDFs. Other file formats, such as Word documents, are not yet supported, but they may be added in the future.
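Until other formats are supported, one possible way to avoid the HTTP 500 is to reject non-PDF uploads before they reach the PDF parser. The following is only a rough sketch and not SmartRAG's actual code: `looks_like_pdf` and `check_upload` are hypothetical helper names, and the choice of status code 415 and the 1024-byte probe window are assumptions. It uses only the standard library and checks for the `%PDF-` header marker near the start of the file.

```python
import io

PDF_MAGIC = b"%PDF-"  # marker that opens the header of a PDF file


def looks_like_pdf(stream, probe_size=1024):
    """Return True if the first bytes of the stream contain the PDF header marker.

    Hypothetical helper: the stream position is restored afterwards so a real
    parser can still read the file from where it was.
    """
    start = stream.tell()
    try:
        head = stream.read(probe_size)
    finally:
        stream.seek(start)
    return PDF_MAGIC in head


def check_upload(filename, stream):
    """Return an (http_status, message) pair for an uploaded file (hypothetical)."""
    if not looks_like_pdf(stream):
        # Assumption: 415 Unsupported Media Type is a reasonable client-facing status.
        return 415, f"{filename}: only PDF uploads are supported"
    return 200, "ok"


# A .docx file is a ZIP archive, so it starts with b"PK" rather than b"%PDF-".
docx_upload = io.BytesIO(b"PK\x03\x04 remaining bytes of a Word document")
pdf_upload = io.BytesIO(b"%PDF-1.7\n remaining bytes of a PDF document")

print(check_upload("report.docx", docx_upload))
print(check_upload("report.pdf", pdf_upload))
```

A file that passes this check can still be truncated or corrupt and raise `PdfReadError` later, so the parser call would presumably still need its own error handling; the header check only filters out files that are clearly not PDFs.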