danny-avila / rag_api

ID-based RAG FastAPI: Integration with Langchain and PostgreSQL/pgvector
https://librechat.ai/

PDF failed to process #39

Closed: jschulman closed this issue 1 month ago

jschulman commented 1 month ago

Within LibreChat (from a git pull this morning, with updated .env and librechat.yaml files), I attach a PDF and submit the prompt. I get the error "An error occurred while processing your request." Here are the log files:

rag_api    | 2024-05-19 01:58:23,615 - root - INFO - Request POST http://rag_api:8000/embed - 200
rag_api    | 2024-05-19 01:58:32,233 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
rag_api    | 2024-05-19 01:58:32,353 - root - ERROR - list index out of range
rag_api    | 2024-05-19 01:58:32,353 - root - INFO - Request POST http://rag_api:8000/query - 500
LibreChat  | 2024-05-19 01:58:32 error: Error creating context: Request failed with status code 500
LibreChat  | 2024-05-19 01:58:32 error: [handleAbortError] AI response error; aborting request: Request failed with status code 500

danny-avila commented 1 month ago

can you set DEBUG_RAG_API=True in your .env file and see if you can recreate the error?
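(For anyone following along: the flag goes in the same .env file the compose stack reads. This is only a minimal sketch; the restart step assumes the default docker compose setup where the service is named rag_api.)

```env
# .env -- turn on verbose logging in the RAG API container
DEBUG_RAG_API=True
```

After changing .env, recreate the container (for example `docker compose up -d rag_api`) so the new variable is picked up; a plain `docker compose restart` does not re-read the env file.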

jschulman commented 1 month ago

I've run it through a wide variety of PDFs. There is something unique about this PDF that it doesn't like. Debug logs below. Here is the PDF metadata:

_kMDItemDisplayNameWithExtensions      = "name.pdf"
com_apple_metadata_modtime             = 735652485
kMDItemContentCreationDate             = 2024-04-24 11:54:45 +0000
kMDItemContentCreationDate_Ranking     = 2024-05-15 00:00:00 +0000
kMDItemContentModificationDate         = 2024-04-24 11:54:45 +0000
kMDItemContentType                     = "com.adobe.pdf"
kMDItemContentTypeTree                 = ( "com.adobe.pdf", "public.data", "public.item", "public.composite-content", "public.content" )
kMDItemDateAdded                       = 2024-05-15 04:06:30 +0000
kMDItemDisplayName                     = "name.pdf"
kMDItemDocumentIdentifier              = 415425
kMDItemFSContentChangeDate             = 2024-04-24 11:54:45 +0000
kMDItemFSCreationDate                  = 2024-04-24 11:54:45 +0000
kMDItemFSCreatorCode                   = ""
kMDItemFSFinderFlags                   = 0
kMDItemFSHasCustomIcon                 = (null)
kMDItemFSInvisible                     = 0
kMDItemFSIsExtensionHidden             = 0
kMDItemFSIsStationery                  = (null)
kMDItemFSLabel                         = 0
kMDItemFSName                          = "name.pdf"
kMDItemFSNodeCount                     = (null)
kMDItemFSOwnerGroupID                  = 20
kMDItemFSOwnerUserID                   = 501
kMDItemFSSize                          = 488604
kMDItemFSTypeCode                      = ""
kMDItemInterestingDate_Ranking         = 2024-05-18 00:00:00 +0000
kMDItemKind                            = "PDF document"
kMDItemLastUsedDate                    = 2024-05-18 17:52:12 +0000
kMDItemLastUsedDate_Ranking            = 2024-05-18 00:00:00 +0000
kMDItemLogicalSize                     = 488604
kMDItemPhysicalSize                    = 488604
kMDItemUseCount                        = 9
kMDItemUsedDates                       = ( "2024-05-12 05:00:00 +0000", "2024-05-18 05:00:00 +0000" )

LOGS:

rag_api    | 2024-05-19 20:00:47,000 - root - DEBUG - /query - {'id': 'x', 'username': 'x', 'provider': 'local', 'email': 'x', 'iat': x, 'exp': x}
rag_api    | 2024-05-19 20:00:47,032 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
rag_api    | 2024-05-19 20:00:47,307 - urllib3.connectionpool - DEBUG - https://openaipublic.blob.core.windows.net:443 "GET /encodings/cl100k_base.tiktoken HTTP/1.1" 200 1681126
rag_api    | 2024-05-19 20:00:47,839 - openai._base_client - DEBUG - Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create..parser at 0x14fc1c5b4c10>, 'json_data': {'input': [[1264, 5730, 553]], 'model': 'text-embedding-3-small', 'encoding_format': 'base64'}}
rag_api    | 2024-05-19 20:00:48,061 - openai._base_client - DEBUG - Sending HTTP Request: POST https://api.openai.com/v1/embeddings
rag_api    | 2024-05-19 20:00:48,062 - httpcore.connection - DEBUG - connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=None socket_options=None
rag_api    | 2024-05-19 20:00:48,360 - httpcore.connection - DEBUG - connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x14fc198fbb50>
rag_api    | 2024-05-19 20:00:48,360 - httpcore.connection - DEBUG - start_tls.started ssl_context=<ssl.SSLContext object at 0x14fc1c9f3e40> server_hostname='api.openai.com' timeout=None
rag_api    | 2024-05-19 20:00:48,381 - httpcore.connection - DEBUG - start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x14fc198fbb80>
rag_api    | 2024-05-19 20:00:48,381 - httpcore.http11 - DEBUG - send_request_headers.started request=<Request [b'POST']>
rag_api    | 2024-05-19 20:00:48,381 - httpcore.http11 - DEBUG - send_request_headers.complete
rag_api    | 2024-05-19 20:00:48,381 - httpcore.http11 - DEBUG - send_request_body.started request=<Request [b'POST']>
rag_api    | 2024-05-19 20:00:48,381 - httpcore.http11 - DEBUG - send_request_body.complete
rag_api    | 2024-05-19 20:00:48,381 - httpcore.http11 - DEBUG - receive_response_headers.started request=<Request [b'POST']>
rag_api    | 2024-05-19 20:00:48,542 - httpcore.http11 - DEBUG - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Sun, 19 May 2024 20:00:48 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-allow-origin', b''), (b'openai-model', b'text-embedding-3-small'), (b'openai-organization', b'one37'), (b'openai-processing-ms', b'25'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=15724800; includeSubDomains'), (b'x-ratelimit-limit-requests', b'5000'), (b'x-ratelimit-limit-tokens', b'5000000'), (b'x-ratelimit-remaining-requests', b'4999'), (b'x-ratelimit-remaining-tokens', b'4999996'), (b'x-ratelimit-reset-requests', b'12ms'), (b'x-ratelimit-reset-tokens', b'0s'), (b'x-request-id', b'req_x'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Set-Cookie', b'__cf_bm=x-1.0.1.1-x; path=/; expires=Sun, 19-May-24 20:30:48 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), (b'Set-Cookie', b'_cfuvid=x-0.0.1.1-x; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), (b'Server', b'cloudflare'), (b'CF-RAY', b'8866acde7cb52d4c-ORD'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; ma=86400')])
rag_api    | 2024-05-19 20:00:48,543 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
rag_api    | 2024-05-19 20:00:48,543 - httpcore.http11 - DEBUG - receive_response_body.started request=<Request [b'POST']>
rag_api    | 2024-05-19 20:00:48,543 - httpcore.http11 - DEBUG - receive_response_body.complete
rag_api    | 2024-05-19 20:00:48,543 - httpcore.http11 - DEBUG - response_closed.started
rag_api    | 2024-05-19 20:00:48,543 - httpcore.http11 - DEBUG - response_closed.complete
rag_api    | 2024-05-19 20:00:48,544 - openai._base_client - DEBUG - HTTP Response: POST https://api.openai.com/v1/embeddings "200 OK" Headers([('date', 'Sun, 19 May 2024 20:00:48 GMT'), ('content-type', 'application/json'), ('transfer-encoding', 'chunked'), ('connection', 'keep-alive'), ('access-control-allow-origin', ''), ('openai-model', 'text-embedding-3-small'), ('openai-organization', 'x'), ('openai-processing-ms', '25'), ('openai-version', '2020-10-01'), ('strict-transport-security', 'max-age=15724800; includeSubDomains'), ('x-ratelimit-limit-requests', '5000'), ('x-ratelimit-limit-tokens', '5000000'), ('x-ratelimit-remaining-requests', '4999'), ('x-ratelimit-remaining-tokens', '4999996'), ('x-ratelimit-reset-requests', '12ms'), ('x-ratelimit-reset-tokens', '0s'), ('x-request-id', 'req_7eaa45631d94004341818ccd734162c6'), ('cf-cache-status', 'DYNAMIC'), ('set-cookie', '__cf_bm=x-1.0.1.1-x; path=/; expires=Sun, 19-May-24 20:30:48 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('set-cookie', '_cfuvid=x-0.0.1.1-x; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('server', 'cloudflare'), ('cf-ray', '8866acde7cb52d4c-ORD'), ('content-encoding', 'gzip'), ('alt-svc', 'h3=":443"; ma=86400')])
rag_api    | 2024-05-19 20:00:48,544 - openai._base_client - DEBUG - request_id: req_7eaa45631d94004341818ccd734162c6
rag_api    | 2024-05-19 20:00:48,557 - root - ERROR - list index out of range
rag_api    | 2024-05-19 20:00:48,557 - root - INFO - Request POST http://rag_api:8000/query - 500
LibreChat  | 2024-05-19 20:00:48 error: Error creating context: Request failed with status code 500
LibreChat  | 2024-05-19 20:00:48 error: [handleAbortError] AI response error; aborting request: Request failed with status code 500
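
Since the failure seems specific to this one file, a quick local check is whether a Langchain PDF loader gets any text out of it at all. This is only an illustrative sketch: PyPDFLoader and the chunking values are assumptions for the example, not necessarily what rag_api itself uses.

```python
# Diagnostic sketch (assumes: pip install langchain-community pypdf langchain-text-splitters)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = PyPDFLoader("name.pdf").load()  # one Document per page
print("pages:", len(docs))
print("characters extracted:", sum(len(d.page_content) for d in docs))

# chunk_size/chunk_overlap are illustrative values, not rag_api's actual settings
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
chunks = splitter.split_documents(docs)
print("chunks produced:", len(chunks))
```

Zero extracted characters (for example a scanned, image-only PDF with no text layer) would mean nothing useful gets embedded and stored, which would be consistent with the later /query step having no results to work with.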

danny-avila commented 1 month ago

I've "fixed" this issue and it seems that MongoDB Atlas reliably produces it by not returning any results. They are now handled but mongodb integration will have to go through more extensive review.