Azure-Samples / chat-with-your-data-solution-accelerator

A Solution Accelerator for the RAG pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences. This includes most common requirements and best practices.
https://azure.microsoft.com/products/search
MIT License

Error processing pdf, jpg/png files #874

Closed eosho closed 1 month ago

eosho commented 1 month ago

Describe the bug

When a PDF, JPG, or PNG file is uploaded via the admin portal, the batch_push_results function fails with a "The file is corrupted or format is unsupported" error.

Expected behavior

PDF, PNG, and JPG files should be supported: once uploaded, they are expected to be processed via Form Recognizer.

How does this bug make you feel?

Share a gif from giphy to tell us how you'd feel

:) lol

Debugging information

Steps to reproduce

Steps to reproduce the behavior:

  1. Upload a PDF, PNG, or JPG file
  2. Process the file and check the batch_push_results function for errors
  3. The following error is generated: "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."
  4. The uploaded files can be deleted, but nothing more can be done with them.
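
To rule out a genuinely corrupted upload before digging into the function logs, the file's leading bytes can be checked locally. This is only a minimal sketch: `sniff_format` and `sniff_file` are hypothetical helpers, not part of the accelerator, and the check assumes the signature sits at byte 0 (some PDF readers tolerate a few bytes of leading junk, which this would report as unknown).

```python
from typing import Optional

# Well-known leading byte signatures for the three formats in this report.
SIGNATURES = {
    "pdf": b"%PDF-",
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
}


def sniff_format(data: bytes) -> Optional[str]:
    """Return 'pdf', 'png' or 'jpeg' if the leading bytes match, else None."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only the first bytes of a local file and sniff its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8))
```

If the local copy sniffs as the expected format, the file itself is probably fine and the failure is more likely on the service side of the pipeline.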

Logs

Executing 'Functions.batch_push_results' (Reason='New queue message detected on 'doc-processing'.', Id=b9384d92--xxxx)
Python queue trigger function processed a queue item: {"filename": "Frequently Asked Questions.pdf"}
Result: Failure
Exception: ValueError: Error: Traceback (most recent call last):
  File "/home/site/wwwroot/utilities/helpers/AzureFormRecognizerHelper.py", line 78, in begin_analyze_document_from_url
    poller = self.document_analysis_client.begin_analyze_document_from_url(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/azure/core/tracing/decorator.py", line 89, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/azure/ai/formrecognizer/_document_analysis_client.py", line 198, in begin_analyze_document_from_url
    return _client_op_path.begin_analyze_document( # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/azure/core/tracing/decorator.py", line 89, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/azure/ai/formrecognizer/_generated/v2023_07_31/operations/_document_models_operations.py", line 518, in begin_analyze_document
    raw_result = self._analyze_document_initial( # type: ignore
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/azure/ai/formrecognizer/_generated/v2023_07_31/operations/_document_models_operations.py", line 443, in _analyze_document_initial
    raise HttpResponseError(response=response)
azure.core.exceptions.HttpResponseError: (InvalidRequest) Invalid request.
Code: InvalidRequest
Message: Invalid request.
Inner error: {
    "code": "InvalidContent",
    "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."
}
. Error: (InvalidRequest) Invalid request.
Code: InvalidRequest
Message: Invalid request.
Inner error: {
    "code": "InvalidContent",
    "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."
}
Stack:
  File "/azure-functions-host/workers/python/3.11/LINUX/X64/azure_functions_worker/dispatcher.py", line 545, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/azure-functions-host/workers/python/3.11/LINUX/X64/azure_functions_worker/dispatcher.py", line 826, in _run_sync_func
    return ExtensionManager.get_sync_invocation_wrapper(context,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/azure-functions-host/workers/python/3.11/LINUX/X64/azure_functions_worker/extension.py", line 215, in _raw_invocation_wrapper
    result = function(**args)
             ^^^^^^^^^^^^^^^^
  File "/home/site/wwwroot/BatchPushResults.py", line 30, in batch_push_results
    do_batch_push_results(msg)
  File "/home/site/wwwroot/BatchPushResults.py", line 47, in do_batch_push_results
    embedder.embed_file(file_sas, file_name)
  File "/home/site/wwwroot/utilities/helpers/embedders/PushEmbedder.py", line 37, in embed_file
    self.__embed(source_url=source_url, embedding_config=embedding_config)
  File "/home/site/wwwroot/utilities/helpers/embedders/PushEmbedder.py", line 46, in __embed
    documents: List[SourceDocument] = self.document_loading.load(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/site/wwwroot/utilities/helpers/DocumentLoadingHelper.py", line 17, in load
    return loader.load(document_url)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/site/wwwroot/utilities/document_loading/Layout.py", line 13, in load
    pages_content = azure_form_recognizer_client.begin_analyze_document_from_url(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/site/wwwroot/utilities/helpers/AzureFormRecognizerHelper.py", line 147, in begin_analyze_document_from_url
    raise ValueError(f"Error: {traceback.format_exc()}. Error: {e}")
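
For anyone triaging similar failures from logs, the inner error code can be pulled out of the message text. The following is a minimal sketch, assuming the message keeps the `Inner error: { ... }` shape seen in the log above and that the inner object is flat JSON with no nested braces. `extract_inner_error` is a hypothetical helper, not something the accelerator or the Azure SDK provides.

```python
import json
import re
from typing import Optional


def extract_inner_error(message: str) -> Optional[dict]:
    """Parse the first 'Inner error: {...}' JSON object found in a log message.

    Assumes a flat JSON object (no nested braces); returns None when the
    pattern is absent or the captured text is not valid JSON.
    """
    match = re.search(r"Inner error:\s*(\{.*?\})", message, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Sample text copied from the log in this issue.
sample = (
    "(InvalidRequest) Invalid request. Code: InvalidRequest "
    "Message: Invalid request. Inner error: { "
    '"code": "InvalidContent", '
    '"message": "The file is corrupted or format is unsupported. '
    'Refer to documentation for the list of supported formats." }'
)
inner = extract_inner_error(sample)
```

Here `inner["code"]` is the `InvalidContent` value from the log, which is easier to filter or alert on than the full message string.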

Tasks

To be filled in by the engineer picking up the issue

eosho commented 1 month ago

Closing this. I identified the issue as RBAC-related.