Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License
5.91k stars 4.05k forks

form recognizer begin_analyze_document timeout on large files #485

Open davidwboyd opened 1 year ago

davidwboyd commented 1 year ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Attempt to process a document of 390 or more pages

Any log messages given by the failure

    Uploading blob for page 390 -> PA - Sch 23 - Extracts from Proposal-390.pdf

    Extracting text from 'C:\Users\dboyd\Documents\DesignSpecs/data\PA - Sch 23 - Extracts from Proposal.pdf' using Azure Form Recognizer
    Traceback (most recent call last):
      File "C:\Users\dboyd\Documents\DesignSpecs\scripts\prepdocs.py", line 379, in <module>
        page_map = get_document_text(filename)
      File "C:\Users\dboyd\Documents\DesignSpecs\scripts\prepdocs.py", line 111, in get_document_text
        poller = form_recognizer_client.begin_analyze_document("prebuilt-layout", document = f)
      File "C:\Users\dboyd\Documents\DesignSpecs\scripts.venv\lib\site-packages\azure\core\tracing\decorator.py", line 76, in wrapper_use_tracer
        return func(args, kwargs)
      File "C:\Users\dboyd\Documents\DesignSpecs\scripts.venv\lib\site-packages\azure\ai\formrecognizer_document_analysis_client.py", line 126, in begin_analyze_document
        return self._client.begin_analyze_document( # type: ignore
      File "C:\Users\dboyd\Documents\DesignSpecs\scripts.venv\lib\site-packages\azure\ai\formrecognizer_generated_operations_mixin.py", line 170, in begin_analyze_document
        return mixin_instance.begin_analyze_document(model_id, pages, locale, string_index_type, analyze_request, kwargs)
      File "C:\Users\dboyd\Documents\DesignSpecs\scripts.venv\lib\site-packages\azure\core\tracing\decorator.py", line 76, in wrapper_use_tracer
        return func(args, **kwargs)
      File "C:\Users\dboyd\Documents\DesignSpecs\scripts.venv\lib\site-packages\azure\ai\formrecognizer_generated\v2022_08_31\operations_form_recognizer_client_operations.py", line 576, in begin_analyze_document
        raw_result = self._analyze_document_initial( # type: ignore
      File "C:\Users\dboyd\Documents\DesignSpecs\scripts.venv\lib\site-packages\azure\ai\formrecognizer_generated\v2022_08_31\operations_form_recognizer_client_operations.py", line 508, in _analyze_document_initial
        raise HttpResponseError(response=response)
    azure.core.exceptions.HttpResponseError: (Timeout) The operation was timeout.
    Code: Timeout
    Message: The operation was timeout.

Expected/desired behavior

Need to be able to set a longer timeout for large files in the begin_analyze_document call.

OS and Version?

Windows 10

azd version?

azd version 1.1.0 (commit ea9cb12575734ee6a5f99c4d415c1a51d6f32d3e)

Versions

Mention any other details that might be useful

The code below is what is timing out:

    with open(filename, "rb") as f:
        poller = form_recognizer_client.begin_analyze_document("prebuilt-layout", document = f)

Given that the entire bytestream of the large file has to be sent to the endpoint, this looks like a straight HTTP timeout. However, the API documentation gives no way to change the timeout for the begin_analyze_document call.

I do not believe that re-writing the example to use async IO will work as this is an endpoint timeout.



pamelafox commented 1 year ago

I also don't see anything in the docs to extend the timeout: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/formrecognizer/azure-ai-formrecognizer/README.md

You could log an issue in the azure-sdk-for-python repo about this to see if they have any feedback. However, it may just be a limitation of the underlying API. So a workaround would be to preprocess the PDF to split it into smaller documents.
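A minimal sketch of that splitting workaround, assuming the `pypdf` package is installed. The names `page_ranges` and `split_pdf` and the 100-page chunk size are hypothetical, chosen only for illustration; they are not part of this repo or the SDK, and no chunk size is known to avoid the timeout.

```python
from pathlib import Path


def page_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) page index ranges covering total_pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(source: str, out_dir: str, chunk_size: int = 100) -> list[Path]:
    """Write source as several smaller PDFs and return their paths."""
    # Imported here so page_ranges stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    parts = []
    for start, end in page_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # File names carry 1-based page numbers, e.g. "report-101-200.pdf".
        part = out / f"{Path(source).stem}-{start + 1}-{end}.pdf"
        with open(part, "wb") as f:
            writer.write(f)
        parts.append(part)
    return parts
```

Each resulting file could then be passed to `begin_analyze_document` on its own, with page numbers offset by the chunk's starting page when merging results.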

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.

mvfpoa commented 9 months ago

I am planning to build an app with Azure Document Intelligence, and while testing the capabilities of this service I also hit this issue when trying to convert a large file. It looks like this is not a priority, so perhaps I can split the PDF before sending it...

zainif commented 8 months ago

Is there any update on this? I am getting the following error when trying to analyze a 5 MB PDF:

"azure.core.exceptions.HttpResponseError: (Timeout) The operation was timeout. Code: Timeout Message: The operation was timeout."

I'd rather not have to split the document into smaller chunks beforehand. Any ideas / solutions?

felixng313 commented 8 months ago

I'm encountering the same error with the REST API.

{ "error": { "code": "Timeout", "message": "The operation was timeout." } }

alecswjo commented 7 months ago

+1.

The only solution seems to be adding more document intelligence services and splitting up the doc into smaller chunks, which isn't a great solution. Would love a timeout or parallelism functionality.
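For what it's worth, the chunk-and-parallelize approach can be sketched on the client side like this. It is only an illustration: `analyze_chunks` is a hypothetical helper, and `analyze` stands in for whatever function sends one chunk to the service and returns its page texts. Service rate limits are not accounted for.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def analyze_chunks(
    chunks: Sequence[str],
    analyze: Callable[[str], list[str]],
    max_workers: int = 4,
) -> list[tuple[int, str]]:
    """Analyze chunk files concurrently and return (page_number, text) pairs.

    `analyze` is a caller-supplied function that takes one chunk path and
    returns the text of each page in that chunk, in order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so chunks stay in document order.
        per_chunk = list(pool.map(analyze, chunks))

    pages: list[tuple[int, str]] = []
    for chunk_pages in per_chunk:
        for text in chunk_pages:
            # Renumber pages continuously across chunks, starting at 1.
            pages.append((len(pages) + 1, text))
    return pages
```

In practice `analyze` would wrap the `begin_analyze_document` call from the original report and wait on its poller; threads suit this because the work is network-bound.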

rohit-ganguly commented 7 months ago

Hi all, thanks for the feedback. I've created an issue in our Azure SDK repo and we'll investigate ASAP.

pamelafox commented 7 months ago

Is anyone on the thread able to share a PDF that resulted in a timeout? If so, please email to pamelafox at microsoft . com

felixng313 commented 7 months ago

> Is anyone on the thread able to share a PDF that resulted in a timeout? If so, please email to pamelafox at microsoft . com

@pamelafox Please check your inbox as I have sent you a sample file to reproduce this issue. Furthermore, this issue occurs when using the Markdown output format.

YalinLi0312 commented 1 week ago

> Is anyone on the thread able to share a PDF that resulted in a timeout? If so, please email to pamelafox at microsoft . com

Hi @pamelafox, just checking whether you have received any files. I tested with a 426-page PDF (16,936 KB) but couldn't reproduce the issue.

mikedizon commented 1 week ago

I'll share one tomorrow. @pamelafox

YalinLi0312 commented 1 week ago

@mikedizon can you also share it with yall@microsoft.com?

pamelafox commented 1 week ago

@YalinLi0312 I've received a few files, but had intermittent success reproducing. If you're able to reproduce as well, that'd be great.
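Since the failure is intermittent, a retry may be worth trying before splitting documents. A rough sketch, assuming the timeout is transient (which this thread has not confirmed); `retry_on_timeout` and all of its parameters are hypothetical, not part of the SDK:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_timeout(
    operation: Callable[[], T],
    is_timeout: Callable[[Exception], bool],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying with exponential backoff on timeout errors."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            # Re-raise anything that is not a timeout, or the final failure.
            if not is_timeout(error) or attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Here `operation` would open the file and run the analysis, and `is_timeout` would inspect the raised exception for the `Timeout` code shown in the traceback above.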

mikedizon commented 1 week ago

@YalinLi0312 @pamelafox curious to hear if you encountered the same issues I had with that file.