Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Getting error while extracting text from PDF during deployment #885

Open TarunKC261 opened 10 months ago

TarunKC261 commented 10 months ago

Uploading blob for whole file -> Deep Learning.pdf
Extracting text from 'C:\CsuEnterpriseSearch/data\Introduction_to_algorithms-3rd Edition.pdf' using Azure Form Recognizer
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
Traceback (most recent call last):
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 484, in load_body
    self._content = await self.internal_response.read()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\client_reqrep.py", line 1037, in read
    self._body = await self.content.read()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\streams.py", line 375, in read
    block = await self.readany()
            ^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\streams.py", line 397, in readany
    await self._wait("readany")
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\aiohttp\streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\CsuEnterpriseSearch\scripts\prepdocs.py", line 256, in <module>
    loop.run_until_complete(main(file_strategy, azd_credential, args))
  File "C:\Users\ChoubeTK\AppData\Local\Programs\Python\Python311\Lib\asyncio\base_events.py", line 650, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\prepdocs.py", line 131, in main
    await strategy.run(search_info)
  File "C:\CsuEnterpriseSearch\scripts\prepdocslib\filestrategy.py", line 56, in run
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\prepdocslib\filestrategy.py", line 56, in <listcomp>
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\prepdocslib\pdfparser.py", line 82, in parse
    form_recognizer_results = await poller.result()
                              ^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\_async_poller.py", line 179, in result
    await self.wait()
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\_async_poller.py", line 191, in wait
    await self._polling_method.run()
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\async_base_polling.py", line 89, in run
    await self._poll()
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\async_base_polling.py", line 118, in _poll
    await self.update_status()
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\async_base_polling.py", line 140, in update_status
    self._pipeline_response = await self.request_status(self._operation.get_polling_url())
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\polling\async_base_polling.py", line 174, in request_status
    await self._client._pipeline.run(  # pylint: disable=protected-access
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 221, in run
    return await first_node.send(pipeline_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 2 more times]
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\policies\_redirect_async.py", line 73, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\policies\_retry_async.py", line 205, in send
    raise err
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\policies\_retry_async.py", line 179, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\policies\_authentication_async.py", line 94, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 69, in send
    response = await self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 3 more times]
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\_base_async.py", line 106, in send
    await self._sender.send(request.http_request, **request.context.options),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 294, in send
    await response.load_body()
  File "C:\CsuEnterpriseSearch\scripts\.venv\Lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 488, in load_body
    raise IncompleteReadError(err, error=err) from err
azure.core.exceptions.IncompleteReadError: Response payload is not completed

TarunKC261 commented 10 months ago

Extraction completes for one of the PDFs, but the second PDF fails with the error shown in the log above.

vicky002 commented 10 months ago

Hello, I'm also getting the same error.

Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object Traceback (most recent call last):

YIN-Renlong commented 9 months ago

Same problem after running ./scripts/prepdocs.sh.

In my case, it happens when ingesting PDFs with many pages (such as an entire 300-page book). The script hangs at the following message for about 5-15 minutes before the error occurs:

Extracting text from './data/demobook.pdf' using Azure Document Intelligence

Here is the full log:

Extracting text from './data/demobook.pdf' using Azure Document Intelligence
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
Traceback (most recent call last):
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/transport/_aiohttp.py", line 501, in load_body
    self._content = await self.internal_response.read()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1100, in read
    self._body = await self.content.read()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/streams.py", line 373, in read
    block = await self.readany()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/streams.py", line 395, in readany
    await self._wait("readany")
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/aiohttp/streams.py", line 302, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Applications/azure-ai-research/./scripts/prepdocs.py", line 256, in <module>
    loop.run_until_complete(main(file_strategy, azd_credential, args))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/Applications/azure-ai-research/./scripts/prepdocs.py", line 131, in main
    await strategy.run(search_info)
  File "/Applications/azure-ai-research/scripts/prepdocslib/filestrategy.py", line 56, in run
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
  File "/Applications/azure-ai-research/scripts/prepdocslib/filestrategy.py", line 56, in <listcomp>
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
  File "/Applications/azure-ai-research/scripts/prepdocslib/pdfparser.py", line 82, in parse
    form_recognizer_results = await poller.result()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/_async_poller.py", line 179, in result
    await self.wait()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/_async_poller.py", line 191, in wait
    await self._polling_method.run()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/async_base_polling.py", line 89, in run
    await self._poll()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/async_base_polling.py", line 118, in _poll
    await self.update_status()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/async_base_polling.py", line 140, in update_status
    self._pipeline_response = await self.request_status(self._operation.get_polling_url())
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/polling/async_base_polling.py", line 174, in request_status
    await self._client._pipeline.run(  # pylint: disable=protected-access
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 221, in run
    return await first_node.send(pipeline_request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  [Previous line repeated 2 more times]
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/policies/_redirect_async.py", line 73, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/policies/_retry_async.py", line 205, in send
    raise err
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/policies/_retry_async.py", line 179, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/policies/_authentication_async.py", line 94, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 69, in send
    response = await self.next.send(request)
  [Previous line repeated 3 more times]
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/_base_async.py", line 106, in send
    await self._sender.send(request.http_request, **request.context.options),
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/transport/_aiohttp.py", line 311, in send
    await response.load_body()
  File "/Applications/azure-ai-research/scripts/.venv/lib/python3.9/site-packages/azure/core/pipeline/transport/_aiohttp.py", line 505, in load_body
    raise IncompleteReadError(err, error=err) from err
azure.core.exceptions.IncompleteReadError: Response payload is not completed

Any solution? Thanks.

hammad26 commented 7 months ago

While using Azure AI Document Intelligence, I am facing a similar issue:

Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
An error occurred: (InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details:  (FailedToSerializeAnalyzeResult) Failed to serialize analyze results, please contact support.
    Code: FailedToSerializeAnalyzeResult
    Message: Failed to serialize analyze results, please contact support.

Is this issue still under consideration for resolution? Thanks

pamelafox commented 7 months ago

@hammad26 Are you able to email the PDF where you experienced the issue to pamelafox@ microsoft.com? If I can replicate the error, then I can more easily share it with the Document Intelligence team. Otherwise, please indicate the size of the PDF file that caused the error.

hammad26 commented 6 months ago

@pamelafox I have just sent you the problematic document.

pamelafox commented 6 months ago

Update: The Document Intelligence team is now investigating.

hammad26 commented 6 months ago

@pamelafox Any updates on the investigation? Thanks

El-Brabo commented 5 months ago

@pamelafox I am facing the same issue. Any updates? Many thanks in advance.

ardab commented 4 months ago

@pamelafox Same for us when processing Excel files of a certain size. Our workaround is to split each Excel file into several smaller ones.

jstrugnell commented 4 months ago

Hi. Is there any update on this issue, or workaround please? I'm hitting the same problem, with larger PDFs, which includes some of the files in the sample dataset. Interestingly I took the "role_library.pdf" document, which has 31 pages, and extracted shortened versions of the document. When the document had 20, 25 and 30 pages, the scripts would process them successfully. So it seems like, at least in the case of that document, 30 pages was the tipping point. Though I'm sure that could vary depending on the type of content on the pages. I need to work with documents much larger than this and can't just split them up into smaller documents unfortunately. Thanks.

jstrugnell commented 4 months ago

Just tried a different PDF. Worked at 30 pages, failed at 31.
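
The 30-versus-31-page observation above suggests one way to narrow the problem down without physically splitting a file: analyze it in page ranges. The helper below is a minimal sketch that only computes the range strings. The name `page_ranges` and the default of 30 pages are hypothetical, taken from the numbers reported in this thread, not from any documented service limit. Each string could then be passed as the `pages` keyword that the SDK's `begin_analyze_document` accepts. Whether that avoids the server-side failure has not been tested here.

```python
def page_ranges(total_pages: int, max_pages: int = 30) -> list[str]:
    """Return 1-based, inclusive page ranges such as "1-30", "31-60".

    max_pages=30 is an assumption based on the threshold reported in this
    thread; it is not a documented limit of the service.
    """
    if total_pages < 1 or max_pages < 1:
        raise ValueError("total_pages and max_pages must be positive")
    ranges = []
    for start in range(1, total_pages + 1, max_pages):
        end = min(start + max_pages - 1, total_pages)
        # A single trailing page is written as "31" rather than "31-31".
        ranges.append(f"{start}-{end}" if end > start else str(start))
    return ranges


# Ranges for the 31-page document described above.
print(page_ranges(31))
```

The results of the separate calls would still need to be stitched back together in page order, which this sketch does not attempt.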

jacob-roach-hike2 commented 4 months ago

I'm seeing this same exception as well when trying to parse longer documents. We've validated that we are able to parse shorter documents (both .pdf and .docx files). Is there a root cause for this issue?

Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
(InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details:      (InternalServerError) An unexpected error occurred.
        Code: InternalServerError
        Message: An unexpected error occurred.
        Target: 0

Occasionally, we also encounter a 403 error when attempting to parse longer documents. It looks like this:

Traceback (most recent call last):
  File "/home/gptadmin/Hike2/scripts/document_intelligence__scratch.py", line 17, in <module>
    parsed_content: str = parse_text_from_pdf__azure(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gptadmin/FlaskApps/SL_APP/helpers/azure_helpers.py", line 48, in parse_text_from_pdf__azure
    poller = document_intelligence_client.begin_analyze_document(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/azure/core/tracing/decorator.py", line 76, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/gptadmin/.local/lib/python3.11/site-packages/azure/ai/documentintelligence/_operations/_operations.py", line 3627, in begin_analyze_document
    raw_result = self._analyze_document_initial(  # type: ignore
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gptadmin/.local/lib/python3.11/site-packages/azure/ai/documentintelligence/_operations/_operations.py", line 518, in _analyze_document_initial
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: (403) Public access is disabled. Please configure private endpoint.
Code: 403
Message: Public access is disabled. Please configure private endpoint.

pamelafox commented 3 months ago

Hi all, if you are still having issues, please email me a document if you are able to share one (pamelafox@ microsoft .com) - the team hasn't been able to replicate it recently, so we need to figure out a way to replicate it.

jacob-roach-hike2 commented 3 months ago

> Hi all, if you are still having issues, please email me a document if you are able to share one (pamelafox@ microsoft .com) - the team hasn't been able to replicate it recently, so we need to figure out a way to replicate it.

Unfortunately, I cannot share a document (confidential). However, I can confirm that I am seeing the following error again (as of this morning).

Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
(InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details:      (InternalServerError) An unexpected error occurred.
        Code: InternalServerError
        Message: An unexpected error occurred.
        Target: 0

pbkowalski commented 3 months ago

I am regularly encountering the same problem. However, it would seem that the larger documents will sometimes work just fine and at other times throw this error - this applies also to the example documents in this repo. Typically the only solution is retrying again later... which would suggest some internal issue with Azure Document Intelligence which would be difficult to reproduce.
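
Since retrying later is the only remedy reported so far, the retry can at least be automated. The following is a minimal sketch using only the standard library. `retry_async` and `flaky_analyze` are hypothetical names, and the stand-in coroutine merely imitates a call that fails twice before succeeding. It does not address the underlying service problem.

```python
import asyncio
import random


async def retry_async(make_call, attempts=4, base_delay=2.0, retry_on=(Exception,)):
    """Await make_call() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff with jitter: about base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))


# Stand-in for the real analysis call: fails twice, then succeeds.
calls = {"count": 0}


async def flaky_analyze():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Response payload is not completed")
    return "analysis result"


result = asyncio.run(
    retry_async(flaky_analyze, base_delay=0.01, retry_on=(ConnectionError,))
)
print(result, "after", calls["count"], "calls")
```

In the ingestion script, `make_call` would presumably wrap the `begin_analyze_document` call together with `poller.result()`, and `retry_on` would list the exception types seen in the logs above, such as `azure.core.exceptions.IncompleteReadError`. Given the reports of failures lasting a full day, a few retries over seconds or minutes may not be enough.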

jacob-roach-hike2 commented 3 months ago

> I am regularly encountering the same problem. However, it would seem that the larger documents will sometimes work just fine and at other times throw this error - this applies also to the example documents in this repo. Typically the only solution is retrying again later... which would suggest some internal issue with Azure Document Intelligence which would be difficult to reproduce.

Agreed, I have had the same experience. I didn't receive this error for over a week, and then this morning, I'm seeing it again. Unfortunately, my team is using Document Intelligence in a production workflow, so we can't tolerate this sort of unpredictable downtime.

@pamelafox, when can we expect a resolution to this issue?

laneparton commented 3 months ago

Just weighing in with my own experience - yesterday I observed this issue all day (with many re-attempts of the same files - a large PDF).

Today, the same files were ingested with no issues.

jacob-roach-hike2 commented 3 months ago

> Just weighing in with my own experience - yesterday I observed this issue all day (with many re-attempts of the same files - a large PDF).
>
> Today, the same files were ingested with no issues.

This is the exact behavior that I observed as well. @pamelafox, do you have a root cause on why this might be the case?

pamelafox commented 3 months ago

Not yet, sorry! I was sent an example document to replicate earlier this week, so I will try to replicate with that today/tomorrow.

pamelafox commented 3 months ago

I was able to replicate the error from @jacob-roach-hike2 -

Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
(InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details:      (InternalServerError) An unexpected error occurred.
        Code: InternalServerError
        Message: An unexpected error occurred.
        Target: 0

I've sent the document, code, and error to the Document Intelligence team for them to hopefully replicate as well.

danielbichuetti commented 2 months ago

Hi, @pamelafox. What we've detected after plenty of internal tests is that large PDF files associated with the formula detection feature make the Document Intelligence service crash somehow. After we removed it, it started working nicely again.

PS: I'm flagging this in the sample repo because this thread describes the same issue we were facing.

Hiba13197 commented 2 months ago

Any updates regarding this issue? @pamelafox

drajinvites82 commented 2 months ago

Hi, @pamelafox

I am attaching a sample document that is creating the error for your troubleshooting purpose. Hope this is helpful.

Artificial Intelligence - A Modern Approach.pdf

benoit360l commented 1 month ago

@pamelafox I faced a similar issue with Azure Document Intelligence. The method I use to call Document Intelligence is:

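import asyncio

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential


async def _get_result_from_document_intelligence(path: str):
    # The endpoint and API key constants are settings not shown in this snippet.
    document_intelligence_client = DocumentIntelligenceClient(
        AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT, AzureKeyCredential(DOCUMENT_INTELLIGENCE_API_KEY)
    )

    # Submit the file for analysis with the prebuilt layout model.
    with open(path, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-layout", analyze_request=f, content_type="application/octet-stream"
        )

    # poller.result() blocks while polling, so run it in a worker thread.
    response = await asyncio.to_thread(poller.result)
    return response

sarbjitsg commented 1 week ago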

@pamelafox sent you the file with the same issues everyone is having above.

. . .
Ingesting 'CS-25-Amendment-27.pdf'
Extracting text from './data/CS-25-Amendment-27.pdf' using Azure Document Intelligence
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
Traceback (most recent call last):
  File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/aiohttp/client_proto.py", line 94, in connection_lost
    uncompleted = self._parser.feed_eof()
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "aiohttp/_http_parser.pyx", line 516, in aiohttp._http_parser.HttpParser.feed_eof
aiohttp.http_exceptions.ContentLengthError: 400, message: Not enough data for satisfy content length header.

. . . .