Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License
5.71k stars · 3.85k forks

ResourceNotFoundError in predocs phase #1528

Open philipho11 opened 3 months ago

philipho11 commented 3 months ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [X] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

azd provision (and all resources are created fine, i think)

Any log messages given by the failure

Packaging services (azd package)

(✓) Done: Packaging service backend

SUCCESS: Your application was packaged for Azure in 51 seconds.
Checking if authentication should be setup...
Loading azd .env file from current environment...
AZURE_USE_AUTHENTICATION is set, proceeding with authentication setup...
Creating Python virtual environment "app/backend/.venv"...
Installing dependencies from "requirements.txt" into virtual environment (in quiet mode)...
Not setting up authentication.

Provisioning Azure resources (azd provision) Provisioning Azure resources can take some time.

Subscription: icdev (bb369bdb-6d2a-483a-ac31-fa61b10cacfa) Location: East Asia

(-) Skipped: Didn't find new changes.
Loading azd .env file from current environment...
Creating Python virtual environment "app/backend/.venv"...
Installing dependencies from "requirements.txt" into virtual environment (in quiet mode)...
Not updating authentication.
Loading azd .env file from current environment...
Creating Python virtual environment "app/backend/.venv"...
Installing dependencies from "requirements.txt" into virtual environment (in quiet mode)...
Running "prepdocs.py"
Using local files: ./data/
Ensuring search index gptkbindex exists
Search index gptkbindex already exists
Skipping ./data/employee_handbook.pdf, no changes detected.
Ingesting '2190.json'
Splitting '2190.json' into sections
Uploading blob for whole file -> 2190.json
Computed embeddings in batch. Batch size: 2, Token count: 303
Ingesting 'query.json'
Splitting 'query.json' into sections
Uploading blob for whole file -> query.json
Computed embeddings in batch. Batch size: 7, Token count: 2199
Ingesting '2192.json'
Splitting '2192.json' into sections
Uploading blob for whole file -> 2192.json
Computed embeddings in batch. Batch size: 2, Token count: 375
Ingesting '2191.json'
Splitting '2191.json' into sections
Uploading blob for whole file -> 2191.json
Computed embeddings in batch. Batch size: 2, Token count: 418
Ingesting '2189.json'
Splitting '2189.json' into sections
Uploading blob for whole file -> 2189.json
Computed embeddings in batch. Batch size: 1, Token count: 205
Ingesting 'Benefit_Options.pdf'
Extracting text from './data/Benefit_Options.pdf' using Azure Document Intelligence

Traceback (most recent call last):
  File "/workspaces/azure-search-openai-demo/./app/backend/prepdocs.py", line 494, in <module>
    loop.run_until_complete(main(ingestion_strategy, setup_index=not args.remove and not args.removeall))
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
  File "/workspaces/azure-search-openai-demo/./app/backend/prepdocs.py", line 225, in main
    await strategy.run()
  File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/filestrategy.py", line 84, in run
    sections = await parse_file(file, self.file_processors, self.category, self.image_embeddings)
  File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/filestrategy.py", line 26, in parse_file
    pages = [page async for page in processor.parser.parse(content=file.content)]
  File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/filestrategy.py", line 26, in <listcomp>
    pages = [page async for page in processor.parser.parse(content=file.content)]
  File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/pdfparser.py", line 54, in parse
    poller = await document_intelligence_client.begin_analyze_document(
  File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/azure/core/tracing/decorator_async.py", line 77, in wrapper_use_tracer
    return await func(*args, **kwargs)
  File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/azure/ai/documentintelligence/aio/_operations/_operations.py", line 3241, in begin_analyze_document
    raw_result = await self._analyze_document_initial(  # type: ignore
  File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/azure/ai/documentintelligence/aio/_operations/_operations.py", line 130, in _analyze_document_initial
    map_error(status_code=response.status_code, response=response, error_map=error_map)
  File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/azure/core/exceptions.py", line 164, in map_error
    raise error
azure.core.exceptions.ResourceNotFoundError: (404) Resource not found
Code: 404
Message: Resource not found

ERROR: failed running post hooks: 'postprovision' hook failed with exit code: '1', Path: '/tmp/azd-postprovision-1942051034.sh'. : exit code: 1

ERROR: error executing step command 'provision': failed running post hooks: 'postprovision' hook failed with exit code: '1', Path: '/tmp/azd-postprovision-1942051034.sh'. : exit code: 1

Expected/desired behavior

OS and Version?

GitHub Codespaces with Python 3.11

azd version?

azd version 1.8.0 (commit 8246323c2472148288be4b3cbc3c424bd046b985)

Versions

main branch as of April 12, 2024.

Mention any other details that might be useful



Caden-Ertel commented 3 months ago

Make sure your Document Intelligence service is in one of these regions: westus2, eastus, or westeurope.
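You can sanity-check the region before running prepdocs. A minimal sketch (`check_region` is a hypothetical helper, not part of this repo, and the region list is hardcoded from the comment above, not from any official SDK constant):

```python
# Hypothetical helper, not part of this repo: the region list below comes
# from the comment above, not from an official SDK constant.
DOC_INTELLIGENCE_REGIONS = {"westus2", "eastus", "westeurope"}


def check_region(location: str) -> bool:
    """Return True if the newer Document Intelligence API is expected to work there."""
    return location.lower().replace(" ", "") in DOC_INTELLIGENCE_REGIONS


print(check_region("East Asia"))  # the reporter's region -> False
print(check_region("eastus"))     # -> True
```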

jimmylevell commented 2 months ago

In my case the issue only occurs with the latest azure-ai-documentintelligence package when private endpoints are present. Using the older azure.ai.formrecognizer package seems to solve the issue.

pamelafox commented 2 months ago

@jimmylevell You say you're using private endpoints? Did you add those manually in the Portal? We have a PR which adds support for private endpoints, but it's not yet in main. We're still testing that, so I don't know if we've seen issues with using Document Intelligence. FYI to @mattgotteiner who's working on that PR.

jimmylevell commented 2 months ago

@pamelafox thank you for your fast reply. Due to internal policies, we needed to deploy the solution manually within Azure. In this process, all Azure resources were configured with private endpoints. The solution works as expected within our tenant (Azure Switzerland North). The only change we needed to introduce was reverting to an older Form Recognizer dependency in pdfparser.py:

- from azure.ai.documentintelligence.aio import DocumentIntelligenceClient
+ from azure.ai.formrecognizer.aio import DocumentAnalysisClient

The issue can also be illustrated using the following demo code:

Working Sample

Based on Document Intelligence Studio Sample

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = "https://<private-endpoint-instance>.cognitiveservices.azure.com/"
key = ""

formUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf"

document_analysis_client = DocumentAnalysisClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )

poller = document_analysis_client.begin_analyze_document_from_url("prebuilt-document", formUrl)
result = poller.result()

for kv_pair in result.key_value_pairs:
    if kv_pair.key and kv_pair.value:
        print("Key '{}': Value: '{}'".format(kv_pair.key.content, kv_pair.value.content))
    elif kv_pair.key:
        # guard against pairs with no key, which would otherwise raise AttributeError
        print("Key '{}': Value:".format(kv_pair.key.content))

Non-working Sample

Based on MS Docs Sample

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult

endpoint = "https://<same private endpoint instance>.cognitiveservices.azure.com/"
key = ""

document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
with open("./data/OHB5336.pdf", "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout", analyze_request=f, content_type="application/octet-stream"
    )
result: AnalyzeResult = poller.result()

Therefore, I believe the issue is related to the newer Document Intelligence package.
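One plausible explanation for the behavior difference: the two SDK generations call different REST routes on the same endpoint, so a private-endpoint or gateway configuration that only allows the older /formrecognizer route would 404 on the newer one. The paths and api-versions below are my reading of the SDK sources, not authoritative:

```python
# Illustrative sketch only: approximate analyze-request URLs for the two SDK
# generations. Paths and api-versions are assumptions, not authoritative.
def analyze_url(endpoint: str, sdk: str, model_id: str = "prebuilt-layout") -> str:
    if sdk == "formrecognizer":          # azure-ai-formrecognizer (older, v3 API)
        return f"{endpoint}formrecognizer/documentModels/{model_id}:analyze?api-version=2023-07-31"
    if sdk == "documentintelligence":    # azure-ai-documentintelligence (newer, v4 API)
        return f"{endpoint}documentintelligence/documentModels/{model_id}:analyze?api-version=2023-10-31"
    raise ValueError(f"unknown sdk: {sdk}")


base = "https://<instance>.cognitiveservices.azure.com/"
print(analyze_url(base, "formrecognizer"))
print(analyze_url(base, "documentintelligence"))
```

If this is the cause, the fix would be in the network configuration rather than the application code.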

datkinson-mdvip commented 2 months ago

I'm seeing this too, freshly downloaded repository.

pamelafox commented 2 months ago

Hm, so we've gotten private endpoints working in this PR: https://github.com/Azure-Samples/azure-search-openai-demo/pull/864/files

So I'm looking to see what Document-Intelligence-specific changes are in there that might be relevant. There's this configuration of network bypass: https://github.com/Azure-Samples/azure-search-openai-demo/pull/864/files#diff-8a64001dc63e4053382af7bbd6519e074e3a637a9e0a50b5b6e8ca136b4224ceR37

That's the only change I see specific to Document Intelligence.

We don't change our URL to Document Intelligence, did you change your URL? I don't believe that should be necessary.

(I am still learning Azure networking so I may be wrong)

cc @mattgotteiner in case he has insights

jimmylevell commented 2 months ago

Just to clarify: we have the main branch of the app running in our Azure infrastructure. Each resource only uses private endpoints. As mentioned, we configured each resource manually, but no code changes were required (besides the mentioned Form Recognizer library). Let me know if any further information would help your PR.

What is weird is that I am using the same configuration once with the Form Recognizer library and once with the Document Intelligence library, with the latter throwing the ResourceNotFound error. I am accessing the default Form Recognizer URL provided in the Azure portal, which points to the private endpoint IP.

(same here)

tylorbunting commented 2 months ago

Had the same issue. The root cause was deploying the Document Intelligence resource into the Australia East region (needed for a client demonstration). The resolution was to implement a solution similar to the one recommended by @jimmylevell above :)

I replaced the Document Intelligence components (i.e. the DocumentIntelligenceClient and DocumentTable classes) with their Form Recognizer counterparts (i.e. DocumentAnalysisClient and FormTable), as in the code below:

import html

from azure.ai.formrecognizer.aio import DocumentAnalysisClient
from azure.ai.formrecognizer import FormTable

...

        async with DocumentAnalysisClient(
            endpoint=self.endpoint, credential=self.credential
        ) as document_analysis_client:
            poller = await document_analysis_client.begin_analyze_document(
                model_id=self.model_id, document=content
            )
            form_recognizer_results = await poller.result()

            offset = 0
            for page_num, page in enumerate(form_recognizer_results.pages):
                tables_on_page = [
                    table
                    for table in (form_recognizer_results.tables or [])
                    if table.bounding_regions and table.bounding_regions[0].page_number == page_num + 1
                ]

                # mark all positions of the table spans in the page
                page_offset = page.spans[0].offset
                page_length = page.spans[0].length
                table_chars = [-1] * page_length
                for table_id, table in enumerate(tables_on_page):
                    for span in table.spans:
                        # replace all table spans with "table_id" in table_chars array
                        for i in range(span.length):
                            idx = span.offset - page_offset + i
                            if idx >= 0 and idx < page_length:
                                table_chars[idx] = table_id

                # build page text by replacing characters in table spans with table html
                page_text = ""
                added_tables = set()
                for idx, table_id in enumerate(table_chars):
                    if table_id == -1:
                        page_text += form_recognizer_results.content[page_offset + idx]
                    elif table_id not in added_tables:
                        page_text += DocumentAnalysisParser.table_to_html(tables_on_page[table_id])
                        added_tables.add(table_id)

                yield Page(page_num=page_num, offset=offset, text=page_text)
                offset += len(page_text)

    @classmethod
    def table_to_html(cls, table: FormTable):
        table_html = "<table>"
        rows = [
            sorted([cell for cell in table.cells if cell.row_index == i], key=lambda cell: cell.column_index)
            for i in range(table.row_count)
        ]
        for row_cells in rows:
            table_html += "<tr>"
            for cell in row_cells:
                tag = "th" if (cell.kind == "columnHeader" or cell.kind == "rowHeader") else "td"
                cell_spans = ""
                if cell.column_span is not None and cell.column_span > 1:
                    cell_spans += f" colSpan={cell.column_span}"
                if cell.row_span is not None and cell.row_span > 1:
                    cell_spans += f" rowSpan={cell.row_span}"
                table_html += f"<{tag}{cell_spans}>{html.escape(cell.content)}</{tag}>"
            table_html += "</tr>"
        table_html += "</table>"
        return table_html
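As a quick sanity check of the table_to_html logic above, here is a standalone sketch that exercises it with SimpleNamespace stubs in place of real FormTable cells (the stub attribute names mirror the ones the method reads; the `cell` helper is purely for this test):

```python
import html
from types import SimpleNamespace


def table_to_html(table):
    # Standalone copy of the logic above, runnable without the Azure SDK.
    table_html = "<table>"
    rows = [
        sorted([cell for cell in table.cells if cell.row_index == i], key=lambda cell: cell.column_index)
        for i in range(table.row_count)
    ]
    for row_cells in rows:
        table_html += "<tr>"
        for cell in row_cells:
            tag = "th" if cell.kind in ("columnHeader", "rowHeader") else "td"
            cell_spans = ""
            if cell.column_span is not None and cell.column_span > 1:
                cell_spans += f" colSpan={cell.column_span}"
            if cell.row_span is not None and cell.row_span > 1:
                cell_spans += f" rowSpan={cell.row_span}"
            table_html += f"<{tag}{cell_spans}>{html.escape(cell.content)}</{tag}>"
        table_html += "</tr>"
    table_html += "</table>"
    return table_html


def cell(r, c, content, kind=None, col_span=None, row_span=None):
    # Minimal stand-in for a FormTable cell
    return SimpleNamespace(row_index=r, column_index=c, content=content,
                           kind=kind, column_span=col_span, row_span=row_span)


stub = SimpleNamespace(
    row_count=1,
    cells=[cell(0, 0, "Name", kind="columnHeader"), cell(0, 1, "A & B")],
)
print(table_to_html(stub))
# → <table><tr><th>Name</th><td>A &amp; B</td></tr></table>
```

Note that header cells become th elements and cell content is HTML-escaped, which matters for the downstream chunking and rendering.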