Open philipho11 opened 3 months ago
Make sure your Document Intelligence service is in one of these regions: westus2, eastus, or westeurope.
In my case the issue only occurs with the latest azure-ai-documentintelligence package when private endpoints are present. Using the older azure.ai.formrecognizer package seems to solve the issue.
@jimmylevell You say you're using private endpoints? Did you add those manually in the Portal? We have a PR which adds support for private endpoints, but it's not yet in main. We're still testing that, so I don't know if we've seen issues with using Document Intelligence. FYI to @mattgotteiner who's working on that PR.
@pamelafox thank you for your fast reply. Due to internal policies, we needed to deploy the solution manually within Azure. In the process, all Azure resources were configured with private endpoints. The solution works as expected within our tenant (Azure Switzerland North). The only change we needed to make was reverting to the older Form Recognizer dependency in pdfparser.py:
```diff
- from azure.ai.documentintelligence.aio import DocumentIntelligenceClient
+ from azure.ai.formrecognizer.aio import DocumentAnalysisClient
```
The issue with the Form Recognizer can also be illustrated with the following demo code:
Based on Document Intelligence Studio Sample
```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = "https://<private-endpoint-instance>.cognitiveservices.azure.com/"
key = ""

formUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/sample-layout.pdf"

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

poller = document_analysis_client.begin_analyze_document_from_url("prebuilt-document", formUrl)
result = poller.result()

for kv_pair in result.key_value_pairs:
    if kv_pair.key and kv_pair.value:
        print("Key '{}': Value: '{}'".format(kv_pair.key.content, kv_pair.value.content))
    else:
        print("Key '{}': Value:".format(kv_pair.key.content))
```
Based on MS Docs Sample
```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult

endpoint = "https://<same private endpoint instance>.cognitiveservices.azure.com/"
key = ""

document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

with open("./data/OHB5336.pdf", "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout", analyze_request=f, content_type="application/octet-stream"
    )

result: AnalyzeResult = poller.result()
```
Therefore, I believe the issue is related to the newer Document Intelligence package.
I'm seeing this too, freshly downloaded repository.
Hm, we've gotten private endpoints working in this PR: https://github.com/Azure-Samples/azure-search-openai-demo/pull/864/files so I'm looking to see what Document Intelligence-specific changes in there might be relevant. There's this configuration of a network bypass: https://github.com/Azure-Samples/azure-search-openai-demo/pull/864/files#diff-8a64001dc63e4053382af7bbd6519e074e3a637a9e0a50b5b6e8ca136b4224ceR37 That's the only change I see specific to Document Intelligence.
We don't change our URL to Document Intelligence, did you change your URL? I don't believe that should be necessary.
(I am still learning Azure networking so I may be wrong)
cc @mattgotteiner in case he has insights
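One way to help rule out the URL/DNS side: check which IP the Document Intelligence hostname actually resolves to from inside the VNet. This is a small diagnostic sketch (the endpoint below is a placeholder, not taken from this thread); with a correctly configured private endpoint, resolution from inside the VNet should return a private IP, while the same lookup from outside returns the public IP.

```python
import socket
from urllib.parse import urlparse

def resolved_ip(endpoint: str) -> str:
    """Return the IP address the endpoint's hostname currently resolves to."""
    host = urlparse(endpoint).hostname
    return socket.gethostbyname(host)

# From inside the VNet a working private endpoint typically resolves to a
# private address (e.g. 10.x.x.x). Hypothetical usage:
# print(resolved_ip("https://<your-instance>.cognitiveservices.azure.com/"))
print(resolved_ip("https://localhost/"))  # sanity check on the helper itself
```

If both SDKs hit the same resolved IP, DNS is unlikely to be the culprit and the difference must be in the request itself.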
Just to clarify: we have the main branch of the app running on our Azure infrastructure. Each resource only uses private endpoints. As mentioned, we configured each resource manually, but no code changes were required (besides the mentioned Form Recognizer library). Let me know if any further information would help your PR.
What is weird is that I am using the same configuration once with the Form Recognizer library and once with the Document Intelligence library, with the latter throwing a ResourceNotFound error. I am accessing the default Form Recognizer URL provided in the Azure portal, which points to the private endpoint IP.
(same here)
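One possible explanation for the 404 against the very same endpoint (an assumption on my part, not confirmed in this thread): the two SDKs call different REST routes and API versions, and a region or gateway that only serves the older route would return ResourceNotFound for the newer one. A sketch of the URLs involved (route prefixes and api-version values below are illustrative and may differ by SDK release):

```python
endpoint = "https://<instance>.cognitiveservices.azure.com"  # placeholder
model_id = "prebuilt-layout"

# Route used by the older azure-ai-formrecognizer (3.x GA) package
old_url = f"{endpoint}/formrecognizer/documentModels/{model_id}:analyze?api-version=2023-07-31"

# Route used by the newer azure-ai-documentintelligence package
new_url = f"{endpoint}/documentintelligence/documentModels/{model_id}:analyze?api-version=2023-10-31-preview"

# If only the /formrecognizer/ route is exposed for the resource's region,
# requests to new_url would come back 404 (ResourceNotFound).
print(old_url)
print(new_url)
```

Comparing the raw HTTP requests (e.g. with SDK logging enabled) would confirm or rule this out.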
Had the same issue. The root cause was deploying the Document Intelligence resource into the Australia East region (needed for a client demonstration). The resolution was to implement a solution similar to what @jimmylevell recommended here :)
Replacing the Document Intelligence components (i.e. DocumentIntelligenceClient and DocumentTable) with their Form Recognizer counterparts (i.e. DocumentAnalysisClient and FormTable), using the code below:
```python
import html  # used by html.escape in table_to_html

from azure.ai.formrecognizer.aio import DocumentAnalysisClient
from azure.ai.formrecognizer import FormTable

...

async with DocumentAnalysisClient(
    endpoint=self.endpoint, credential=self.credential
) as document_analysis_client:
    poller = await document_analysis_client.begin_analyze_document(
        model_id=self.model_id, document=content
    )
    form_recognizer_results = await poller.result()

    offset = 0
    for page_num, page in enumerate(form_recognizer_results.pages):
        tables_on_page = [
            table
            for table in (form_recognizer_results.tables or [])
            if table.bounding_regions and table.bounding_regions[0].page_number == page_num + 1
        ]

        # mark all positions of the table spans in the page
        page_offset = page.spans[0].offset
        page_length = page.spans[0].length
        table_chars = [-1] * page_length
        for table_id, table in enumerate(tables_on_page):
            for span in table.spans:
                # replace all table spans with "table_id" in table_chars array
                for i in range(span.length):
                    idx = span.offset - page_offset + i
                    if idx >= 0 and idx < page_length:
                        table_chars[idx] = table_id

        # build page text by replacing characters in table spans with table html
        page_text = ""
        added_tables = set()
        for idx, table_id in enumerate(table_chars):
            if table_id == -1:
                page_text += form_recognizer_results.content[page_offset + idx]
            elif table_id not in added_tables:
                page_text += DocumentAnalysisParser.table_to_html(tables_on_page[table_id])
                added_tables.add(table_id)

        yield Page(page_num=page_num, offset=offset, text=page_text)
        offset += len(page_text)

@classmethod
def table_to_html(cls, table: FormTable):
    table_html = "<table>"
    rows = [
        sorted([cell for cell in table.cells if cell.row_index == i], key=lambda cell: cell.column_index)
        for i in range(table.row_count)
    ]
    for row_cells in rows:
        table_html += "<tr>"
        for cell in row_cells:
            tag = "th" if (cell.kind == "columnHeader" or cell.kind == "rowHeader") else "td"
            cell_spans = ""
            if cell.column_span is not None and cell.column_span > 1:
                cell_spans += f" colSpan={cell.column_span}"
            if cell.row_span is not None and cell.row_span > 1:
                cell_spans += f" rowSpan={cell.row_span}"
            table_html += f"<{tag}{cell_spans}>{html.escape(cell.content)}</{tag}>"
        table_html += "</tr>"
    table_html += "</table>"
    return table_html
```
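For anyone adapting the snippet above, here is a self-contained sketch of the table_to_html logic using stand-in dataclasses (not the real SDK types) so it can run without an Azure resource, just to show the HTML it produces:

```python
import html
from dataclasses import dataclass
from typing import List, Optional

# Minimal stand-ins mimicking the cell/table shape table_to_html relies on
@dataclass
class Cell:
    row_index: int
    column_index: int
    content: str
    kind: str = "content"
    column_span: Optional[int] = None
    row_span: Optional[int] = None

@dataclass
class Table:
    row_count: int
    cells: List[Cell]

def table_to_html(table: Table) -> str:
    table_html = "<table>"
    # group cells by row, sorted left to right
    rows = [
        sorted([c for c in table.cells if c.row_index == i], key=lambda c: c.column_index)
        for i in range(table.row_count)
    ]
    for row_cells in rows:
        table_html += "<tr>"
        for cell in row_cells:
            tag = "th" if cell.kind in ("columnHeader", "rowHeader") else "td"
            spans = ""
            if cell.column_span is not None and cell.column_span > 1:
                spans += f" colSpan={cell.column_span}"
            if cell.row_span is not None and cell.row_span > 1:
                spans += f" rowSpan={cell.row_span}"
            table_html += f"<{tag}{spans}>{html.escape(cell.content)}</{tag}>"
        table_html += "</tr>"
    table_html += "</table>"
    return table_html

t = Table(row_count=2, cells=[
    Cell(0, 0, "Name", kind="columnHeader"),
    Cell(0, 1, "Qty", kind="columnHeader"),
    Cell(1, 0, "Apple"),
    Cell(1, 1, "3"),
])
print(table_to_html(t))
# -> <table><tr><th>Name</th><th>Qty</th></tr><tr><td>Apple</td><td>3</td></tr></table>
```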
This issue is for a: (mark with an `x`)

Minimal steps to reproduce
Any log messages given by the failure
```
(✓) Done: Packaging service backend

SUCCESS: Your application was packaged for Azure in 51 seconds.
Checking if authentication should be setup...
Loading azd .env file from current environment...
AZURE_USE_AUTHENTICATION is set, proceeding with authentication setup...
Creating Python virtual environment "app/backend/.venv"...
Installing dependencies from "requirements.txt" into virtual environment (in quiet mode)...
Not setting up authentication.

Provisioning Azure resources (azd provision)
Provisioning Azure resources can take some time.

Subscription: icdev (bb369bdb-6d2a-483a-ac31-fa61b10cacfa)
Location: East Asia

(-) Skipped: Didn't find new changes.

Loading azd .env file from current environment...
Creating Python virtual environment "app/backend/.venv"...
Installing dependencies from "requirements.txt" into virtual environment (in quiet mode)...
Not updating authentication.
Loading azd .env file from current environment...
Creating Python virtual environment "app/backend/.venv"...
Installing dependencies from "requirements.txt" into virtual environment (in quiet mode)...
Running "prepdocs.py"
Using local files: ./data/
Ensuring search index gptkbindex exists
Search index gptkbindex already exists
Skipping ./data/employee_handbook.pdf, no changes detected.
Ingesting '2190.json'
Splitting '2190.json' into sections
Uploading blob for whole file -> 2190.json
Computed embeddings in batch. Batch size: 2, Token count: 303
Ingesting 'query.json'
Splitting 'query.json' into sections
Uploading blob for whole file -> query.json
Computed embeddings in batch. Batch size: 7, Token count: 2199
Ingesting '2192.json'
Splitting '2192.json' into sections
Uploading blob for whole file -> 2192.json
Computed embeddings in batch. Batch size: 2, Token count: 375
Ingesting '2191.json'
Splitting '2191.json' into sections
Uploading blob for whole file -> 2191.json
Computed embeddings in batch. Batch size: 2, Token count: 418
Ingesting '2189.json'
Splitting '2189.json' into sections
Uploading blob for whole file -> 2189.json
Computed embeddings in batch. Batch size: 1, Token count: 205
Ingesting 'Benefit_Options.pdf'
Extracting text from './data/Benefit_Options.pdf' using Azure Document Intelligence
Traceback (most recent call last):
  File "/workspaces/azure-search-openai-demo/./app/backend/prepdocs.py", line 494, in <module>
    loop.run_until_complete(main(ingestion_strategy, setup_index=not args.remove and not args.removeall))
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/workspaces/azure-search-openai-demo/./app/backend/prepdocs.py", line 225, in main
    await strategy.run()
  File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/filestrategy.py", line 84, in run
    sections = await parse_file(file, self.file_processors, self.category, self.image_embeddings)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/filestrategy.py", line 26, in parse_file
    pages = [page async for page in processor.parser.parse(content=file.content)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/filestrategy.py", line 26, in <listcomp>
    pages = [page async for page in processor.parser.parse(content=file.content)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/pdfparser.py", line 54, in parse
    poller = await document_intelligence_client.begin_analyze_document(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/azure/core/tracing/decorator_async.py", line 77, in wrapper_use_tracer
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/azure/ai/documentintelligence/aio/_operations/_operations.py", line 3241, in begin_analyze_document
    raw_result = await self._analyze_document_initial(  # type: ignore
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/azure/ai/documentintelligence/aio/_operations/_operations.py", line 130, in _analyze_document_initial
    map_error(status_code=response.status_code, response=response, error_map=error_map)
  File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/azure/core/exceptions.py", line 164, in map_error
    raise error
azure.core.exceptions.ResourceNotFoundError: (404) Resource not found
Code: 404
Message: Resource not found

ERROR: failed running post hooks: 'postprovision' hook failed with exit code: '1', Path: '/tmp/azd-postprovision-1942051034.sh'. : exit code: 1
ERROR: error executing step command 'provision': failed running post hooks: 'postprovision' hook failed with exit code: '1', Path: '/tmp/azd-postprovision-1942051034.sh'. : exit code: 1
```
Expected/desired behavior
OS and Version?
azd version?
Versions
Mention any other details that might be useful