Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License
5.94k stars 4.08k forks

Azure Document Intelligence Error: InvalidContentDimensions #1735

Open anoobbacker opened 3 months ago

anoobbacker commented 3 months ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

### Minimal steps to reproduce

I added thousands of Markdown files and images to the `data/` folder. `azd up` failed on every image whose dimensions are too small. Skipping those files would be fine, but currently I have to delete them before `azd up` can proceed.

### Any log messages given by the failure

For files with very small (unsupported) dimensions, `azd up` kept failing, and I had to fix each file to proceed.


Ingesting 'efficiency.png'
Extracting text from './data/efficiency.png' using Azure Document Intelligence
Traceback (most recent call last):
File "/workspaces/azure-search-openai-demo/./app/backend/prepdocs.py", line 470, in <module>
loop.run_until_complete(main(ingestion_strategy, setup_index=not args.remove and not args.removeall))
File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/workspaces/azure-search-openai-demo/./app/backend/prepdocs.py", line 213, in main
await strategy.run()
File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/filestrategy.py", line 84, in run
sections = await parse_file(file, self.file_processors, self.category, self.image_embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/filestrategy.py", line 26, in parse_file
pages = [page async for page in processor.parser.parse(content=file.content)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/filestrategy.py", line 26, in <listcomp>
pages = [page async for page in processor.parser.parse(content=file.content)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/azure-search-openai-demo/app/backend/prepdocslib/pdfparser.py", line 54, in parse
poller = await document_intelligence_client.begin_analyze_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/azure/core/tracing/decorator_async.py", line 94, in wrapper_use_tracer
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/azure/ai/documentintelligence/aio/_operations/_operations.py", line 3241, in begin_analyze_document
raw_result = await self._analyze_document_initial(  # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/azure-search-openai-demo/.venv/lib/python3.11/site-packages/azure/ai/documentintelligence/aio/_operations/_operations.py", line 132, in _analyze_document_initial
raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: (InvalidRequest) Invalid request.
Code: InvalidRequest
Message: Invalid request.
Inner error: {
"code": "InvalidContentDimensions",
"message": "The input image dimensions are out of range. Refer to documentation for supported image dimensions."
}

ERROR: error executing step command 'provision': failed running post hooks: 'postprovision' hook failed with exit code: '1', Path: '/tmp/azd-postprovision-1671022109.sh'. : exit code: 1

### Expected/desired behavior
> Skip files that fail with InvalidContentDimensions and provide a summary of the skipped files afterwards.

### OS and Version?
> GitHub Codespaces

uname -ra Linux codespaces-50920e 6.5.0-1021-azure #22~22.04.1-Ubuntu SMP Tue Apr 30 16:08:18 UTC 2024 x86_64 GNU/Linux


### azd version?
> run `azd version` and copy paste here.

azd version 1.9.3 (commit e1624330dcc7dde440ecc1eda06aac40e68aa0a3)


### Versions
>

### Mention any other details that might be useful

pamelafox commented 3 months ago

Ah, thanks for filing. I haven't done much data ingestion with images. I'm a little worried that if we skipped over images, people wouldn't notice the logs, so I think skipping would have to be an optional flag. You should be able to modify the code to try/except that particular exception for your use case.

If any other developers run into this as well, please comment so we can see how often developers are using this repo with images/small images.
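
For anyone wanting to try the try/except approach locally, here is a rough sketch of what skipping plus a summary could look like. It is not the repo's code: `inner_error_code`, `ingest_all`, `format_summary`, and `parse_one` are hypothetical names, and the lookup assumes the raised exception exposes the nested error as `exc.error.innererror["code"]`, which is my understanding of how azure-core's `HttpResponseError` surfaces the "Inner error" shown in the traceback. In real code you would probably catch `azure.core.exceptions.HttpResponseError` rather than `Exception`.

```python
import asyncio
from collections.abc import Mapping
from typing import Awaitable, Callable, Iterable, List, Optional

# Inner error code reported in the traceback above.
SKIPPABLE_CODE = "InvalidContentDimensions"


def inner_error_code(exc: BaseException) -> Optional[str]:
    """Best-effort lookup of the nested error code on an exception.

    Assumption: the exception follows the layout exc.error.innererror["code"].
    Returns None if any part of that chain is missing.
    """
    error = getattr(exc, "error", None)
    inner = getattr(error, "innererror", None)
    if isinstance(inner, Mapping):
        return inner.get("code")
    return None


async def ingest_all(
    filenames: Iterable[str],
    parse_one: Callable[[str], Awaitable[None]],  # hypothetical per-file parser
) -> List[str]:
    """Parse each file, skipping only those rejected for their dimensions."""
    skipped: List[str] = []
    for name in filenames:
        try:
            await parse_one(name)
        except Exception as exc:  # narrow this to HttpResponseError in real code
            if inner_error_code(exc) != SKIPPABLE_CODE:
                raise  # unrelated failures still stop the run
            skipped.append(name)
    return skipped


def format_summary(skipped: List[str]) -> str:
    """Build the end-of-run summary so skipped files are hard to miss."""
    if not skipped:
        return "No files were skipped."
    lines = [f"Skipped {len(skipped)} file(s) ({SKIPPABLE_CODE}):"]
    lines += [f"  - {name}" for name in skipped]
    return "\n".join(lines)
```

Printing `format_summary(...)` once at the end of the run, rather than only logging each skip as it happens, is one way to address the concern that skipped images would go unnoticed. Putting the behavior behind an opt-in flag, as suggested above, would keep the current fail-fast default.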