How to disable OCR in prepdocs script?

Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.

MIT License


This issue is for a: (mark with an `x`)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Run the prepdocs.sh script on PDF files that contain embedded images.
Text recognized in those embedded images gets indexed.

Any log messages given by the failure

n/a

Expected/desired behavior

I'd like a way to disable OCR of images embedded in PDF files. Our use case is application and training documentation that includes screenshots of application screens showing random/example data, and we don't want that data in the index.
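To illustrate the desired behavior, here is a minimal sketch of text-layer-only extraction, where text that exists only as pixels in a screenshot is never read. It is not the sample's ingestion code. It assumes the third-party `pypdf` package is installed, and the names `extract_text_layer`, `load_pdf_pages`, and `manual.pdf` are hypothetical.

```python
from typing import Iterable, List, Protocol


class TextPage(Protocol):
    """Anything exposing extract_text(), such as a pypdf page object."""

    def extract_text(self) -> str: ...


def extract_text_layer(pages: Iterable[TextPage]) -> List[str]:
    """Return the embedded text of each page; image content is not OCR'd."""
    texts = []
    for page in pages:
        # extract_text() reads only the page's text content, so words that
        # exist solely as pixels inside a screenshot are not returned.
        texts.append((page.extract_text() or "").strip())
    return texts


def load_pdf_pages(path: str):
    """Open a PDF with pypdf (imported lazily) and return its pages."""
    from pypdf import PdfReader  # assumption: pypdf is installed

    return list(PdfReader(path).pages)
```

Something like `extract_text_layer(load_pdf_pages("manual.pdf"))` would then give one string per page for chunking and indexing. The trade-off is that scanned PDFs with no text layer would produce empty pages, so an opt-in switch seems safer than changing the default.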

OS and Version?

Linux Ubuntu

Versions

2024-08-23
