Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Process fails for certain pdf document but not others, not sure why #119

Open suetholen opened 1 year ago

suetholen commented 1 year ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

I added some custom PDFs to my data folder in place of the default Northwind Traders documents, and all of them worked except this one: https://www.faa.gov/sites/faa.gov/files/regulations_policies/handbooks_manuals/aviation/airplane_handbook/00_afh_full.pdf. Processing it generated roughly 200 pages and uploaded the first 9 (0-8), but the upload of blob #9 failed. The PDF appears to have a blank page there containing only a page number, but when I modified the script to start from page 10, it failed there as well.
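One way to narrow this down is to check whether each page can be split out on its own, independent of the upload step. The following is only a rough sketch, not the code from prepdocs.py: `find_failing_pages` and `make_pdf_page_writer` are hypothetical helper names, and it assumes pypdf is installed and that the failure comes from serialising a single page rather than from the blob upload itself.

```python
import io
from typing import Callable, List, Tuple


def find_failing_pages(
    page_count: int, process_page: Callable[[int], object]
) -> List[Tuple[int, str]]:
    """Run process_page on every page index and collect the ones that raise.

    Unlike a plain loop, this keeps going after a failure, so it shows
    whether one page is bad or everything from some page onward fails.
    """
    failures = []
    for index in range(page_count):
        try:
            process_page(index)
        except Exception as exc:  # broad on purpose: this is a diagnostic
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return failures


def make_pdf_page_writer(path: str):
    """Hypothetical helper: return (page_count, write_page) for a PDF file.

    write_page(i) serialises page i into memory with pypdf, without
    touching Azure at all.
    """
    # Imported lazily so the function above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)

    def write_page(index: int) -> bytes:
        writer = PdfWriter()
        writer.add_page(reader.pages[index])
        buffer = io.BytesIO()
        writer.write(buffer)
        return buffer.getvalue()

    return len(reader.pages), write_page
```

With a local copy of the file, `page_count, write_page = make_pdf_page_writer("00_afh_full.pdf")` followed by `find_failing_pages(page_count, write_page)` would list the page indexes that cannot be written. An empty list would suggest the problem is in the upload step rather than in the PDF splitting.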

Any log messages given by the failure

A string of errors. The most recent was at line 312 in prepdocs.py; the earlier ones were in azure\core\pipeline\transport.

Expected/desired behavior

OS and Version?

Windows 11

Versions

Ran this on 4/17/2023. Five or six other PDFs of similar origin worked without error, but that one did not. I'm curious whether anyone can figure out why, and what's different about that file.

Mention any other details that might be useful


Thanks! We'll be in touch soon.

jimmylevell commented 1 year ago

I ran into a similar issue. In my case, updating the pypdf dependency (pypdf==3.7.1) solved it.
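To see which pypdf version an environment actually has before and after updating, something like the sketch below may help. It uses only the standard library. The helper names are made up for illustration, the version comparison is deliberately simplistic (it ignores pre-release tags), and treating 3.7.1 as the minimum is an assumption based on the version that worked here, not a confirmed threshold.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def version_tuple(text: str) -> Tuple[int, ...]:
    """Turn '3.7.1' into (3, 7, 1), keeping at most three numeric parts."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the given minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    found = installed_version("pypdf")
    if found is None:
        print("pypdf is not installed in this environment")
    elif meets_minimum(found, "3.7.1"):  # assumed threshold, see above
        print(f"pypdf {found} is at least 3.7.1")
    else:
        print(f"pypdf {found} is older than 3.7.1; consider updating")
```

If the version is older, pinning `pypdf==3.7.1` in the requirements file and reinstalling is what worked for me; whether it also fixes the FAA handbook case is untested.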

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.