Excel converted to PDF retrieval inaccurate

bp3000bp commented 3 weeks ago

Please provide us with the following information:

This issue is for a: (mark with an `x`)

- [x] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

I took an Excel sheet with tables of contact information for several hundred employees and converted it to a PDF to upload to the data folder. Retrieval with this PDF file is extremely unreliable. I am wondering whether uploading the Excel sheet itself would make the responses more accurate, or whether the issue is caused by the amount of information stored in this one document. It is frustrating to ask the bot, "what is the contact info for ___" and, despite it being clearly listed in the PDF, not get an accurate answer. How can I address this?

Any log messages given by the failure

n/a

Expected/desired behavior

Give the correct contact information

OS and Version?

Windows 11

azd version?

azd version 1.9.6 (commit 0b1a8ad350177078312141f4465ad72205d7c077)

Versions

n/a

Mention any other details that might be useful

Thanks! We'll be in touch soon.

pamelafox commented 3 weeks ago

Doing RAG on tabular data can be trickier than RAG on paragraphs, lists, etc. You might find some ideas by searching the issue tracker for excel/csv/tabular.

As for figuring out where the issue is in your answer quality, please step through the process in the customization guide (https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/customization.md#improving-answer-quality). You can see what the retrieved chunks are and whether they contain the answer. If they do contain the answer, then it's likely that the LLM doesn't understand the formatting, and you may want to try uploading the Excel file itself or even converting the data to a Markdown table. If they don't contain the answer, then the problem is with the search step.
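To make the Markdown table idea concrete, here is a minimal sketch that turns tabular rows into a Markdown table using only the Python standard library. It assumes the sheet has first been exported to CSV; the column names and sample row are hypothetical placeholders, and `rows_to_markdown_table` is an illustrative helper, not part of this repo. Whether the result actually improves answers would still need to be checked against the retrieved chunks.

```python
import csv
import io


def rows_to_markdown_table(rows: list[dict[str, str]]) -> str:
    """Render row dicts as a Markdown table, using the first row's keys as headers."""
    if not rows:
        return ""
    headers = list(rows[0].keys())

    def cell(value: str) -> str:
        # Escape pipes so a value cannot break the table layout.
        return str(value).replace("|", "\\|").strip()

    lines = [
        "| " + " | ".join(cell(h) for h in headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(row.get(h, "")) for h in headers) + " |")
    return "\n".join(lines)


# Hypothetical sample data standing in for a CSV export of the Excel sheet.
SAMPLE_CSV = """name,phone,email,title
first last,555-555-5555,firstlast@company.com,accounting coordinator
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
markdown = rows_to_markdown_table(rows)
print(markdown)
```

With a real export, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle, and the resulting text could be saved and tried as an alternative source document.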

bp3000bp commented 3 weeks ago

The issue is not in retrieval: the retrieval step brings back the correct information, so it must be in the formatting? That is confusing, because in my past issues the problem was usually retrieval. I will try uploading the Excel document itself, but I am afraid it just won't work with this document and information specifically.

bp3000bp commented 3 weeks ago

@pamelafox instead of using Excel, would a different strategy work better? For example, making a separate document for each department and, instead of tabular data, using a list formatted like this:

One docx file converted to PDF
Accounting department contacts:
name: first last, phone: 555-555-5555, email: firstlast@company.com, title: accounting coordinator
name: first last, phone: 555-555-5555, email: firstlast@company.com, title: accounting coordinator
name: first last, phone: 555-555-5555, email: firstlast@company.com, title: accounting coordinator

Second docx file converted to PDF
Marketing department contacts:
name: first last, phone: 555-555-5555, email: firstlast@company.com, title: marketing coordinator
name: first last, phone: 555-555-5555, email: firstlast@company.com, title: marketing coordinator
name: first last, phone: 555-555-5555, email: firstlast@company.com, title: marketing coordinator

Would presenting the information this way make it easier for the LLM to retrieve it and provide accurate responses? This strategy would split the info, which covers several hundred employees, into several documents, one per department. Or is this type of data just not going to work for this application?
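For anyone who wants to try the per-department list format without retyping the sheet by hand, here is a rough sketch that groups rows by department and produces one block of "key: value" lines per department. The `department` column and the sample rows are assumptions for illustration, and `contacts_by_department` is a hypothetical helper, not something this repo provides. It only generates the text; it makes no claim about whether this layout gives better answers.

```python
from collections import defaultdict


def contacts_by_department(rows: list[dict[str, str]]) -> dict[str, str]:
    """Group contact rows by department and format each group as labeled lines."""
    grouped: dict[str, list[dict[str, str]]] = defaultdict(list)
    for row in rows:
        grouped[row["department"]].append(row)

    documents = {}
    for department, members in grouped.items():
        lines = [f"{department} department contacts:"]
        for m in members:
            lines.append(
                f"name: {m['name']}, phone: {m['phone']}, "
                f"email: {m['email']}, title: {m['title']}"
            )
        documents[department] = "\n".join(lines)
    return documents


# Hypothetical rows standing in for the real contact sheet.
sample_rows = [
    {"department": "Accounting", "name": "first last", "phone": "555-555-5555",
     "email": "firstlast@company.com", "title": "accounting coordinator"},
    {"department": "Marketing", "name": "first last", "phone": "555-555-5555",
     "email": "firstlast@company.com", "title": "marketing coordinator"},
]

docs = contacts_by_department(sample_rows)
for text in docs.values():
    print(text)
    print()
```

Each value in `docs` could then be written to its own file, one per department, and uploaded in place of the single large PDF.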
