A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
This is reproducible with the sample data. When running prepdocs, you'll see that several pages aren't represented in the uploaded sections: no section has a sourcepage equal to that page number, so no section has an imageEmbedding for that page either. That means some answers may be lower quality, since retrieval can't find the relevant matching image.
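A quick way to check which pages are affected might look like the sketch below. It assumes each uploaded section is a dict whose `sourcepage` is a plain 1-based page number; the real field format in the index may differ. The function name and the sample sections are hypothetical.

```python
def find_unrepresented_pages(sections, page_count):
    """Return the 1-based page numbers that no section lists as its sourcepage."""
    # Assumption: "sourcepage" holds an integer page number.
    covered = {section["sourcepage"] for section in sections}
    return [page for page in range(1, page_count + 1) if page not in covered]


# Hypothetical sections for a 4-page document.
sample_sections = [
    {"id": "s1", "sourcepage": 1},
    {"id": "s2", "sourcepage": 1},
    {"id": "s3", "sourcepage": 3},
]

missing_pages = find_unrepresented_pages(sample_sections, page_count=4)
print(missing_pages)
```

Any page in the returned list has no section, and therefore no imageEmbedding, pointing at it.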
Possible approaches:
For certain document types, like slides, never chunk sections across pages. This was my original idea, but then I realized our sample document was a set of slides exported as a PDF, so I couldn't rely on a PPT-dependent condition. Thus, this isn't a full solution.
Never let sections go across pages. This may not work well with many PDFs, like research papers, whose sections legitimately span pages.
Associate multiple sourcepages with a single section. @mattgotteiner says that's possible by picking a delimiter. Not sure if multiple imageEmbeddings would also be possible? Otherwise we'd have to pick which imageEmbedding we thought was best.
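To make the third approach a bit more concrete, here is a rough sketch of storing several pages in one sourcepage value. The `|` delimiter, the helper names, and the integer page numbers are all assumptions for illustration, not what prepdocs does today, and this doesn't address the open question about multiple imageEmbeddings.

```python
SOURCEPAGE_DELIMITER = "|"  # hypothetical choice; must never appear in a page value


def join_sourcepages(pages):
    """Encode the pages a section spans as one delimited string, sorted and de-duplicated."""
    unique_pages = sorted(set(pages))
    return SOURCEPAGE_DELIMITER.join(str(page) for page in unique_pages)


def split_sourcepages(value):
    """Decode a delimited sourcepage string back into a list of page numbers."""
    return [int(part) for part in value.split(SOURCEPAGE_DELIMITER) if part]


def section_covers_page(section, page):
    """True if the section's sourcepage value includes the given page."""
    return page in split_sourcepages(section["sourcepage"])


# Hypothetical section whose text runs from page 2 onto page 3.
spanning_section = {"id": "s2", "sourcepage": join_sourcepages([3, 2, 3])}
print(spanning_section["sourcepage"])
print(section_covers_page(spanning_section, 3))
```

With this encoding, a page counts as represented when any section's decoded list includes it, so the text spilling onto page 3 no longer leaves that page without a section.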