Simplification Quest - Githubissues

thracian2015 commented 2 months ago

Please provide us with the following information:

This issue is for a: (mark with an `x`)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Thank you for this sample. I'm new to LLMs and RAG. I'm learning, and I have a few questions to help me understand the tradeoffs if I were to simplify the architecture:

Why do we need to prep the documents using custom code (paging, sectioning, Document Intelligence) instead of using the import wizard in AI Search, which also can be scheduled to crawl the content?

If we are to prep documents, what features does Document Intelligence offer compared to, for example, Python PDFReader?

Why do we use the LLM embedding model instead of Semantic Ranker in AI Search to find document similarity?

Any log messages given by the failure

Expected/desired behavior

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

azd version?

run azd version and copy paste here.

Versions

Mention any other details that might be useful

Thanks! We'll be in touch soon.

pamelafox commented 2 months ago

This repository also supports integrated vectorization, which may be what that wizard uses (I'm not sure). The code in this repository is more customized, and more customizable, so it includes features such as recording the page number of each parsed PDF chunk and parsing more document types. If the wizard works fine for you, then you can use it, but you'd need to make sure the index's search fields match the ones the search code here expects.
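To illustrate what "including the page number of parsed PDF chunks" can look like, here is a minimal sketch. It is not the repository's actual splitter: the `Chunk` type, the `chunk_pages` function, and the fixed `max_chars` cut are hypothetical, and it assumes the page text has already been extracted.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    page: int  # 1-based page number the chunk came from
    text: str


def chunk_pages(pages: list[str], max_chars: int = 1000) -> list[Chunk]:
    """Split each page's text into pieces of at most max_chars characters.

    Hypothetical helper: a real splitter would usually cut on sentence
    boundaries and add overlap; this one cuts at fixed offsets for brevity.
    """
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        for start in range(0, len(page_text), max_chars):
            piece = page_text[start:start + max_chars]
            chunks.append(Chunk(page=page_number, text=piece))
    return chunks
```

Storing `page` next to `text` in the index is what lets an answer cite "page 3 of benefits.pdf" rather than only the file name.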
Document Intelligence can also OCR images; I'm not sure the Python reader can. It also supports a wide range of document types now, like docx and pptx. If you haven't read it yet, the data ingestion guide has more info: https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/data_ingestion.md
We use embedding models to turn text into vectors for retrieval, and we use semantic ranker to re-rank once we have the top search results (it's also called an "L2 ranker" because it runs as a second pass that re-scores a list of results it is given for a query).
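The split between the two stages can be sketched as follows. This is a toy illustration under stated assumptions, not the Azure AI Search API: the vectors are made up, and `word_overlap` is a hypothetical stand-in for a real re-ranking model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, docs, top_k=3):
    """Stage 1: rank every document by embedding similarity, keep top_k."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return ranked[:top_k]


def rerank(query, candidates, score_fn):
    """Stage 2: re-order only the retrieved candidates with a (query, text) scorer."""
    return sorted(candidates, key=lambda d: score_fn(query, d["text"]), reverse=True)


def word_overlap(query, text):
    # Toy stand-in for a re-ranking model (assumption, for illustration only).
    return len(set(query.lower().split()) & set(text.lower().split()))
```

The first stage is cheap enough to run over the whole index; the second stage only ever sees the short candidate list, which is why the two are complementary rather than interchangeable.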

My talk here might help: https://youtube.com/live/vuOA13Y_Qzk?feature=share

thracian2015 commented 2 months ago

Thank you, Pamela. Yes, I've watched your videos and thank you so much for sharing knowledge!

My research shows that the new Import and Vectorize Wizard can save tons of code to:

- Extract and chunk text.
- Process documents of different types stored in the same folder.
- Generate embedding vectors by integrating with an Azure OpenAI embedding model, such as text-embedding-ada-002.
- Refresh the index on a schedule.

It'd be great if you could evaluate the wizard and share your insights in another video. A low(er)-code approach would surely be appreciated by many. I'd also love to see tighter integration between Azure AI Search and Azure OpenAI: you would submit a question to Azure AI Search and get the Azure OpenAI response back, with AI Search calling an Azure OpenAI text deployment itself, thus avoiding the two-step custom code (similar to what the wizard does with embeddings).
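For readers unfamiliar with the pattern, the "two-step custom code" mentioned above is roughly: query the search index, then pass the hits to a chat model as grounding. A minimal sketch, assuming two hypothetical callables (`search` and `complete`) that wrap whatever search and LLM clients are in use; the prompt wording is also an assumption:

```python
from typing import Callable


def answer_question(
    question: str,
    search: Callable[[str], list[str]],
    complete: Callable[[str], str],
) -> str:
    """Two-step RAG flow: retrieve sources, then ask the model to answer from them."""
    # Step 1: retrieve candidate passages for the question.
    sources = search(question)
    # Step 2: build a grounded prompt and send it to the language model.
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, start=1))
    prompt = f"Answer using only these sources:\n{numbered}\n\nQuestion: {question}"
    return complete(prompt)
```

Keeping the two steps in application code is what allows custom prompts, citations, and filtering between retrieval and generation; a built-in integration would trade that control for less code.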

Azure-Samples / azure-search-openai-demo

Simplification Quest #1839
