Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Implementation of Langchain indexing or any other similar solution for indexing on chunk level, not on entire document level #1808

Open adrianruchti opened 1 month ago

adrianruchti commented 1 month ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

SQL DB setup required aside from AzureSearch

Langchain compatible vectorstores: AnalyticDB, AstraDB, AzureCosmosDBVectorSearch, AzureSearch, …

```python
from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

namespace = f"elasticsearch/{collection_name}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)
record_manager.create_schema()

doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})
```

Using the `"incremental"` deletion mode (after `_clear()`):

```python
index(
    [doc1, doc2],
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
```

```
{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
```

Indexing again should result in both documents getting skipped -- also skipping the embedding operation!

```python
index(
    [doc1, doc2],
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
```
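The skip behavior described here comes from the record manager remembering content hashes between runs. As a rough sketch of the idea (illustrative only, not LangChain's actual implementation; `index_docs` and the dict-backed store are made up for this example), the second run skips unchanged documents and never calls the embedding function:

```python
import hashlib

def doc_hash(content: str, source: str) -> str:
    """Stable hash of a document's content plus its source id."""
    return hashlib.sha256(f"{source}\x00{content}".encode()).hexdigest()

def index_docs(docs, record_store, embed):
    """Embed only documents whose hash is not already recorded."""
    stats = {"num_added": 0, "num_skipped": 0}
    for content, source in docs:
        h = doc_hash(content, source)
        if h in record_store:
            stats["num_skipped"] += 1      # unchanged: no embedding call
            continue
        record_store[h] = embed(content)   # new or changed: embed and record
        stats["num_added"] += 1
    return stats

docs = [("kitty", "kitty.txt"), ("doggy", "doggy.txt")]
store = {}
fake_embed = lambda text: [float(len(text))]  # stand-in for a real embedding call

print(index_docs(docs, store, fake_embed))  # first run: both added
print(index_docs(docs, store, fake_embed))  # second run: both skipped
```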

Any log messages given by the failure

Expected/desired behavior

Indexing at the chunk level, not at the entire-document level. This would avoid the cost of re-embedding and re-indexing large files when only part of the content was modified. A SQL DB is required for the record manager.
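A minimal sketch of the desired chunk-level behavior (illustrative only; `reindex_chunks` and the set-backed record store are invented for this example): by hashing each chunk rather than the whole file, editing one paragraph re-embeds only that one chunk.

```python
import hashlib

def chunk_key(source: str, chunk: str) -> str:
    """Hash of chunk content scoped to its source document."""
    return hashlib.sha256(f"{source}\x00{chunk}".encode()).hexdigest()

def reindex_chunks(source, chunks, seen, embed):
    """Re-embed only chunks whose content hash is new for this source."""
    embedded = 0
    for chunk in chunks:
        key = chunk_key(source, chunk)
        if key not in seen:
            seen.add(key)
            embed(chunk)   # only changed or new chunks hit the embedding API
            embedded += 1
    return embedded

seen = set()
calls = []
embed = calls.append  # stand-in for the real embedding API

v1 = ["intro text", "body text", "conclusion"]
reindex_chunks("report.pdf", v1, seen, embed)      # embeds all 3 chunks

v2 = ["intro text", "body text EDITED", "conclusion"]
reindex_chunks("report.pdf", v2, seen, embed)      # embeds only the edited chunk
```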

OS and Version?

macOS (Ventura)

azd version?

azd version 1.9.5

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

pamelafox commented 1 month ago

Our splitting algorithm already chunks based on a max token count. Can you clarify further what enhancement you're requesting? What is the record manager for?
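For context on what "chunks based on a max token count" means, here is a toy sketch (not the repo's actual splitter, which uses a real tokenizer and sentence boundaries; this version just counts whitespace-separated words):

```python
def split_by_max_tokens(text: str, max_tokens: int):
    """Greedily pack whitespace tokens into chunks of at most max_tokens."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

chunks = split_by_max_tokens("one two three four five six seven", 3)
# chunks == ["one two three", "four five six", "seven"]
```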

adrianruchti commented 1 month ago

Good morning Pamela, thank you for getting back to me. I am referring to indexing chunks. In the langchain world you would define the vectorstore:

```python
index_name: str = "langchain-vector-demo"
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)
```

import a SQL record manager:

```python
from langchain.indexes import SQLRecordManager, index
```

define the record manager:

```python
namespace = f"azure/{COLLECTION_NAME}"
record_manager = SQLRecordManager(namespace, db_url=CONNECTION_STRING)
record_manager.create_schema()
```

then ingest the docs using the index function:

```python
index(
    docs,
    record_manager,
    vector_store,
    cleanup="incremental",
    source_id_key="source",
)
```