Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Implementation of Langchain indexing or any other similar solution for indexing on chunk level, not on entire document level #1808

Open adrianruchti opened 1 month ago

adrianruchti commented 1 month ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

SQL DB setup required aside from AzureSearch

Langchain compatible vectorstores: AnalyticDB, AstraDB, AzureCosmosDBVectorSearch, AzureSearch, …

```python
from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

namespace = f"elasticsearch/{collection_name}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)
record_manager.create_schema()

doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})
```

Using the `"incremental"` deletion mode (after `_clear()`):

```python
index(
    [doc1, doc2],
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
```

```
{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
```

Indexing again should result in both documents getting skipped -- also skipping the embedding operation!

```python
index(
    [doc1, doc2],
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
```
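The skip behavior described here comes from the record manager remembering content hashes between runs. As a rough sketch of the idea (illustrative only, not LangChain's actual implementation; `index_docs` and the dict-backed store are made up for this example), the second run skips unchanged documents and never calls the embedding function:

```python
import hashlib

def doc_hash(content: str, source: str) -> str:
    """Stable hash of a document's content plus its source id."""
    return hashlib.sha256(f"{source}\x00{content}".encode()).hexdigest()

def index_docs(docs, record_store, embed):
    """Embed only documents whose hash is not already recorded."""
    stats = {"num_added": 0, "num_skipped": 0}
    for content, source in docs:
        h = doc_hash(content, source)
        if h in record_store:
            stats["num_skipped"] += 1      # unchanged: no embedding call
            continue
        record_store[h] = embed(content)   # new or changed: embed and record
        stats["num_added"] += 1
    return stats

docs = [("kitty", "kitty.txt"), ("doggy", "doggy.txt")]
store = {}
fake_embed = lambda text: [float(len(text))]  # stand-in for a real embedding call

print(index_docs(docs, store, fake_embed))  # first run: both added
print(index_docs(docs, store, fake_embed))  # second run: both skipped
```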

Any log messages given by the failure

Expected/desired behavior

Indexing at the chunk level, not at the entire-document level. This would avoid the cost of re-embedding and re-indexing large files when only part of the content was modified. A SQL DB is required for the record manager.
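A minimal sketch of the desired chunk-level behavior (illustrative only; `reindex_chunks` and the set-backed record store are invented for this example): by hashing each chunk rather than the whole file, editing one paragraph re-embeds only that one chunk.

```python
import hashlib

def chunk_key(source: str, chunk: str) -> str:
    """Hash of chunk content scoped to its source document."""
    return hashlib.sha256(f"{source}\x00{chunk}".encode()).hexdigest()

def reindex_chunks(source, chunks, seen, embed):
    """Re-embed only chunks whose content hash is new for this source."""
    embedded = 0
    for chunk in chunks:
        key = chunk_key(source, chunk)
        if key not in seen:
            seen.add(key)
            embed(chunk)   # only changed or new chunks hit the embedding API
            embedded += 1
    return embedded

seen = set()
calls = []
embed = calls.append  # stand-in for the real embedding API

v1 = ["intro text", "body text", "conclusion"]
reindex_chunks("report.pdf", v1, seen, embed)      # embeds all 3 chunks

v2 = ["intro text", "body text EDITED", "conclusion"]
reindex_chunks("report.pdf", v2, seen, embed)      # embeds only the edited chunk
```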

OS and Version?

macOS (Ventura)

azd version?

azd version 1.9.5

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

pamelafox commented 1 month ago

Our splitting algorithm already chunks based on a max token count. Can you clarify further what enhancement you're requesting? What is the record manager for?
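For context on what "chunks based on a max token count" means, here is a toy sketch (not the repo's actual splitter, which uses a real tokenizer and sentence boundaries; this version just counts whitespace-separated words):

```python
def split_by_max_tokens(text: str, max_tokens: int):
    """Greedily pack whitespace tokens into chunks of at most max_tokens."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

chunks = split_by_max_tokens("one two three four five six seven", 3)
# chunks == ["one two three", "four five six", "seven"]
```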

adrianruchti commented 1 month ago

Good morning Pamela, thank you for getting back to me. I am referring to indexing chunks. In the langchain world you would define the vectorstore:

```python
index_name: str = "langchain-vector-demo"
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)
```

import a SQL record manager:

```python
from langchain.indexes import SQLRecordManager, index
```

define the record manager:

```python
namespace = f"azure/{COLLECTION_NAME}"
record_manager = SQLRecordManager(namespace, db_url=CONNECTION_STRING)
record_manager.create_schema()
```

then ingest the docs using the index function:

```python
index(
    docs,
    record_manager,
    vector_store,
    cleanup="incremental",
    source_id_key="source",
)
```