Open adrianruchti opened 4 months ago
Our splitting algorithm already chunks based on a max token count. Can you clarify what enhancement you're requesting? What is the record manager for?
Good morning Pamela, thank you for getting back to me. I am referring to indexing chunks. In the langchain world you would define the vectorstore:
    index_name: str = "langchain-vector-demo"
    vector_store: AzureSearch = AzureSearch(
        azure_search_endpoint=vector_store_address,
        azure_search_key=vector_store_password,
        index_name=index_name,
        embedding_function=embeddings.embed_query,
    )

import a SQL record manager:

    from langchain.indexes import SQLRecordManager, index

define the record manager:

    namespace = f"azure/{COLLECTION_NAME}"
    record_manager = SQLRecordManager(namespace, db_url=CONNECTION_STRING)
    record_manager.create_schema()

then ingest the docs using the index:

    index(
        docs,
        record_manager,
        vector_store,
        cleanup="incremental",  # must be a string, not a bare name
        source_id_key="source",
    )
This issue is for a: (mark with an x)
Minimal steps to reproduce
A SQL database must be set up in addition to Azure Search:
    namespace = f"elasticsearch/{collection_name}"
    record_manager = SQLRecordManager(
        namespace, db_url="sqlite:///record_manager_cache.sql"
    )
    record_manager.create_schema()

    doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
    doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})

"incremental" deletion mode

    # _clear() appears to be the helper from the LangChain indexing docs that
    # empties the index; it is not defined in this snippet.
    _clear()

    index(
        [doc1, doc2],
        record_manager,
        vectorstore,
        cleanup="incremental",
        source_id_key="source",
    )
{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
Indexing again should result in both documents getting skipped -- also skipping the embedding operation!
    index(
        [doc1, doc2],
        record_manager,
        vectorstore,
        cleanup="incremental",
        source_id_key="source",
    )
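To make the skipping behaviour concrete, here is a minimal sketch of the idea behind a record manager, using only the Python standard library. It is not LangChain's implementation: `TinyRecordManager`, `content_hash`, and the `records` table are hypothetical names chosen for illustration, and the embedding/upload step is only a comment.

```python
import hashlib
import sqlite3


def content_hash(text: str, source: str) -> str:
    """Hash chunk text together with its source so equal text in two files stays distinct."""
    return hashlib.sha256(f"{source}\x00{text}".encode("utf-8")).hexdigest()


class TinyRecordManager:
    """Hypothetical stand-in for a record manager: remembers which chunk hashes were indexed."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, source TEXT)"
        )

    def index(self, chunks: list[tuple[str, str]]) -> dict[str, int]:
        """Index (text, source) pairs and report what was added, skipped, and deleted."""
        stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
        seen_by_source: dict[str, set[str]] = {}
        for text, source in chunks:
            key = content_hash(text, source)
            seen_by_source.setdefault(source, set()).add(key)
            row = self.conn.execute(
                "SELECT 1 FROM records WHERE key = ?", (key,)
            ).fetchone()
            if row:
                stats["num_skipped"] += 1  # unchanged chunk: no embedding call needed
                continue
            # A real pipeline would embed the chunk and upload it to the search index here.
            self.conn.execute(
                "INSERT INTO records (key, source) VALUES (?, ?)", (key, source)
            )
            stats["num_added"] += 1
        # Incremental-style cleanup: for every source seen in this batch,
        # forget hashes that no longer appear in that source.
        for source, keys in seen_by_source.items():
            rows = self.conn.execute(
                "SELECT key FROM records WHERE source = ?", (source,)
            ).fetchall()
            for (key,) in rows:
                if key not in keys:
                    self.conn.execute("DELETE FROM records WHERE key = ?", (key,))
                    stats["num_deleted"] += 1
        self.conn.commit()
        return stats


manager = TinyRecordManager()
docs = [("kitty", "kitty.txt"), ("doggy", "doggy.txt")]
first = manager.index(docs)    # both chunks are new
second = manager.index(docs)   # both chunks are skipped
changed = manager.index([("kitty v2", "kitty.txt")])  # one added, stale one deleted
```

The second call touches only the SQL table, which is why a record manager needs its own database alongside the search index.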
Any log messages given by the failure
Expected/desired behavior
Indexing at the chunk level rather than at the whole-document level. This would avoid the cost of reindexing large files when only a part was modified. A SQL database is required for the record manager.
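As a rough illustration of the saving, the sketch below compares the chunk hashes of an old and a new version of a file and returns only the chunks that would need new embeddings. The fixed-size splitter and the names `split_into_chunks` and `chunks_to_embed` are assumptions made for this example, not code from this repository.

```python
import hashlib


def split_into_chunks(text: str, size: int) -> list[str]:
    """Naive fixed-size splitter, used only for illustration."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def chunk_hash(chunk: str) -> str:
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def chunks_to_embed(old_text: str, new_text: str, size: int = 20) -> list[str]:
    """Return only the chunks of new_text whose hash did not occur in old_text."""
    old_hashes = {chunk_hash(c) for c in split_into_chunks(old_text, size)}
    return [c for c in split_into_chunks(new_text, size) if chunk_hash(c) not in old_hashes]


# A three-chunk file in which only the middle chunk changes.
old = "A" * 20 + "B" * 20 + "C" * 20
new = "A" * 20 + "X" * 20 + "C" * 20
to_embed = chunks_to_embed(old, new)
total_chunks = len(split_into_chunks(new, 20))
```

Here one chunk out of three is re-embedded. With a fixed-size splitter, an insertion that shifts later chunk boundaries would change every following hash, so the saving depends on how stable the splitter's boundaries are.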
OS and Version?
azd version?
Versions
Mention any other details that might be useful