Easily use and train state-of-the-art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease of use, backed by research.
Appending individual documents with add_to_index #248
I would like to know whether add_to_index can add new documents to an already existing index without recalculating the embeddings for all previously indexed documents.
I may be doing something wrong, but each time I add a new document, the embeddings for every already indexed document are regenerated, so the process gets slower and slower as the index grows.
What I would like to do is index documents one by one using add_to_index, since I don't want to hold 100k documents in memory. Is this possible?
(I'm aware that the add_to_index function is still experimental, but I would appreciate knowing if I'm missing something in my approach.)
My code:
```python
import logging

from ragatouille import RAGPretrainedModel


# Methods excerpted from my indexing class.
def load_rag(self, index_name):
    index_path = f".ragatouille/colbert/indexes/{index_name}/"
    return RAGPretrainedModel.from_index(index_path)

def add_document(self, index_name, chunks, document_id, url):
    metadatas = [{"url": url, "document_id": document_id}] * len(chunks)
    try:
        # Load the existing index and append the new chunks to it.
        RAG = self.load_rag(index_name)
        RAG.add_to_index(chunks, new_document_metadatas=metadatas, split_documents=False)
    except FileNotFoundError:
        # No index on disk yet: create one from the base ColBERTv2 checkpoint.
        logging.info(f"🔔 There are no documents in the index {index_name}, the index will be created")
        RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
        RAG.index(
            collection=chunks,
            document_metadatas=metadatas,
            index_name=index_name,
            split_documents=False,
        )
```
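As a possible workaround, I'm considering buffering chunks and flushing them to the index in batches, so that whatever recalculation add_to_index triggers is paid once per batch rather than once per document. A minimal sketch of the idea, using only the add_to_index call shown above; the BatchedAppender class, its method names, and the batch size are illustrative, not part of RAGatouille:

```python
# Hypothetical batching wrapper: the class, names, and batch_size are
# illustrative, not part of the RAGatouille API.
class BatchedAppender:
    def __init__(self, rag, batch_size=1000):
        self.rag = rag              # a RAGPretrainedModel loaded via from_index()
        self.batch_size = batch_size
        self.chunks = []
        self.metadatas = []

    def add(self, chunks, metadata):
        # Buffer chunks in memory; only one batch is held at a time,
        # so the full 100k-document collection never has to be resident.
        self.chunks.extend(chunks)
        self.metadatas.extend([metadata] * len(chunks))
        if len(self.chunks) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.chunks:
            return
        # One add_to_index call per batch: if the index re-encodes existing
        # documents on every call, this amortizes that cost over the batch.
        self.rag.add_to_index(
            self.chunks,
            new_document_metadatas=self.metadatas,
            split_documents=False,
        )
        self.chunks, self.metadatas = [], []
```

This only reduces how often the recalculation happens; it doesn't answer whether add_to_index can avoid it entirely, which is the main question here.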
Logs from appending documents individually (showing the embeddings for all previously indexed documents being recalculated):