AnswerDotAI / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0

Appending individual documents with add_to_index #248

Open aaraya-rr opened 2 months ago

aaraya-rr commented 2 months ago

I would like to know if it is possible for add_to_index to allow adding new documents to an already existing index without having to recalculate the embeddings for all the previously indexed documents.

I’m not sure if I’m doing something wrong, but each time I add a new document, the embeddings for all the already indexed documents are regenerated, which makes the process scale poorly as the index grows.

What I would like to do is index documents one by one using add_to_index, since I don’t want to have 100k documents in memory. Is this possible?

(I’m aware that the add_to_index function is still experimental, but I would appreciate knowing if I’m missing something in my approach.)

My code:

    def load_rag(self, index_name):
        # Load an already existing index from disk
        index_path = f".ragatouille/colbert/indexes/{index_name}/"
        return RAGPretrainedModel.from_index(index_path)

    def add_document(self, index_name, chunks, document_id, url):
        # One metadata dict per chunk
        metadatas = [{"url": url, "document_id": document_id}] * len(chunks)
        try:
            # Append to the existing index
            RAG = self.load_rag(index_name)
            RAG.add_to_index(
                chunks,
                new_document_metadatas=metadatas,
                split_documents=False,
            )
        except FileNotFoundError:
            # First document: the index doesn't exist yet, so create it
            logging.info(f"🔔 There are no documents in the index {index_name}; the index will be created")
            RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
            RAG.index(
                collection=chunks,
                document_metadatas=metadatas,
                index_name=index_name,
                split_documents=False,
            )
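As a possible workaround while `add_to_index` re-encodes the whole collection, one option is to buffer chunks in memory and flush them to the index in larger batches, so the full re-encode is paid once per batch instead of once per document. This is a minimal pure-Python sketch, not RAGatouille API; the `flush_fn` callback is a hypothetical hook that would wrap the `add_to_index`/`index` call from the code above:

```python
from typing import Any, Callable


class BatchedIndexer:
    """Buffer (chunk, metadata) pairs and flush them in batches,
    so each expensive index write covers many documents."""

    def __init__(
        self,
        flush_fn: Callable[[list[str], list[dict[str, Any]]], None],
        batch_size: int = 1000,
    ):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self._chunks: list[str] = []
        self._metadatas: list[dict[str, Any]] = []

    def add(self, chunks: list[str], metadata: dict[str, Any]) -> None:
        # One metadata dict per chunk, mirroring the code above
        self._chunks.extend(chunks)
        self._metadatas.extend([metadata] * len(chunks))
        if len(self._chunks) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Write the buffered chunks out in one call, then clear the buffer
        if self._chunks:
            self.flush_fn(self._chunks, self._metadatas)
            self._chunks, self._metadatas = [], []
```

With e.g. `batch_size=1000`, the re-encoding shown in the logs below would happen once per thousand chunks rather than once per document; call `flush()` at the end to write any remainder.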

Logs from appending documents one at a time (the embeddings are recalculated on each call):

[Sep 12, 18:56:39] [0]       #> Encoding 1164 passages..
[Sep 12, 18:56:45] [0]       avg_doclen_est = 208.9011993408203      len(local_sample) = 1,164
[Sep 12, 18:56:45] [0]       Creating 4,096 partitions.
[Sep 12, 18:56:45] [0]       *Estimated* 243,160 embeddings.
[Sep 12, 18:56:45] [0]       #> Saving the indexing plan to .ragatouille/colbert/indexes/colbert_debug_chunks/plan.json ..

...

[Sep 12, 18:57:02] [0]       #> Encoding 1173 passages..
[Sep 12, 18:57:08] [0]       avg_doclen_est = 208.96163940429688     len(local_sample) = 1,173
[Sep 12, 18:57:08] [0]       Creating 4,096 partitions.
[Sep 12, 18:57:08] [0]       *Estimated* 245,112 embeddings.
[Sep 12, 18:57:08] [0]       #> Saving the indexing plan to .ragatouille/colbert/indexes/colbert_debug_chunks/plan.json ..