bclavie / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0

error on using add_to_index() sequentially #174

Open ksadov opened 4 months ago

ksadov commented 4 months ago

After pulling in the most recent change, when I run the following script:

from ragatouille import RAGPretrainedModel

def main():
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

    documents = [f"document {i}" for i in range(5000)]
    RAG.index(
        collection=documents,
        index_name="demo",
    )

    for i in range(10):
        new_documents = [f"wefwfvaeves {i}"]

        # Add documents to the index
        RAG.add_to_index(new_documents)
        result = RAG.search(query="wefwfvaeves {i}", k=3)
        print("RESULTS", result)

    RAG2 = RAGPretrainedModel.from_index(".ragatouille/colbert/indexes/demo/")
    results2 = RAG2.search(query="sfeqsbsrfgb", k=3)
    print("RESULTS from loaded index", results2)

if __name__ == "__main__":
    main()

The first add_to_index() call in the loop successfully indexes and retrieves the given document. However, the second call results in the following error:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [9,0,0], thread: [96,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [9,0,0], thread: [97,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [9,0,0], thread: [98,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [9,0,0], thread: [99,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [9,0,0], thread: [100,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [9,0,0], thread: [101,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [9,0,0], thread: [102,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [9,0,0], thread: [103,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [9,0,0], thread: [94,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [9,0,0], thread: [95,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/home/ksadov/retrieval-experiments/minimal_rag.py", line 27, in <module>
    main()
  File "/home/ksadov/retrieval-experiments/minimal_rag.py", line 18, in main
    result = RAG.search(query="wefwfvaeves {i}", k=3)
  File "/home/ksadov/retrieval-experiments/venv/lib/python3.9/site-packages/ragatouille/RAGPretrainedModel.py", line 311, in search
    return self.model.search(
  File "/home/ksadov/retrieval-experiments/venv/lib/python3.9/site-packages/ragatouille/models/colbert.py", line 387, in search
    results = self.model_index.search(
  File "/home/ksadov/retrieval-experiments/venv/lib/python3.9/site-packages/ragatouille/models/index.py", line 317, in search
    results = [self._search(query, k, pids)]
  File "/home/ksadov/retrieval-experiments/venv/lib/python3.9/site-packages/ragatouille/models/index.py", line 252, in _search
    return self.searcher.search(query, k=k, pids=pids)
  File "/home/ksadov/retrieval-experiments/venv/lib/python3.9/site-packages/colbert/searcher.py", line 67, in search
    return self.dense_search(Q, k, filter_fn=filter_fn, pids=pids)
  File "/home/ksadov/retrieval-experiments/venv/lib/python3.9/site-packages/colbert/searcher.py", line 129, in dense_search
    pids, scores = self.ranker.rank(self.config, Q, filter_fn=filter_fn, pids=pids)
  File "/home/ksadov/retrieval-experiments/venv/lib/python3.9/site-packages/colbert/search/index_storage.py", line 116, in rank
    scores, pids = self.score_pids(config, Q, pids, centroid_scores)
  File "/home/ksadov/retrieval-experiments/venv/lib/python3.9/site-packages/colbert/search/index_storage.py", line 149, in score_pids
    idx_ = idx[codes_packed.long()]
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This is with Python 3.9.13 and faiss-gpu 1.7.3, though it seems to me like the error is caused by a failure to update some mapping or search struct internal to the RAGatouille library and not one of its dependencies.
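
To illustrate the kind of failure I suspect, here is a toy sketch in plain Python. It is not RAGatouille or ColBERT code: `build_lookup` and `gather` are hypothetical names, and the sizes are made up. It only shows how a lookup table sized when the index was first built would fail the same bounds check as the assertion above once it is indexed with a code that refers to something added later. This is a guess at the mechanism, not a confirmed root cause.

```python
def build_lookup(num_entries):
    """Build a lookup table with one slot per entry known at build time."""
    return [True] * num_entries


def gather(lookup, codes):
    """Mimic `lookup[codes]`: fetch one slot per code, with an explicit bounds check."""
    for code in codes:
        # Same condition as the CUDA assertion: -size <= index < size.
        if not -len(lookup) <= code < len(lookup):
            raise IndexError(
                f"index {code} is out of bounds for a table of size {len(lookup)}"
            )
    return [lookup[code] for code in codes]


# Table built for the original index (hypothetical size).
stale_lookup = build_lookup(4)

# Codes that only reference original entries work fine.
ok = gather(stale_lookup, [0, 3])

# A code that references an entry added after the table was built does not.
try:
    gather(stale_lookup, [0, 4])
    failed = False
except IndexError:
    failed = True
```

On the GPU the equivalent out-of-range gather does not raise a Python exception at the offending line; it surfaces as the device-side assert shown in the log.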

bclavie commented 4 months ago

Hey, thanks for flagging

@jlscheerer @Anmol6, would you have a minute to take a look? It seems like there's some sort of indexing issue somewhere with the CRUD functionalities in colbert 🤔

ksadov commented 4 months ago

Seems like it could actually be an issue in the ColBERT repo: https://github.com/stanford-futuredata/ColBERT/issues/261 looks related.

TakshPanchal commented 3 months ago

As I mentioned in stanford-futuredata/ColBERT#261, the problem is that IndexUpdater.persist_to_disk updates only the embedding vectors. ColBERT's index folder has a collection.json file in which all the docs are saved. IndexUpdater.persist_to_disk should also update that collection, and after the update the index searcher should be refreshed with the latest collection.
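
A minimal sketch of the collection half of that idea, assuming collection.json holds a flat JSON list of document strings (an assumption about the layout, not something confirmed in this thread). `append_to_collection` is a hypothetical helper, not part of ColBERT or RAGatouille, and it neither touches the embeddings nor refreshes the searcher.

```python
import json
import os
import tempfile


def append_to_collection(index_dir, new_documents):
    """Append documents to <index_dir>/collection.json and return the new size.

    Assumes the file is a flat JSON list of strings (an assumption).
    """
    path = os.path.join(index_dir, "collection.json")
    with open(path, "r", encoding="utf-8") as f:
        collection = json.load(f)

    collection.extend(new_documents)

    # Write to a temporary file first, then swap it in, so a crash mid-write
    # does not leave a truncated collection.json behind.
    fd, tmp_path = tempfile.mkstemp(dir=index_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(collection, f)
    os.replace(tmp_path, path)
    return len(collection)


# Usage on a throwaway directory standing in for an index folder.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "collection.json"), "w", encoding="utf-8") as f:
    json.dump(["document 0", "document 1"], f)

new_size = append_to_collection(demo_dir, ["wefwfvaeves 0"])
with open(os.path.join(demo_dir, "collection.json"), encoding="utf-8") as f:
    stored = json.load(f)
```

Even with the file updated like this, a searcher that was created before the update would presumably still need to be rebuilt or reloaded to see the new documents.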