AnswerDotAI / byaldi

Use late-interaction multi-modal models such as ColPali in just a few lines of code.
Apache License 2.0

Using Existing Index Results in Empty Index #8

ncoop57 closed this issue 1 week ago

ncoop57 commented 1 week ago

When trying to reuse an existing index I created, I found I got the following error:

/usr/local/lib/python3.10/dist-packages/colpali_engine/trainer/retrieval_evaluator.py in evaluate_colbert(self, qs, ps, batch_size)
     59                 ).to("cuda")
     60                 scores_batch.append(torch.einsum("bnd,csd->bcns", qs_batch, ps_batch).max(dim=3)[0].sum(dim=2))
---> 61             scores_batch = torch.cat(scores_batch, dim=1).cpu()
     62             scores.append(scores_batch)
     63         scores = torch.cat(scores, dim=0)

RuntimeError: torch.cat(): expected a non-empty list of Tensors

If I set overwrite=True when indexing my pdfs this does not happen. Here is a colab to reproduce: https://colab.research.google.com/drive/1E7I9pki9SiwPs-TsyYIg9E_DIXsEYvy6?usp=sharing

bclavie commented 1 week ago

Thanks for reporting!

This is actually an interesting edge-case, I'm not sure what the best behaviour would be here 🤔

The issue occurs because:

class ZoteroApp:
    def __init__(self, model_name, pdfs_folder):
        download_pdfs(pdfs_folder)
        self.rag_model = RAGMultiModalModel.from_pretrained(model_name)
        self.rag_model.index(input_path=pdfs_folder, index_name="zotero_papers", store_collection_with_index=True, overwrite=False)

This creates a new instance of rag_model and tries to create an index with it. Calling it twice in a row means the second call starts a new model instance and tries to create an index in the same location. Since overwrite is False, that second call does nothing, hence the message:

An index named zotero_papers already exists.
Use overwrite=True to delete the existing index and build a new one.
Exiting indexing without doing anything...

So when you try to query the index with the new instance, the search fails, because that instance hasn't loaded an index. The best (and currently only) way to re-use an index is to initialise RAG with the from_index() method, i.e. in your case modifying ZoteroApp to do this:

import os

from byaldi import RAGMultiModalModel


class ZoteroApp:
    def __init__(self, model_name, pdfs_folder):
        download_pdfs(pdfs_folder)  # user-defined helper from the original notebook
        index_name = "zotero_papers"
        index_path = os.path.join(".byaldi", index_name)
        if os.path.exists(index_path):
            # An index is already on disk: load it instead of rebuilding it
            self.rag_model = RAGMultiModalModel.from_index(index_path)
        else:
            # First run: load the base model and build the index
            self.rag_model = RAGMultiModalModel.from_pretrained(model_name)
            self.rag_model.index(input_path=pdfs_folder, index_name=index_name, store_collection_with_index=True, overwrite=False)

    def query(self, user_query, k=3):
        # Search whichever index was loaded or built in __init__
        results = self.rag_model.search(user_query, k=k)
        return results

should fix the issue (can't run now, I have to head out soon and I'm maxed out on open colab environments), since it'll load the index if it's present (and it is when the second initialisation is called).
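
For anyone who wants to see the failure mode without a GPU, here is a minimal sketch using a toy stand-in. ToyRAG, its methods and the on-disk format are all hypothetical and only mimic the behaviour described above (a silent early exit when the index exists, and a separate from_index() loader); this is not byaldi's actual implementation.

```python
import os
import tempfile


class ToyRAG:
    """Hypothetical stand-in for a retrieval model; not byaldi's real class."""

    def __init__(self, index_root):
        self.index_root = index_root
        self.pages = None  # nothing is loaded until index() or from_index() succeeds

    def index(self, index_name, docs, overwrite=False):
        path = os.path.join(self.index_root, index_name)
        if os.path.exists(path) and not overwrite:
            # Mimics the silent early exit: the instance stays empty.
            print(f"An index named {index_name} already exists.")
            return None
        os.makedirs(self.index_root, exist_ok=True)
        with open(path, "w") as f:
            f.write("\n".join(docs))
        self.pages = list(docs)

    @classmethod
    def from_index(cls, index_root, index_name):
        # Load a previously written index into a fresh instance.
        model = cls(index_root)
        with open(os.path.join(index_root, index_name)) as f:
            model.pages = f.read().splitlines()
        return model

    def search(self, query):
        if not self.pages:
            raise RuntimeError("no index loaded")
        return [p for p in self.pages if query in p]


root = tempfile.mkdtemp()
docs = ["late interaction", "vision models"]

first = ToyRAG(root)
first.index("zotero_papers", docs)  # builds the index

second = ToyRAG(root)
second.index("zotero_papers", docs)  # skipped: index exists, overwrite=False
second_is_empty = second.pages is None

reloaded = ToyRAG.from_index(root, "zotero_papers")
hits = reloaded.search("late")
```

The second instance ends up with nothing loaded even though the index is on disk, which is the situation the traceback comes from; only the from_index() path gives a usable instance.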

ncoop57 commented 1 week ago

I think just a simple error msg should suffice. I'll open a PR <3

wajeeha77 commented 1 week ago

Facing the same issue: I can't load an already computed index and use it, so I have to create the index over and over again.

bclavie commented 1 week ago

This is addressed in https://github.com/AnswerDotAI/byaldi/pull/12 and the upcoming associated release, 0.0.3. Indexing will now raise a ValueError rather than just return None with a cutesy print().
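
As a rough illustration of what that change means for calling code, here is a sketch of the "raise instead of silently returning" pattern. build_index, load_or_build and the error text are hypothetical names and wording chosen for the example, not the code from the PR; the point is only that a caller can now catch the exception and fall back to loading.

```python
import os
import tempfile


def build_index(index_root, index_name, overwrite=False):
    """Hypothetical builder that raises instead of silently returning None."""
    path = os.path.join(index_root, index_name)
    if os.path.exists(path) and not overwrite:
        raise ValueError(
            f"An index named {index_name} already exists. "
            "Use overwrite=True to rebuild it, or load the existing index."
        )
    os.makedirs(path, exist_ok=True)
    return path


def load_or_build(index_root, index_name):
    """Build the index on first use; fall back to the existing one afterwards."""
    try:
        return "built", build_index(index_root, index_name)
    except ValueError:
        # With byaldi, this is where one would call from_index() instead.
        return "loaded", os.path.join(index_root, index_name)


root = tempfile.mkdtemp()
first_status, _ = load_or_build(root, "zotero_papers")
second_status, _ = load_or_build(root, "zotero_papers")
```

With an exception, the mistake surfaces at indexing time rather than later as an unrelated torch.cat() error during search.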