castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Unable to do Dense search against own index #597

Closed sujitpal closed 3 years ago

sujitpal commented 3 years ago

My environment:

Problem 1

I followed the instructions to create my own minimal index and was able to run the sparse retrieval example successfully. However, when I tried to run the dense retrieval example using TctColBertQueryEncoder, I ran into failures that appear to be caused by my having a newer version of the transformers library, in which the requires_faiss and requires_pytorch methods have been replaced by a more general requires_backends method in transformers.file_utils. The following files were affected:

pyserini/dsearch/_dsearcher.py
pyserini/dsearch/_model.py

Problem 2

Patching these calls in place in the Pyserini code in my site-packages allowed me to move forward, but now I get this error message:

RuntimeError: Error in faiss::FileIOReader::FileIOReader(const char*) at /__w/faiss-wheels/faiss-wheels/faiss/faiss/impl/io.cpp:81: Error: 'f' failed: could not open /path/to/lucene_index/index for reading: No such file or directory

The /path/to/lucene_index above is the folder where my Lucene index was built using pyserini.index. I am guessing that an additional ANN index needs to be built from the data before dense searching can happen? I looked at the help for pyserini.index, but nothing there seemed to indicate how to create an ANN index.

I can live with the first problem (since I have a local workaround), but obviously a fix would be nice. For the second problem, some documentation or help with building a local index for dense searching would be very much appreciated.

Thanks!

lintool commented 3 years ago

Hi @sujitpal, thanks for your issue.

Re: Problem 1, I think this is a known issue, @MXueguang can help with that.

Re: Problem 2, yes, you'll need to build a dense index, but currently Pyserini does not support "dense indexing" your own custom collections. The reason is that transformer encoders are trained to be collection-specific - so, for example, using tct_colbert on your collection out of the box will likely lead to terrible results. This is a known issue that the community is working on, but we're basically talking about an open research problem. Training dense retrieval models (from scratch) is beyond the scope of Pyserini - for that, you'll have to look at the DPR repo https://github.com/facebookresearch/DPR or something similar.

Hope this helps!

MXueguang commented 3 years ago

Hi @sujitpal

Problem 1 is related to https://github.com/castorini/pyserini/issues/567; we currently work around it by pinning transformers<=4.5.0.

About Problem 2: we haven't integrated the dense indexing process into Pyserini yet. I am working on that in https://github.com/castorini/pyserini/issues/497, but we do have some scripts right now that may unblock you.

pyserini/scripts/msmarco-passage/encode_corpus.py

It is a script that creates a dense (Faiss) index using the TCT_ColBERT encoder. Run it in the following way on your corpus:

python encode_corpus.py --encoder castorini/tct_colbert-msmarco --corpus <path/to/dir-with-jsonl-corpus> --index <dense index path> --device cuda:0

This will encode your corpus with the TCT_ColBERT encoder, since I notice you are trying TctColBertQueryEncoder. TCT_ColBERT was trained on the MS MARCO passage dataset, so as @lintool said, dense retrieval on a custom collection (cross-domain zero-shot) may lead to terrible results.
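For reference, the corpus directory is assumed to hold JSONL files in the same {"id", "contents"} shape as Pyserini's "How do I index and search my own documents?" guide. A minimal sketch of that layout (the corpus/ directory and docs.jsonl filename are just placeholders):

```python
# Hypothetical sketch: write a tiny JSONL corpus in the assumed layout,
# one JSON object per line with "id" and "contents" fields.
import json
import os

os.makedirs("corpus", exist_ok=True)
docs = [
    {"id": "doc1", "contents": "The Manhattan Project and its atomic bomb helped bring an end to World War II."},
    {"id": "doc2", "contents": "Its legacy of peaceful uses of atomic energy continues to have an impact on history and science."},
]
with open("corpus/docs.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```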

sujitpal commented 3 years ago

Is there a recipe I could use if I am able to provide my own document vectors? Something like the record below, similar to the JSONL format in the documentation for "How do I index and search my own documents?" - basically something that takes these vectors and stuffs them into a Faiss index in a way that Pyserini can consume for dense and hybrid search.

{
  "id": "doc1",
  "contents": "The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.",
  "vec": [ 0.1234, 0.4567, ..., 0.1245 ]
}
sujitpal commented 3 years ago

Also @MXueguang thank you for the encode_corpus.py I will check that out.

MXueguang commented 3 years ago

@sujitpal we don't have a custom-vectors feature right now, but it is not very hard to modify things to run with Pyserini. tl;dr: you need to create 1. a dense index and 2. encoded queries.

  1. Dense index: if you look at encode_corpus.py you will notice that the dense index for Pyserini is a directory containing two files: 1. docid, 2. index. The index file is just a Faiss index; modifying encode_corpus.py, you can load your "vec" values into the Faiss index, then create the docid file, which stores the corresponding docids in a text file (one per line).
  2. Encoded queries: a directory that contains a single pickle file https://github.com/castorini/pyserini/blob/ca6026be7aac1828505ef023b2aa3183096ba076/scripts/msmarco-passage/encode_queries.py#L37-L43

Once you have the dense index and encoded queries, you can run dense search the same way as in https://github.com/castorini/pyserini/blob/master/docs/experiments-tct_colbert.md#dense-retrieval

sujitpal commented 3 years ago

This is very helpful, thank you!

lintool commented 3 years ago

Issue seems to have been resolved. Reopen if there's follow-up!

sujitpal commented 3 years ago

Yes thank you, I have enough information to go forward now!

lintool commented 3 years ago

If you come up with something interesting and generally useful, please contribute back.

sujitpal commented 3 years ago

Yes, of course, once I have something working, I will post back on this issue, and we can go from there.

sujitpal commented 3 years ago

Just wanted to thank @lintool and @MXueguang for the instructions. I was able to create the Faiss sidecar index (docid + index) and use the sparse, dense, and hybrid retrieval mechanisms. Sharing code here in case it is useful to add to the documentation and/or for others with the same requirement. The first set of code blocks uses a pre-encoded set of queries in the pickled embedding.pkl file; the next set encodes the queries on the fly using the model and matches them against the sidecar Faiss index.

sparse retrieval (baseline, no change)

from pyserini.search import SimpleSearcher

searcher = SimpleSearcher("../data/indexes/cord19_local_idx")
hits = searcher.search("coronavirus origin")
for i in range(10):
    print(i, hits[i].docid, hits[i].score)

dense retrieval with pre-encoded queries

from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder

# encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
encoder = QueryEncoder(encoded_query_dir="../data/query-embeddings")
searcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                               encoder)
hits = searcher.search("coronavirus origin")

for i in range(10):
    print(i, hits[i].docid, hits[i].score)

hybrid retrieval with pre-encoded queries

from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from pyserini.hsearch import HybridSearcher

ssearcher = SimpleSearcher("../data/indexes/cord19_local_idx")
encoder = QueryEncoder(encoded_query_dir="../data/query-embeddings")
dsearcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                                encoder)
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search('coronavirus origin')

for i in range(0, 10):
    print(i, hits[i].docid, hits[i].score)

dense retrieval with custom query encoder, no pre-encoding

from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from sentence_transformers import SentenceTransformer

class CustomQueryEncoder(QueryEncoder):
    def __init__(self, model):
        self.has_model = True  # signal that queries are encoded on the fly
        self.model = model

    def encode(self, query: str):
        # return a single query embedding as a 1-D vector
        return self.model.encode([query])[0]

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
encoder = CustomQueryEncoder(model)
searcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                               encoder)
hits = searcher.search("coronavirus origin")

for i in range(10):
    print(i, hits[i].docid, hits[i].score)

hybrid retrieval with custom query encoder, no pre-encoding

from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from pyserini.hsearch import HybridSearcher
from sentence_transformers import SentenceTransformer

class CustomQueryEncoder(QueryEncoder):
    def __init__(self, model):
        self.has_model = True
        self.model = model

    def encode(self, query: str):
        return self.model.encode([query])[0]

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
ssearcher = SimpleSearcher("../data/indexes/cord19_local_idx")
encoder = CustomQueryEncoder(model)
dsearcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx",
                                encoder)
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search('coronavirus origin')

for i in range(0, 10):
    print(i, hits[i].docid, hits[i].score)
lintool commented 3 years ago

Thanks for the code snippets!