Hi @sujitpal, thanks for your issue.
Re: Problem 1, I think this is a known issue; @MXueguang can help with that.
Re: Problem 2, yes, you'll need to build a dense index, but currently Pyserini does not support "dense indexing" your own custom collections. The reason is that transformer encoders are trained to be collection-specific, so, for example, using `tct_colbert` on your collection out of the box will likely lead to terrible results. This is a known issue that the community is working on, but we're basically talking about an open research problem. Training dense retrieval models (from scratch) is beyond the scope of Pyserini; for that, you'll have to look at the DPR repo https://github.com/facebookresearch/DPR or something similar.
Hope this helps!
Hi @sujitpal
Problem 1 is related to https://github.com/castorini/pyserini/issues/567; we currently fix it by restricting `transformers<=4.5.0`.
About Problem 2: we haven't integrated the dense indexing process into Pyserini yet; I am working on that in https://github.com/castorini/pyserini/issues/497. But we do have some scripts right now that may unblock you:
`pyserini/scripts/msmarco-passage/encode_corpus.py`

This script creates a dense (Faiss) index using the TCT_ColBERT encoder. Run it on your corpus as follows:

```bash
python encode_corpus.py --encoder castorini/tct_colbert-msmarco \
    --corpus <path/to/dir-with-jsonl-corpus> --index <dense index path> --device cuda:0
```
This will encode your corpus with the TCT_ColBERT encoder, as I notice you are trying `TctColBertQueryEncoder`.
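The corpus directory should contain documents in Pyserini's standard JSONL format (the same layout described in the docs under "How do I index and search my own documents?"), e.g.:

```json
{"id": "doc1", "contents": "The Manhattan Project and its atomic bomb helped bring an end to World War II."}
```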
TCT_ColBERT was trained on the MS MARCO passage dataset. As @lintool said, dense retrieval on a custom collection (cross-domain zero-shot) may lead to terrible results.
Is there a recipe I could use if I am able to provide my own document vectors? Something like the example below, similar to the JSONL format provided in the documentation for "How do I index and search my own documents?". Basically, something that takes these vectors and stuffs them into a FAISS index in a way that Pyserini can consume for dense and hybrid search.
```json
{
  "id": "doc1",
  "contents": "The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.",
  "vec": [0.1234, 0.4567, ..., 0.1245]
}
```
Also @MXueguang, thank you for the `encode_corpus.py` pointer; I will check it out.
@sujitpal we don't have a custom-vectors feature right now, but it is not very hard to modify things so it can run with Pyserini. tl;dr: you need to create 1. a dense index, and 2. encoded queries.

Looking at `encode_corpus.py`, you will notice that a dense index for Pyserini is a directory containing two files: 1. `docid`, 2. `index`. The `index` file is just a Faiss index; by modifying `encode_corpus.py`, you can load your "vec" values into the Faiss index, and then create the `docid` file, which stores the corresponding docids in a text file. Once you have the dense index and encoded queries, you can run dense search the same way as in https://github.com/castorini/pyserini/blob/master/docs/experiments-tct_colbert.md#dense-retrieval
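For concreteness, here is a minimal sketch of that modification. It assumes your documents sit in JSONL files carrying `id` and `vec` fields (as in your example above), that all vectors share one dimension `dim`, and that inner product is the right similarity for your encoder (it is for TCT_ColBERT); the `docid`/`index` file names match what the dense searcher expects.

```python
import json
import os

import faiss
import numpy as np

def build_dense_index(corpus_dir: str, index_dir: str, dim: int):
    """Build a Pyserini-style dense index (docid + index files) from
    JSONL documents carrying precomputed vectors in a "vec" field."""
    os.makedirs(index_dir, exist_ok=True)
    docids, vectors = [], []
    for fname in sorted(os.listdir(corpus_dir)):
        if not fname.endswith(".jsonl"):
            continue
        with open(os.path.join(corpus_dir, fname)) as f:
            for line in f:
                doc = json.loads(line)
                docids.append(doc["id"])
                vectors.append(doc["vec"])
    # Inner-product (dot-product) index, matching TCT_ColBERT scoring.
    index = faiss.IndexFlatIP(dim)
    index.add(np.asarray(vectors, dtype="float32"))
    faiss.write_index(index, os.path.join(index_dir, "index"))
    # docids in the same order the vectors were added to the index.
    with open(os.path.join(index_dir, "docid"), "w") as f:
        f.write("\n".join(docids) + "\n")
```

Point the dense searcher at the resulting directory and it should behave like any other dense index.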
This is very helpful, thank you!
Issue seems to have been resolved. Reopen if there's follow-up!
Yes, thank you, I have enough information to go forward now!
If you come up with something interesting and generally useful, please contribute back.
Yes, of course, once I have something working, I will post back on this issue, and we can go from there.
Just wanted to thank @lintool and @MXueguang for the instructions. I was able to create the FAISS sidecar index (docid + index) and use the sparse, dense, and hybrid retrieval mechanisms. Sharing code here in case it is useful to add to the documentation and/or for others who might have the same requirement as mine. The first set of code blocks uses a pre-encoded set of queries in the pickled `embedding.pkl` file; the next set encodes the queries on the fly using the model and matches against the sidecar FAISS index.
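For anyone reproducing this: the `embedding.pkl` consumed by `QueryEncoder(encoded_query_dir=...)` is, at least in the Pyserini version I tested, a pickled pandas DataFrame with `text` and `embedding` columns. A rough sketch of producing it:

```python
# Sketch: pre-encode queries into embedding.pkl for QueryEncoder.
# Assumes (based on my reading of the Pyserini source) a pickled
# pandas DataFrame with "text" and "embedding" columns.
import os

import pandas as pd
from sentence_transformers import SentenceTransformer

queries = ["coronavirus origin"]
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
embeddings = [model.encode([q])[0] for q in queries]

os.makedirs("../data/query-embeddings", exist_ok=True)
df = pd.DataFrame({"text": queries, "embedding": embeddings})
df.to_pickle("../data/query-embeddings/embedding.pkl")
```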
```python
# Sparse (BM25) retrieval against the Lucene index.
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher("../data/indexes/cord19_local_idx")
hits = searcher.search("coronavirus origin")
for i in range(10):
    print(i, hits[i].docid, hits[i].score)
```
```python
# Dense retrieval using pre-encoded queries and the sidecar FAISS index.
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder

# encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
encoder = QueryEncoder(encoded_query_dir="../data/query-embeddings")
searcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx", encoder)
hits = searcher.search("coronavirus origin")
for i in range(10):
    print(i, hits[i].docid, hits[i].score)
```
```python
# Hybrid retrieval: combines the sparse and dense searchers.
from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from pyserini.hsearch import HybridSearcher

ssearcher = SimpleSearcher("../data/indexes/cord19_local_idx")
encoder = QueryEncoder(encoded_query_dir="../data/query-embeddings")
dsearcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx", encoder)
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search("coronavirus origin")
for i in range(10):
    print(i, hits[i].docid, hits[i].score)
```
```python
# Dense retrieval with queries encoded on the fly by a
# sentence-transformers model instead of pre-encoded embeddings.
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from sentence_transformers import SentenceTransformer

class CustomQueryEncoder(QueryEncoder):
    def __init__(self, model):
        self.has_model = True
        self.model = model

    def encode(self, query: str):
        return self.model.encode([query])[0]

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
encoder = CustomQueryEncoder(model)
searcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx", encoder)
hits = searcher.search("coronavirus origin")
for i in range(10):
    print(i, hits[i].docid, hits[i].score)
```
```python
# Hybrid retrieval with on-the-fly query encoding.
from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, QueryEncoder
from pyserini.hsearch import HybridSearcher
from sentence_transformers import SentenceTransformer

class CustomQueryEncoder(QueryEncoder):
    def __init__(self, model):
        self.has_model = True
        self.model = model

    def encode(self, query: str):
        return self.model.encode([query])[0]

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
ssearcher = SimpleSearcher("../data/indexes/cord19_local_idx")
encoder = CustomQueryEncoder(model)
dsearcher = SimpleDenseSearcher("../data/indexes/cord19_local_idx", encoder)
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search("coronavirus origin")
for i in range(10):
    print(i, hits[i].docid, hits[i].score)
```
Thanks for the code snippets!
My environment:
**Problem 1**
I followed the instructions to create my own minimal index and was able to run the sparse retrieval example successfully. However, when I tried to run the dense retrieval example using `TctColBertQueryEncoder`, I encountered issues that seem to be caused by my having a newer version of the transformers library, where the `requires_faiss` and `requires_pytorch` methods have been replaced with a more general `requires_backends` method in `transformers.file_utils`. The following files were affected.
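The replacement itself is mechanical; in newer transformers the removed helpers map onto the generic `requires_backends` check, roughly as sketched below (the backend key strings are my assumption, so verify against the `transformers.file_utils` of your version):

```python
# Sketch of the in-place fix: the old per-backend helpers become thin
# wrappers over the new generic check. Backend key names ("faiss",
# "torch") are my best guess; check transformers.file_utils.
from transformers.file_utils import requires_backends

def requires_faiss(obj):
    requires_backends(obj, "faiss")

def requires_pytorch(obj):
    requires_backends(obj, "torch")
```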
**Problem 2**

Replacing them in place in the Pyserini code in my `site-packages` allowed me to move forward, but now I get the error message:

The `/path/to/lucene_index` above is a folder where my Lucene index was built using `pyserini.index`. I am guessing that an additional ANN index might need to be built from the data to allow dense searching to happen? I looked in the help for `pyserini.index`, but there did not seem to be anything indicating creation of an ANN index.

I can live with the first problem (since I have a local solution), but obviously some fix for it would be nice. For the second problem, some documentation or help with building a local index for dense searching would be very much appreciated.
Thanks!