castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Searcher should add a "normalize" argument? #1952

Open dayuyang1999 opened 1 month ago

dayuyang1999 commented 1 month ago

Hi,

I am using my own embedding model, bge-large-en-v1.5.

Because the model is trained to optimize cosine similarity, the index should be built with the --l2-norm option:

--l2-norm

However, when creating a FaissSearcher for search, there seems to be no option for normalizing the query embedding.

class FaissSearcher:
    """Simple Searcher for dense representation

    Parameters
    ----------
    index_dir : str
        Path to faiss index directory.
    """

    def __init__(self, index_dir: str, query_encoder: Union[QueryEncoder, str],
                 prebuilt_index_name: Optional[str] = None):
        requires_backends(self, "faiss")
        if not isinstance(query_encoder, str):
            self.query_encoder = query_encoder
        else:
            self.query_encoder = self._init_encoder_from_str(query_encoder)
        self.index, self.docids = self.load_index(index_dir)
        self.dimension = self.index.d
        self.num_docs = self.index.ntotal

        assert self.docids is None or self.num_docs == len(self.docids)
        if prebuilt_index_name:
            sparse_index = get_sparse_index(prebuilt_index_name)
            self.ssearcher = LuceneSearcher.from_prebuilt_index(sparse_index)

MXueguang commented 1 month ago

Hi @dayuyang1999, at search time, for L2-normalized vectors, we assume the index was built with vectors that are already normalized and that the query encoder generates normalized vectors. You can set l2-norm=true when you initialize the query encoder and then pass that query encoder to the searcher.
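
To illustrate why normalizing on the query side is enough, here is a minimal, library-free sketch. It assumes only that the searcher scores by inner product and calls an `encode(query)` method on whatever encoder it is given. `NormalizingQueryEncoder` and `toy_encode` are hypothetical names made up for this example, not part of Pyserini, and plain Python lists stand in for the arrays a real encoder would return.

```python
import math
from typing import Callable, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


class NormalizingQueryEncoder:
    """Hypothetical wrapper: L2-normalizes the output of any encode function."""

    def __init__(self, encode_fn: Callable[[str], Sequence[float]]):
        self._encode_fn = encode_fn

    def encode(self, query: str) -> List[float]:
        return l2_normalize(self._encode_fn(query))


def toy_encode(query: str) -> List[float]:
    # Stand-in for a real embedding model (assumption for illustration only).
    return [float(len(query)), float(query.count(" ") + 1), 1.0]


# A document vector as it would be stored in an index built with normalization.
doc_vec = [3.0, 4.0, 0.0]
indexed_doc_vec = l2_normalize(doc_vec)

encoder = NormalizingQueryEncoder(toy_encode)
query_vec = encoder.encode("dense retrieval")

# Inner product of two unit vectors equals the cosine similarity
# of the original, unnormalized vectors.
score = inner_product(query_vec, indexed_doc_vec)
```

With both sides normalized, the inner-product score matches the cosine similarity of the raw vectors, so the searcher itself does not need a separate normalize argument as long as the encoder passed to it already returns unit-length vectors.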