castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0
1.69k stars 377 forks source link

WTF? Two versions of QueryEncoder and two versions of AutoQueryEncoder #2009

Open lintool opened 1 month ago

lintool commented 1 month ago

We have two versions of QueryEncoder and two versions of AutoQueryEncoder. One set of classes is in pyserini.search.faiss, the other set is in pyserini.encode.

(I'm pointing to code at commit just prior to #1997 - because I think that commit breaks a number of things.)

@MXueguang has clearly stated in #1728 that the version in pyserini.encode is actually the one we should use. This is true, because only that version has the option to correctly handle the query prefix, which is needed for BGE to work correctly. However, the QueryEncoder in pyserini.search.faiss is the one that actually works, because only that version downloads query encodings.

So, here's the puzzle: How did we get into a state where we're using AutoQueryEncoder in pyserini.encode but QueryEncoder in pyserini.search.faiss, where the code is so crazily intertwined, and all the regressions pass?

Here's the crazy answer:

In pyserini/search/faiss/__main__.py, this is the import statement:

from pyserini.search import FaissSearcher, BinaryDenseSearcher, TctColBertQueryEncoder, QueryEncoder, \
    DprQueryEncoder, BprQueryEncoder, DkrrDprQueryEncoder, AnceQueryEncoder, AggretrieverQueryEncoder, DenseVectorAveragePrf, \
    DenseVectorRocchioPrf, DenseVectorAncePrf, OpenAIQueryEncoder, ClipQueryEncoder

from pyserini.encode import PcaEncoder, CosDprQueryEncoder, AutoQueryEncoder

So the Faiss searcher is getting most of the models from pyserini.search, and if you trace the imports to pyserini/search/__init__.py, we see the imports "loop back to itself":

from .faiss import DenseSearchResult, PRFDenseSearchResult, FaissSearcher, BinaryDenseSearcher, QueryEncoder, \
    DprQueryEncoder, BprQueryEncoder, DkrrDprQueryEncoder, TctColBertQueryEncoder, AnceQueryEncoder, AggretrieverQueryEncoder, AutoQueryEncoder, ClipQueryEncoder
from .faiss import AnceEncoder
from .faiss import DenseVectorAveragePrf, DenseVectorRocchioPrf, DenseVectorAncePrf
from .faiss import OpenAIQueryEncoder

Which means that for most of the encoder classes, the implementations in pyserini/search/faiss/_searcher.py are used.

Except for AutoQueryEncoder (and CosDprQueryEncoder, but that's an aside).

So now that we understand what's going on, it's probably easier to fix. This also means that #1997 is broken, because it uses the wrong implementation of AutoQueryEncoder.

MXueguang commented 1 month ago

We actually also need to take care about imports in https://github.com/castorini/pyserini/blob/master/pyserini/encode/query.py What do you think on this?

lintool commented 1 month ago

Yes, definitely, that will need refactoring. First this though: https://github.com/castorini/pyserini/pull/2008

lintool commented 1 month ago

Leaving open because we still need to think about what to do with: