Open lintool opened 1 month ago
We also need to take care of the imports in https://github.com/castorini/pyserini/blob/master/pyserini/encode/query.py. What do you think about this?
Yes, definitely, that will need refactoring. First this though: https://github.com/castorini/pyserini/pull/2008
Leaving open because we still need to think about what to do with:
pyserini/encode/query.py
pyserini/encode/__main__.py
We have two versions of QueryEncoder and two versions of AutoQueryEncoder. One set of classes is in pyserini.search.faiss, the other set is in pyserini.encode.
QueryEncoder in pyserini/search/faiss/_searcher.py
QueryEncoder in pyserini/encode/_base.py
AutoQueryEncoder in pyserini/search/faiss/_searcher.py
AutoQueryEncoder in pyserini/encode/_auto.py
(I'm pointing to code at the commit just prior to #1997, because I think that commit breaks a number of things.)
@MXueguang has clearly stated in #1728 that the version in pyserini.encode is actually the one we should use. This is true, because only that version has the option to correctly handle the query prefix, which is needed for BGE to work correctly. However, the QueryEncoder in pyserini.search.faiss is the one that actually works, because only that version downloads query encodings.
So, here's the puzzle: How did we get into a state where we're using AutoQueryEncoder in pyserini.encode but QueryEncoder in pyserini.search.faiss, where the code is so crazily intertwined, and all the regressions pass?
Here's the crazy answer:
In pyserini/search/faiss/__main__.py, this is the import statement:
So the Faiss searcher is getting most of the models from pyserini.search, and if you trace the imports to pyserini/search/__init__.py, we see the imports "loop back to itself":
Which means that for most of the encoder classes, the implementations in pyserini/search/faiss/_searcher.py are used. Except for AutoQueryEncoder (and CosDprQueryEncoder, but that's an aside).
So now that we understand what's going on, it's probably easier to fix. This also means that #1997 is broken, because it uses the wrong implementation of AutoQueryEncoder.
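To make the mix-up concrete, here is a minimal sketch of the same pattern using toy in-memory modules. The module names toy_encode, toy_faiss_searcher, and toy_search, and the helpers make_module and origin, are hypothetical stand-ins and not the actual pyserini code or its real import statements. It only illustrates how a package-level namespace can re-export two same-named class families from different modules, and how checking a class's __module__ shows which implementation you actually got.

```python
import sys
import types


def make_module(name, source):
    """Create an in-memory module from source and register it in sys.modules."""
    module = types.ModuleType(name)
    sys.modules[name] = module
    exec(source, module.__dict__)
    return module


# Hypothetical stand-in for the pyserini.encode implementations.
make_module("toy_encode", """
class QueryEncoder:
    pass

class AutoQueryEncoder(QueryEncoder):
    pass
""")

# Hypothetical stand-in for the pyserini.search.faiss implementations.
make_module("toy_faiss_searcher", """
class QueryEncoder:
    pass

class AutoQueryEncoder(QueryEncoder):
    pass
""")

# Hypothetical package-level namespace that re-exports one name from each
# module, so callers end up with a mix of the two class families.
make_module("toy_search", """
from toy_faiss_searcher import QueryEncoder
from toy_encode import AutoQueryEncoder
""")

from toy_search import QueryEncoder, AutoQueryEncoder


def origin(cls):
    """Return the name of the module in which a class was defined."""
    return cls.__module__


print("QueryEncoder comes from:", origin(QueryEncoder))
print("AutoQueryEncoder comes from:", origin(AutoQueryEncoder))
# The two classes belong to different families, so this is False.
print("Same family:", issubclass(AutoQueryEncoder, QueryEncoder))
```

Running the same kind of __module__ check against the names imported in pyserini/search/faiss/__main__.py would be one quick way to confirm which implementation each encoder class resolves to before and after refactoring.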