castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Improper Contriever encoding with the current pyserini.encode class #1911

Closed cramraj8 closed 1 month ago

cramraj8 commented 1 month ago

@lintool #1907 I tried to encode the BEIR and Mr. TyDi / MIRACL datasets with the Contriever and mContriever models for indexing, but pyserini.encode fails with the following error:

__main__.py encoder: error: argument --encoder-class: invalid choice: 'contriever' (choose from 'dpr', 'bpr', 'tct_colbert', 'ance', 'sentence-transformers', 'auto')

I then set --encoder-class to auto, but the scores I reproduced were much worse. I found that although Contriever is loaded through the AutoDocumentEncoder class, elsewhere in the code it is given a different pooling operation than the one the auto class uses.
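To illustrate why the pooling choice matters, here is a minimal, dependency-free sketch of the two strategies. It assumes the auto class falls back to taking the first ([CLS]) token, while Contriever's published inference code averages the token embeddings under the attention mask. The function names and toy numbers are hypothetical and are not Pyserini code.

```python
from typing import List

Vector = List[float]


def cls_pool(token_embeddings: List[Vector], attention_mask: List[int]) -> Vector:
    """Use the embedding of the first token ([CLS]) as the text vector."""
    return list(token_embeddings[0])


def mean_pool(token_embeddings: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the embeddings of non-padding tokens (mask value 1)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    # Guard against an all-zero mask to avoid dividing by zero.
    return [t / max(count, 1) for t in total]


# Toy input: four token vectors, the last one is padding (made-up values).
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

cls_vec = cls_pool(tokens, mask)
mean_vec = mean_pool(tokens, mask)
print(cls_vec, mean_vec)
```

The two functions return different vectors for the same model output, so a model trained with one pooling and encoded with the other would be expected to retrieve poorly.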

After I cloned the repository and added a contriever option to --encoder-class, everything worked and I was able to reproduce the scores.
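A rough sketch of that kind of change, using only argparse, might look like the following. This is not the actual Pyserini patch: the mapping name, the helper functions, and the pooling value assigned to each class are assumptions made for illustration.

```python
import argparse

# Hypothetical mapping from an encoder-class choice to pooling settings.
# The specific values are illustrative assumptions, not Pyserini's config.
POOLING_BY_ENCODER_CLASS = {
    "auto": {"pooling": "cls"},
    "contriever": {"pooling": "mean"},
}


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser that accepts 'contriever' as an encoder class."""
    parser = argparse.ArgumentParser(prog="encode-sketch")
    parser.add_argument(
        "--encoder-class",
        choices=sorted(POOLING_BY_ENCODER_CLASS),
        default="auto",
    )
    return parser


def encoder_kwargs(encoder_class: str) -> dict:
    """Return a copy of the keyword arguments for the chosen class."""
    return dict(POOLING_BY_ENCODER_CLASS[encoder_class])


args = build_parser().parse_args(["--encoder-class", "contriever"])
kwargs = encoder_kwargs(args.encoder_class)
print(kwargs)
```

Keeping the pooling setting next to the list of accepted choices means a newly added class cannot be accepted by the parser without also declaring how it pools.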

I can open a pull request for this issue soon.