castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Dense indexing corpus not working #643

Closed. shashankg7 closed this issue 3 years ago

shashankg7 commented 3 years ago

Hi,

I have a Wikipedia-related custom corpus, which I am trying to index using dense vectors.

But the command "python -m pyserini.dindex" is not working.

I am getting an error:

AttributeError: module 'pyserini' has no attribute 'dindex'

I have installed all of the dependencies, so I am not sure what is wrong.

Please let me know.

lintool commented 3 years ago

Ah, this is a newly introduced feature that hasn't been published in the PyPI package yet... so you'll need to get a dev installation: https://github.com/castorini/pyserini/#development-installation
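For anyone hitting the same error, a quick way to tell whether the installed package actually ships the module is to look it up before invoking it. The snippet below is only a minimal sketch using the standard library; the helper name `has_submodule` is made up for illustration and is not part of Pyserini.

```python
import importlib.util


def has_submodule(package: str, submodule: str) -> bool:
    """Return True if `package.submodule` can be located.

    Note: find_spec imports the parent package to search it, but it does
    not import the submodule itself.
    """
    try:
        return importlib.util.find_spec(f"{package}.{submodule}") is not None
    except ImportError:
        # The parent package is missing or failed to import.
        return False


if __name__ == "__main__":
    # `has_submodule` is a hypothetical helper, not a Pyserini API.
    if has_submodule("pyserini", "dindex"):
        print("pyserini.dindex is available")
    else:
        print("pyserini.dindex not found; a development installation may be needed")
```

If the check reports the module as missing, the installed release probably predates the feature, which matches the explanation above.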

shashankg7 commented 3 years ago

Thank you @lintool for your response.

I was able to get it to run from the development version.

I have a follow-up question. I am trying to index a Wikipedia corpus and search on it using dense retrieval.

I tried the DPR encoder, but it's not giving good results. I think this is because DPR was trained on the QA domain.

Any suggestions on which encoder I could use for Wikipedia?

lintool commented 3 years ago

The short answer is... we don't know. You're basically talking about an open research question...

Dense retrieval is known to be very corpus/query specific, often with poor zero-shot effectiveness when transferred to another collection.

That's why we have benchmarks like https://github.com/UKPLab/beir to further explore...
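Since effectiveness has to be measured on the target collection, one practical approach is to label a handful of queries and compare candidate encoders on them. The following is a rough sketch, not anything from Pyserini or BEIR: the function names and the `run`/`qrels` dictionary layout are assumptions chosen for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(run, qrels, k=10):
    """Average MRR@k and Recall@k over queries that have relevance labels.

    run:   {query_id: [doc_id, ...]} ranked best-first (assumed layout)
    qrels: {query_id: {doc_id, ...}} relevant documents (assumed layout)
    """
    queries = [q for q in qrels if qrels[q]]
    if not queries:
        return {"mrr": 0.0, "recall": 0.0}
    mrr = sum(reciprocal_rank(run.get(q, []), qrels[q], k) for q in queries)
    rec = sum(recall_at_k(run.get(q, []), qrels[q], k) for q in queries)
    return {"mrr": mrr / len(queries), "recall": rec / len(queries)}
```

Running `evaluate` once per encoder on the same labelled queries gives a like-for-like comparison on your own corpus, though a small sample will only show large differences reliably.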

shashankg7 commented 3 years ago

Thanks, @lintool, for your response and the pointer to the IR benchmark. I'll look into it further.