castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Dense retrieval: incorporate DPR collections #294

Closed (lintool closed this issue 3 years ago)

lintool commented 3 years ago

We can fold in all the DPR collections into Pyserini, so we can do the retriever part of a QA system directly in Pyserini.

MXueguang commented 3 years ago

How about we make the current QueryEncoder an abstract class, and create subclasses TCTColBERTQueryEncoder and DPRQueryEncoder?

The DPRQueryEncoder would wrap this: https://huggingface.co/transformers/model_doc/dpr.html#transformers.DPRQuestionEncoder
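
For concreteness, here is a minimal sketch of what that hierarchy could look like. The actual code merged later may differ; only the Hugging Face classes and model name below come from the link above, the rest is illustrative.

```python
# Sketch of the proposed QueryEncoder hierarchy (illustrative, not the merged code).
from abc import ABC, abstractmethod

import numpy as np
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer


class QueryEncoder(ABC):
    """Abstract base: maps a query string to a dense vector."""

    @abstractmethod
    def encode(self, query: str) -> np.ndarray:
        ...


class DPRQueryEncoder(QueryEncoder):
    """Wraps Hugging Face's DPRQuestionEncoder."""

    def __init__(self, model_name: str = 'facebook/dpr-question_encoder-single-nq-base'):
        self.tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(model_name)
        self.model = DPRQuestionEncoder.from_pretrained(model_name)

    def encode(self, query: str) -> np.ndarray:
        inputs = self.tokenizer(query, return_tensors='pt')
        # The pooled question embedding is exposed as `pooler_output`.
        embedding = self.model(**inputs).pooler_output
        return embedding.detach().numpy().flatten()
```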

lintool commented 3 years ago

Yes, I think this is the right approach, although TCTColBERTQueryEncoder looks really ugly. I don't have any better suggestions though.

lintool commented 3 years ago

As an aside, this also means that at some point we need to build sparse indexes for the Wikipedia collection used in DPR.
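
Once such an index exists, BM25 retrieval over the DPR Wikipedia passage splits would look roughly like this. The prebuilt index name 'wikipedia-dpr' is an assumption here, and the LuceneSearcher class name follows later Pyserini releases (older versions exposed it as SimpleSearcher).

```python
# Sketch of BM25 retrieval over the DPR Wikipedia passage splits; the prebuilt
# index name 'wikipedia-dpr' is assumed for illustration.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('wikipedia-dpr')
hits = searcher.search('who got the first nobel prize in physics', k=100)

for i, hit in enumerate(hits[:5]):
    print(f'{i + 1:2} {hit.docid} {hit.score:.4f}')
```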

lintool commented 3 years ago

Ref: #325 - code merged!

@MXueguang We need a replication guide for this also...

Currently, we have: https://github.com/castorini/pyserini/blob/master/docs/dense-retrieval.md

Would it make sense to break this up into separate guides?

Thoughts?

MXueguang commented 3 years ago

Yes. For msmarco-doc: we'll do that after we finish the msmarco-doc experiment. For DPR, I guess we need to evaluate the result with the downstream QA evaluation?

lintool commented 3 years ago

> for msmarco-doc: we'll do that after we finish the msmarco-doc experiment

Yup.

> for dpr, I guess we need to evaluate the result by the downstream qa evaluation?

No, let's focus only on the retriever stage. The architecture is retriever-reader, right? And the DPR paper reports component effectiveness for the retriever stage alone. Let's try to match those numbers.
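
For reference, the retriever-only step ends up looking roughly like this in Pyserini. The module paths, class names, and prebuilt index name below follow later Pyserini releases and are assumptions relative to this thread.

```python
# Rough sketch of the DPR retriever stage over a prebuilt dense index; names
# here (FaissSearcher, DprQueryEncoder, 'wikipedia-dpr-multi-bf') are assumed.
from pyserini.search.faiss import FaissSearcher, DprQueryEncoder

encoder = DprQueryEncoder('facebook/dpr-question_encoder-multiset-base')
searcher = FaissSearcher.from_prebuilt_index('wikipedia-dpr-multi-bf', encoder)

hits = searcher.search('who got the first nobel prize in physics', k=100)
print(hits[0].docid, hits[0].score)
```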

MXueguang commented 3 years ago

How do we deal with the DPR retrieval evaluation? The evaluation is different from regular IR tasks, i.e., evaluating against qrels. Two solutions:

  1. Write a script to evaluate DPR. This is straightforward (see the sketch after this list).
  2. Craft a qrels file: given a question, label each document 1 if it contains the answer to that question, and create a topic file as well. This would make DPR work the same as other tasks.
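
Option (1) amounts to DPR-style top-k retrieval accuracy: a question counts as answered at cutoff k if any of its top-k retrieved passages contains a gold answer string. The official DPR code does the matching with tokenization and regex normalization, so the simple substring check below is only an approximation.

```python
# Approximate DPR-style top-k retrieval accuracy (sketch; the official matching
# uses tokenized / regex-normalized answer matching rather than substrings).
from typing import Dict, List
import unicodedata


def has_answer(passage_text: str, answers: List[str]) -> bool:
    text = unicodedata.normalize('NFD', passage_text).lower()
    return any(unicodedata.normalize('NFD', ans).lower() in text for ans in answers)


def topk_accuracy(run: Dict[str, List[str]], answers: Dict[str, List[str]], k: int) -> float:
    """run: qid -> ranked passage texts; answers: qid -> gold answer strings."""
    hits = sum(
        any(has_answer(passage, answers[qid]) for passage in passages[:k])
        for qid, passages in run.items()
    )
    return hits / len(run)
```
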
lintool commented 3 years ago

Let's do (1) for now and just check in the official DPR eval script, just like we've checked in the MS MARCO scripts. We might want to put it into tools/ so PyGaggle can also use it, right @ronakice?

MXueguang commented 3 years ago

Hmm, I don't think they have an official "script" to evaluate with. They wrap the evaluation inside their retrieval functions here. I am evaluating with a script I wrote myself.

MXueguang commented 3 years ago

With my script, I am getting:

Top20: 0.7794906931597579
Top100: 0.8460660043393856

Theirs are:

Top20: 0.784
Top100: 0.854

A bit lower, but I am using the HNSW index right now; I will evaluate with the brute-force (flat) index next.
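
The gap is consistent with HNSW being an approximate index: a brute-force flat inner-product index searches exhaustively and should track the paper's numbers more closely. A minimal illustration of the two Faiss index types (768 is the DPR embedding size):

```python
# Illustration only: exact (flat) vs. approximate (HNSW) inner-product indexes in Faiss.
import faiss

dim = 768  # DPR question/passage embedding size

# Brute-force index: exhaustive search, exact scores, slower at query time.
bf_index = faiss.IndexFlatIP(dim)

# HNSW graph index: much faster, but approximate, so recall can dip slightly.
hnsw_index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)
hnsw_index.hnsw.efSearch = 256  # a larger efSearch narrows the gap at some speed cost
```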

MXueguang commented 3 years ago

Closed by https://github.com/castorini/pyserini/pull/335.

MXueguang commented 3 years ago

Will continue the discussion about replication results in https://github.com/castorini/pyserini/issues/336.