castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Integrate ANCE into pyserini #382

Closed MXueguang closed 3 years ago

MXueguang commented 3 years ago

Paper: https://arxiv.org/pdf/2007.00808.pdf. The authors provide encoder checkpoints at https://github.com/microsoft/ANCE.

MXueguang commented 3 years ago

We will integrate both the MS MARCO and NQ/TriviaQA models from ANCE (ance-msmarco, ance-nq-trivia).

For ANCE-NQ-Trivia, I converted their checkpoint into a Hugging Face model, which fits into our current DPR encoder directly. For ANCE-MS MARCO, we will create a new class with some hacks on top of their code.

They didn't provide prebuilt indexes, so we will encode the corpus ourselves (which means writing an encoding script).
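
For reference, a converted ANCE-NQ checkpoint should load with the standard Hugging Face DPR classes. The sketch below only illustrates the intended usage; the local path is a placeholder, not a final artifact name.

```python
# Sketch: loading a converted ANCE-NQ checkpoint with the standard
# Hugging Face DPR classes (the local path is a placeholder).
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

model_dir = 'path/to/ance-dpr-question_encoder-multi'  # placeholder
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(model_dir)
encoder = DPRQuestionEncoder.from_pretrained(model_dir)

inputs = tokenizer('who sings the monk theme song', return_tensors='pt')
embedding = encoder(**inputs).pooler_output  # (1, 768) question embedding
```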

MXueguang commented 3 years ago

Replicated ANCE for NQ by:

  1. converting their checkpoint into a Hugging Face DPR encoder;
  2. running our own passage embedding script to generate embeddings;
  3. running our current DPR pipeline with the converted ANCE-NQ checkpoint. Results on nq-test:
    Top20   accuracy: 0.8224376731301939
    Top100  accuracy: 0.8786703601108034

Expected: Top20: 82.1 Top100: 87.9

Converted model checkpoints:

ance-dpr-question-encoder-multi: https://www.dropbox.com/s/pps5rzzn4ynh3x3/ance-dpr-question_encoder-multi.tar.gz
ance-dpr-context-encoder-multi: https://www.dropbox.com/s/diq5y8dd1bytje1/ance-dpr-context_encoder-multi.tar.gz
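
For anyone following along, this is roughly how the converted question encoder plugs into our dense retrieval API once the passage embeddings are indexed. Treat it as a sketch: the prebuilt index name and the encoder path below are assumptions on my side, not final names.

```python
# Sketch only: dense retrieval with the converted ANCE-NQ question encoder.
# 'wikipedia-ance-multi-bf' and the encoder path are assumed names.
from pyserini.dsearch import DprQueryEncoder, SimpleDenseSearcher

encoder = DprQueryEncoder('path/to/ance-dpr-question_encoder-multi')
searcher = SimpleDenseSearcher.from_prebuilt_index('wikipedia-ance-multi-bf', encoder)

hits = searcher.search('who sings the monk theme song', k=100)
for hit in hits[:5]:
    print(hit.docid, round(hit.score, 4))
```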

MXueguang commented 3 years ago

Replicated ANCE for MS MARCO passage by:

  1. creating a new class AnceEncoder on top of transformers.RobertaModel (see the sketch after this list);
  2. converting their checkpoint so that it can be loaded directly into AnceEncoder, i.e., AnceEncoder.from_pretrained('ckpt_path');
  3. running our passage embedding script to generate embeddings;
  4. adding an AnceQueryEncoder to our current msmarco-passage dense pipeline.
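
A rough sketch of what the wrapper in step 1 looks like, assuming the released checkpoint adds a linear projection plus layer norm on top of RoBERTa's [CLS] representation; the attribute names and the 768-dim output below are assumptions, not the final implementation.

```python
# Rough sketch of an ANCE encoder wrapper around RobertaModel.
# The projection/layer-norm details are assumptions based on the ANCE paper/code.
import torch
from transformers import PreTrainedModel, RobertaConfig, RobertaModel


class AnceEncoder(PreTrainedModel):
    config_class = RobertaConfig
    base_model_prefix = 'ance_encoder'

    def __init__(self, config: RobertaConfig):
        super().__init__(config)
        self.ance_encoder = RobertaModel(config)
        self.embeddingHead = torch.nn.Linear(config.hidden_size, 768)
        self.norm = torch.nn.LayerNorm(768)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor = None):
        outputs = self.ance_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0, :]  # first-token ([CLS]) pooling
        return self.norm(self.embeddingHead(cls))
```

With something like this in place, the converted checkpoint from step 2 loads via AnceEncoder.from_pretrained('ckpt_path').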

Results on msmarco-passage:

#####################
MRR @10: 0.3301838017919672
QueriesRanked: 6980
#####################
map                     all 0.3363
recall_1000             all 0.9584

expected: MRR@10: 0.330, recall_1000: 0.959

Converted model checkpoint (query encoder = context encoder): https://www.dropbox.com/s/u02glpszk3jv6ws/ance-msmarco-passage-encoder.tar.gz
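
As a usage sketch (the prebuilt index name and the encoder path below are assumptions until things are finalized):

```python
# Sketch: searching the MS MARCO passage ANCE index through Pyserini's
# dense search API. 'msmarco-passage-ance-bf' is an assumed index name.
from pyserini.dsearch import AnceQueryEncoder, SimpleDenseSearcher

encoder = AnceQueryEncoder('path/to/ance-msmarco-passage-encoder')  # placeholder
searcher = SimpleDenseSearcher.from_prebuilt_index('msmarco-passage-ance-bf', encoder)

hits = searcher.search('what is a lobster roll?', k=10)
for i, hit in enumerate(hits):
    print(f'{i + 1:2} {hit.docid:8} {hit.score:.5f}')
```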

MXueguang commented 3 years ago

ANCE-MS MARCO-Passage and ANCE-NQ/TriviaQA have been replicated. Updating docs.

ANCE for MS MARCO document ranking can be integrated as well:

  1. we need to re-split the MS MARCO documents into passages, since the current per-passage index is a bit inconvenient for fetching the title; i.e., we may want to split the url and title with '\n' instead of ' ' (see the sketch after this list)
  2. ANCE-msmarco-doc can share the same model class as ANCE-msmarco-passage
  3. create embeddings and then evaluate
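
Here is a sketch of the kind of splitting described in item 1; the window/stride values and the segmentation logic are placeholders, not the exact recipe we will use.

```python
# Illustrative sketch: segment an MS MARCO document into overlapping passages,
# joining url/title/body with '\n' so the title can be recovered later.
def split_document(docid, url, title, body, window=10, stride=5):
    sentences = body.split('. ')  # a real implementation would use a proper sentence splitter
    passages = []
    for start in range(0, max(len(sentences) - window, 0) + 1, stride):
        passage_body = '. '.join(sentences[start:start + window])
        passages.append({
            'id': f'{docid}#{len(passages)}',
            'contents': f'{url}\n{title}\n{passage_body}',
        })
    return passages
```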

We will close this issue after doc ranking gets replicated.

lintool commented 3 years ago

ANCE on MS MARCO doc requires per-passage splitting also, right? Do the splits align with what we have for TCT-ColBERT and doc2query-T5?

MXueguang commented 3 years ago

Yes, it seems their split is not the same as ours. But we can apply their model on our splits, so that we can do hybrid retrieval, etc.

lintool commented 3 years ago

Agreed.

MXueguang commented 3 years ago

replicated ANCE for msmarco-doc by

lintool commented 3 years ago

I'm thinking the appropriate point of comparison is "2020/10/23 ANCE MaxP" from the leaderboard? https://microsoft.github.io/MSMARCO-Document-Ranking-Submissions/leaderboard/

That reports 0.384 for MRR@100.

In which case, we're pretty close... small difference easily attributable to passage definition?

This is already better than TCT-ColBERT (brute-force index) w/ MRR@100 0.3323. (Although TCT is zero-shot.)

lintool commented 3 years ago

hi @MXueguang I've successfully replicated everything here: https://github.com/castorini/pyserini/blob/master/docs/experiments-ance.md

I think everything except for MS MARCO doc can be "finalized" (e.g., hgf upload, etc.)? For MS MARCO doc, we're waiting on experiments w/ different definitions of passages, right?

MXueguang commented 3 years ago

yes

MXueguang commented 3 years ago

Uploading the following checkpoints to Hugging Face:

castorini/ance-msmarco-doc-firstp
castorini/ance-msmarco-doc-maxp
castorini/ance-msmarco-passage
castorini/ance-dpr-question-multi
castorini/ance-dpr-context-multi

lintool commented 3 years ago

👍

Once this stabilizes we should publish a new version of Pyserini. There are a bunch of cleanup issues we should close also...

lintool commented 3 years ago

Hey @MXueguang - next PR, we can add a link from the main README to the ANCE documentation.

lintool commented 3 years ago

I think we're all done here except for #444 testing above?

MXueguang commented 3 years ago

yeah

vrdn-23 commented 3 years ago

This is really amazing work folks!

Just thought I should mention something here that I didn't want to raise a separate issue for. Just as an FYI, on line 15 of pyserini/dsearch/_model.py we have:

_keys_to_ignore_on_load_unexpected = [r'pooler", r"classifier']

The quotes are mismatched, which causes the Hugging Face PreTrainedModel to show the following warning when loading the pretrained checkpoints (e.g., castorini/ance-msmarco-doc-maxp):

Some weights of the model checkpoint at castorini/ance-msmarco-doc-maxp were not used when initializing AnceEncoder: ['ance_encoder.classifier.dense.weight', 'ance_encoder.classifier.dense.bias', 'ance_encoder.classifier.out_proj.weight', 'ance_encoder.classifier.out_proj.bias']

  • This IS expected if you are initializing AnceEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing AnceEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Changing them to this produces the expected behavior:

_keys_to_ignore_on_load_unexpected = [r'pooler', r'classifier']
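
For what it's worth, a quick way to check the fix (module path as reported above; treat this as a sketch):

```python
# With the corrected regexes, loading the checkpoint should no longer warn
# about unused pooler/classifier weights.
from pyserini.dsearch._model import AnceEncoder

model = AnceEncoder.from_pretrained('castorini/ance-msmarco-doc-maxp')
```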

Thanks again for the amazing open source contributions!

lintool commented 3 years ago

Thanks! @vrdn-23 Do you want to send a PR directly?

BTW, if you've successfully run our code, please consider contributing to the reproduction logs? E.g., https://github.com/castorini/pyserini/blob/master/docs/experiments-dpr.md#reproduction-log

vrdn-23 commented 3 years ago

Sure @lintool ! I'll have one ready in some time!

Currently I'm trying to run ANCE on the TREC CAsT data and it might take me some time before I get the entire dataset indexed by FAISS (unless perhaps there is already an encoded one made available by you? :) ), but if I do get some results that I can successfully reproduce, I'd be happy to add them to the documentation!

vrdn-23 commented 3 years ago

Done! https://github.com/castorini/pyserini/pull/473