Closed: MXueguang closed this issue 3 years ago.
We will integrate both the MS MARCO and NQ models from ANCE (ance-msmarco, ance-nq-trivia).
For ANCE-NQ-Trivia, I converted their checkpoint into a Hugging Face model, which fits into our current DPR encoder directly. For ANCE-MSMARCO, we will create a new class with some hacks adapted from their code.
They didn't provide prebuilt indexes, so we will encode the corpus ourselves (writing an encoding script).
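A minimal sketch of what that encoding script could look like. The toy_encoder below is a stand-in for the real converted ANCE checkpoint; the batch size, embedding dimension, and function names are all placeholders, not the actual script:

```python
import numpy as np

def encode_corpus(passages, encode_batch, batch_size=2, dim=8):
    """Encode a corpus of passages into a dense-vector matrix, batch by batch.

    encode_batch stands in for a real encoder (e.g. the converted ANCE
    checkpoint); here it only needs to map a list of strings to a
    (len(batch), dim) float32 array.
    """
    embeddings = np.zeros((len(passages), dim), dtype=np.float32)
    for start in range(0, len(passages), batch_size):
        batch = passages[start:start + batch_size]
        embeddings[start:start + len(batch)] = encode_batch(batch)
    # in practice each shard would be written out (e.g. np.save) and
    # later loaded into a FAISS index
    return embeddings

# Toy stand-in encoder: deterministic pseudo-embeddings seeded from the text.
def toy_encoder(batch, dim=8):
    rows = []
    for text in batch:
        rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
        rows.append(rng.standard_normal(dim).astype(np.float32))
    return np.stack(rows)

corpus = ["passage one", "passage two", "passage three"]
vectors = encode_corpus(corpus, toy_encoder)
```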
Replicated ANCE for NQ:
Top20 accuracy: 0.8224376731301939
Top100 accuracy: 0.8786703601108034
Expected: Top20: 82.1 Top100: 87.9
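For anyone checking these numbers: top-k accuracy here is just the fraction of queries with at least one relevant passage in the top k. A small sketch with toy data (the run/qrels dict shapes are assumptions, not our actual file formats):

```python
def top_k_accuracy(run, qrels, k):
    """Fraction of queries with at least one relevant doc in the top k.

    run:   {qid: [docid, ...]}  ranked list per query
    qrels: {qid: {docid, ...}}  relevant doc ids per query
    """
    hits = 0
    for qid, ranking in run.items():
        if any(docid in qrels.get(qid, set()) for docid in ranking[:k]):
            hits += 1
    return hits / len(run)

run = {"q1": ["d3", "d1", "d9"], "q2": ["d7", "d2"]}
qrels = {"q1": {"d1"}, "q2": {"d5"}}
print(top_k_accuracy(run, qrels, 2))  # 0.5: q1 hits within top-2, q2 misses
```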
converted model ckpt: ance-dpr-question-encoder-multi: https://www.dropbox.com/s/pps5rzzn4ynh3x3/ance-dpr-question_encoder-multi.tar.gz ance-dpr-context-encoder-multi: https://www.dropbox.com/s/diq5y8dd1bytje1/ance-dpr-context_encoder-multi.tar.gz
Replicated ANCE for MS MARCO passage:
Implemented AnceEncoder on top of transformers.RobertaModel; the converted checkpoint loads directly via AnceEncoder, i.e. AnceEncoder.from_pretrained('ckpt_path').
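In case it helps readers, here is a rough numpy sketch of what I understand the pooling head on top of RoBERTa to do: take the [CLS] hidden state, apply a linear embedding head, then LayerNorm. This is my reading of the ANCE code, not guaranteed to match the checkpoint exactly, and the shapes below are toy values:

```python
import numpy as np

def ance_pool(hidden_states, W, b, eps=1e-12):
    """Sketch of the assumed ANCE pooling head: [CLS] hidden state ->
    linear projection ("embedding head") -> LayerNorm (no learned
    scale/shift in this simplified version)."""
    cls = hidden_states[:, 0, :]             # (batch, hidden): [CLS] token
    proj = cls @ W.T + b                     # linear embedding head
    mu = proj.mean(axis=-1, keepdims=True)
    var = proj.var(axis=-1, keepdims=True)
    return (proj - mu) / np.sqrt(var + eps)  # normalized dense embedding

rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 5, 16))     # (batch, seq_len, hidden)
W, b = rng.standard_normal((8, 16)), np.zeros(8)
emb = ance_pool(hidden, W, b)                # (batch, 8) embeddings
```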
Got results on msmarco-passage as below:
#####################
MRR @10: 0.3301838017919672
QueriesRanked: 6980
#####################
map all 0.3363
recall_1000 all 0.9584
expected: MRR@10: 0.330, recall_1000: 0.959
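The MRR@10 number above is the mean reciprocal rank of the first relevant hit within the top 10. A quick sketch with toy data (dict shapes are assumptions):

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k."""
    total = 0.0
    for qid, ranking in run.items():
        for rank, docid in enumerate(ranking[:k], start=1):
            if docid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(run)

run = {"q1": ["d2", "d1"], "q2": ["d9", "d8", "d7"]}
qrels = {"q1": {"d1"}, "q2": {"d7"}}
print(mrr_at_k(run, qrels))  # (1/2 + 1/3) / 2 = 0.41666...
```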
converted model ckpt: query_encoder = ctx_encoder: https://www.dropbox.com/s/u02glpszk3jv6ws/ance-msmarco-passage-encoder.tar.gz
ANCE-MSMARCO-Passage and ANCE-NQ/TriviaQA have been replicated. Updating docs.
ANCE for MS MARCO document ranking can be integrated as well.
split the url and title instead of ' '
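If I understand the fix correctly, the idea is to join the document fields with an explicit separator rather than a bare space. A hypothetical sketch; the <sep> token, field order, and function name are all assumptions, not what the ANCE code literally does:

```python
def build_doc_text(url, title, body, sep=" <sep> "):
    """Join document fields with an explicit separator token instead of a
    bare space, so field boundaries survive tokenization. The <sep> token
    here is an assumption, not necessarily what ANCE used."""
    return sep.join([url, title, body])

text = build_doc_text("https://example.com", "Example Title", "Body text.")
```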
We will close this issue after doc ranking gets replicated.
ANCE on MS MARCO doc requires per-passage splitting also, right? Do the splits align with what we have for TCT-ColBERT and doc2query-T5?
Yes. It seems their split is not the same as ours, but we can apply their model to our splits, so that we can do hybrid retrieval, etc.
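A sketch of the kind of hybrid we mean: linear interpolation of dense and sparse scores per document. The min-score fallback for documents missing from one run is just one convention, an assumption here:

```python
def hybrid_scores(dense, sparse, alpha=0.5):
    """Interpolate dense and sparse retrieval scores per doc id.

    Docs missing from one run get that run's minimum score as a fallback
    (one common convention; the exact handling is an assumption here).
    """
    docs = set(dense) | set(sparse)
    d_min, s_min = min(dense.values()), min(sparse.values())
    return {
        doc: alpha * dense.get(doc, d_min) + (1 - alpha) * sparse.get(doc, s_min)
        for doc in docs
    }

fused = hybrid_scores({"d1": 0.9, "d2": 0.4}, {"d2": 12.0, "d3": 8.0}, alpha=0.5)
```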
Agreed.
Replicated ANCE for msmarco-doc:
#####################
MRR @100: 0.37965620295359753
QueriesRanked: 5193
#####################
but we don't have a reference for this result right now since the passage definition is different
I'm thinking the appropriate point of comparison is "2020/10/23 ANCE MaxP" from the leaderboard? https://microsoft.github.io/MSMARCO-Document-Ranking-Submissions/leaderboard/
That reports 0.384 for MRR@100.
In which case, we're pretty close... small difference easily attributable to passage definition?
This is already better than TCT-ColBERT (brute-force index) w/ MRR@100 0.3323. (Although TCT is zero-shot.)
hi @MXueguang I've successfully replicated everything here: https://github.com/castorini/pyserini/blob/master/docs/experiments-ance.md
I think everything except for MS MARCO doc can be "finalized" (e.g., hgf upload, etc.)? For MS MARCO doc, we're waiting on experiments w/ different definitions of passages, right?
yes
Uploading the following ckpts to hgf:
castorini/ance-msmarco-doc-firstp
castorini/ance-msmarco-doc-maxp
castorini/ance-msmarco-passage
castorini/ance-dpr-question-multi
castorini/ance-dpr-context-multi
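For anyone wondering about the -firstp/-maxp suffixes in those checkpoint names: roughly, FirstP scores a document by its first passage only, while MaxP takes the best-scoring passage. A toy sketch of the aggregation step (my simplification, not the checkpoints' actual code):

```python
def aggregate(passage_scores, strategy="maxp"):
    """Aggregate per-passage scores into one document score.

    firstp: score of the first passage only; maxp: best passage wins.
    """
    if strategy == "firstp":
        return passage_scores[0]
    if strategy == "maxp":
        return max(passage_scores)
    raise ValueError(f"unknown strategy: {strategy}")

scores = [0.2, 0.7, 0.5]   # scores for one document's passages
print(aggregate(scores, "firstp"))  # 0.2
print(aggregate(scores, "maxp"))    # 0.7
```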
👍
Once this stabilizes we should publish a new version of Pyserini. There are a bunch of cleanup issues we should close also...
Hey @MXueguang - next PR, we can add a link from the main README to the ANCE documentation.
I think we're all done here except for #444 testing above?
yeah
This is really amazing work folks!
Just thought I should mention something here that I didn't want to raise a separate issue for. Just as an FYI, on line 15 in pyserini/dsearch/_model.py:
_keys_to_ignore_on_load_unexpected = [r'pooler", r"classifier']
the quotes are off, and this causes the HuggingFace PreTrained model to show this warning when we are loading the pretrained checkpoints (e.g. castorini/ance-msmarco-doc-maxp):
Some weights of the model checkpoint at castorini/ance-msmarco-doc-maxp were not used when initializing AnceEncoder: ['ance_encoder.classifier.dense.weight', 'ance_encoder.classifier.dense.bias', 'ance_encoder.classifier.out_proj.weight', 'ance_encoder.classifier.out_proj.bias']
- This IS expected if you are initializing AnceEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AnceEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Changing them to this produces the expected behavior:
_keys_to_ignore_on_load_unexpected = [r'pooler', r'classifier']
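To illustrate why the broken version misbehaves: the misplaced quotes collapse the two patterns into a single string, so no checkpoint key ever matches and nothing gets ignored (transformers checks these patterns against key names with re.search, as far as I can tell):

```python
import re

broken = [r'pooler", r"classifier']  # quoting slip: this is ONE pattern string
fixed = [r'pooler', r'classifier']   # two patterns, as intended

keys = ["classifier.dense.weight", "pooler.dense.weight"]
matched_broken = [k for k in keys if any(re.search(p, k) for p in broken)]
matched_fixed = [k for k in keys if any(re.search(p, k) for p in fixed)]
print(matched_broken)  # [] -- no key is ignored, hence the warning
print(matched_fixed)   # both keys match and are silently ignored
```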
Thanks again for the amazing open source contributions!
Thanks! @vrdn-23 Do you want to send a PR directly?
BTW, if you've successfully run our code, please consider contributing to the reproduction logs? E.g., https://github.com/castorini/pyserini/blob/master/docs/experiments-dpr.md#reproduction-log
Sure @lintool ! I'll have one ready in some time!
Currently I'm trying to run ANCE on the TREC CAsT data, and it might take me some time before I get the entire dataset indexed with FAISS (unless there is already an encoded version made available by you? :) ). But if I do get results that I can successfully reproduce, I'd be happy to add them to the documentation!
Paper: https://arxiv.org/pdf/2007.00808.pdf. Encoder checkpoints are provided at https://github.com/microsoft/ANCE.