castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Impact of aligning initial retrieval tokenization with later transformer stages? #1123

Closed: lintool closed this issue 2 years ago

lintool commented 4 years ago

What if we performed initial retrieval using the same tokenizer that's used in the transformer-based reranking stage (e.g., BPE, SentencePiece, etc.)? What would be the impact?
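Concretely, "the same tokenizer" would mean something like the following (a minimal sketch assuming HuggingFace's transformers package; not necessarily the exact pipeline used in the experiments below): wordpiece-tokenize each passage, rejoin the pieces with whitespace, and index the result so Lucene's whitespace analysis sees exactly the reranker's vocabulary.

```python
# Minimal sketch: pre-tokenize a passage with BERT's wordpiece tokenizer so
# first-stage retrieval shares the reranker's vocabulary.
# Assumes the HuggingFace `transformers` package; not the exact pipeline used below.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

passage = "Anserini is a Lucene toolkit for reproducible IR research."
pieces = tokenizer.tokenize(passage)  # wordpieces, e.g. ['an', '##ser', '##ini', ...]
pretokenized = " ".join(pieces)       # whitespace-delimited, ready for indexing
print(pretokenized)
```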

stephaniewhoo commented 3 years ago
c45hu@orca:~/anserini$ python tools/scripts/msmarco/msmarco_passage_eval.py  collections/msmarco-passage/qrels.dev.small.tsv runs/run.pretokenized.dev.small.tsv 
#####################
MRR @10: 0.18234689816709873
QueriesRanked: 6980
#####################

This is slightly lower than the previous MRR@10.
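For context, the metric the eval script reports is MRR@10: the mean over queries of the reciprocal rank of the first relevant passage in the top 10, with 0 if none appears. A stripped-down illustration of what it computes (not the actual msmarco_passage_eval.py logic):

```python
# Stripped-down illustration of MRR@10 (not the actual msmarco_passage_eval.py).
# `run` maps qid -> ranked list of docids; `qrels` maps qid -> set of relevant docids.
def mrr_at_10(run, qrels):
    total = 0.0
    for qid, ranking in run.items():
        for rank, docid in enumerate(ranking[:10], start=1):
            if docid in qrels.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run)

# Example: relevant doc at rank 2 for q1, no hit for q2 -> (0.5 + 0.0) / 2 = 0.25
print(mrr_at_10({"q1": ["d9", "d3"], "q2": ["d7"]}, {"q1": {"d3"}, "q2": {"d1"}}))
```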

lintool commented 3 years ago

hey @stephaniewhoo IIRC Leo et al. looked at this same issue recently - what were their findings for reference?

lintool commented 3 years ago

hey @stephaniewhoo can you try cleaning the text as well? https://github.com/castorini/anserini/issues/1212

stephaniewhoo commented 3 years ago

> hey @stephaniewhoo can you try cleaning the text as well? #1212

So instead of using the BERT-tokenized collection as input, we'd use the encoded (cleaned) text?

stephaniewhoo commented 3 years ago

@lintool I tried two ways.

  1. fix_text() on the collection and queries, with indexing and retrieval using the pretokenized option. The MRR is really low: ~0.05. I think I did all the steps right... my guess is that fix_text() only normalizes Latin text. When the whitespace analyzer is applied directly afterwards, many document tokens are left in the index like `(neudeutsches` and `45°.` Could this be the reason for the bad result? (If we perform normal indexing and retrieval, i.e., without the pretokenized option, the MRR is the same as Cui reported in #1212.)

  2. fix_text() to clean first, then apply the BERT tokenizer (indexing and retrieval again with the pretokenized option); a sketch of this two-step pipeline follows after the results below.

    python tools/scripts/msmarco/msmarco_passage_eval.py  collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage.dev.clean-bert.small.tsv 
    #####################
    MRR @10: 0.18352264974757815
    QueriesRanked: 6980
    #####################

    This is the MRR result: higher than the previous 0.182 (cleaning the text does help), but still lower than our baseline.
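A minimal sketch of that two-step pipeline, assuming the `ftfy` package for fix_text() and HuggingFace transformers for the tokenizer (the actual scripts may differ):

```python
# Sketch of step 2: clean with ftfy's fix_text(), then BERT-tokenize.
# Assumes the `ftfy` and `transformers` packages; not the exact scripts used here.
from ftfy import fix_text
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def clean_and_tokenize(text: str) -> str:
    cleaned = fix_text(text)  # repair mojibake / normalize unicode first
    return " ".join(tokenizer.tokenize(cleaned))
```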

lintool commented 3 years ago

Can you try to replicate Cui: https://github.com/castorini/anserini/issues/1212#issuecomment-632465776

I just want to make sure that result wasn't a fluke.

lintool commented 3 years ago

Okay, so next we try to do e2e.

  1. Take this - https://github.com/castorini/anserini/issues/1123#issuecomment-821199782 - and rerank w/ BERT
  2. Take (2) above (fix_text() then the BERT tokenizer), and rerank w/ BERT

Let's see what we get.
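For reference, the reranking step would look roughly like this (a minimal sketch assuming pygaggle's monoBERT reranker API as shown in its README; the real end-to-end runs feed in the full first-stage run files):

```python
# Sketch: rerank first-stage candidates with monoBERT via pygaggle.
# Assumes pygaggle's reranking API as documented in its README.
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoBERT

reranker = MonoBERT()  # loads a monoBERT model fine-tuned on MS MARCO

query = Query("what is a lucene toolkit")
candidates = [  # (docid, passage) pairs from the first-stage BM25 run
    ("7187158", "Anserini is a Lucene toolkit for reproducible IR research."),
    ("7187157", "Lucene is a search engine library written in Java."),
]
texts = [Text(passage, {"docid": docid}, 0) for docid, passage in candidates]

reranked = reranker.rerank(query, texts)
reranked.sort(key=lambda x: x.score, reverse=True)
for hit in reranked:
    print(hit.metadata["docid"], hit.score)
```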

stephaniewhoo commented 3 years ago

> Can you try to replicate Cui: #1212 (comment)
>
> I just want to make sure that result wasn't a fluke.

Yes, I have done the replication and got the same result as Cui.

python tools/scripts/msmarco/msmarco_passage_eval.py collections/msmarco-passage/qrels.dev.small.tsv runs/test.clean.small.tsv
#####################
MRR @10: 0.18809080365670577
QueriesRanked: 6980
#####################

lintool commented 3 years ago

cc @MXueguang - we should think about whether it's worthwhile to push out a prebuilt index with clean text...

crystina-z commented 3 years ago

Replication scores:

  1. with mBERT-base-uncased

    (ict) x978zhan@tuna:~/task-pretok$ cat mbert/score
    #####################
    MRR @10: 0.18431482239505195
    QueriesRanked: 6980
    #####################
  2. with bert-base-uncased

    #####################
    MRR @10: 0.18461994587710923
    QueriesRanked: 6980
    #####################

The script I was using: https://gist.github.com/crystina-z/caff5c1cbc440f1fc24f337d640a4d8d. For each file, it took 20-30 min to finish the tokenization.
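The gist's contents aren't reproduced here, but a per-file pass of this kind might look roughly like the following (a hypothetical sketch, assuming MS MARCO's id-tab-text TSV layout and a HuggingFace tokenizer; swap the checkpoint name for mBERT vs. BERT):

```python
# Hypothetical stand-in for the gist above: one pass over an MS MARCO-style
# TSV (id<TAB>text per line), writing a pretokenized TSV alongside it.
import sys
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
    for line in fin:
        pid, text = line.rstrip("\n").split("\t", 1)
        fout.write(f"{pid}\t{' '.join(tokenizer.tokenize(text))}\n")
```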

stephaniewhoo commented 3 years ago

Finally got the results from cc for cleaning the text first, then applying the BERT tokenizer, and reranking with BERT:

#####################
MRR @10: 0.37588893437031035
QueriesRanked: 6980
#####################

It's still a bit lower than our original BERT result. Still waiting on the run without text cleaning.

lintool commented 3 years ago

For reference, original BERT results here: https://github.com/castorini/pygaggle/blob/master/docs/experiments-msmarco-passage-entire.md#re-ranking-with-monobert

MRR@10: 0.37922

stephaniewhoo commented 3 years ago

Without text cleaning, the result from BERT is:

[c45hu@cedar5 pygaggle]$ python tools/scripts/msmarco/msmarco_passage_eval.py data/msmarco_pretokenized/qrels.dev.small.tsv runs/run.monobert.ans_entire.dev.pretokenized.tsv 
#####################
MRR @10: 0.3751549779415115
QueriesRanked: 6980
#####################

lintool commented 2 years ago

I think we have a good sense of what's going on here, closing...