castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Better BM25 tuning with skopt #564

Closed: lintool closed this issue 3 years ago

lintool commented 3 years ago

I do grid search for tuning BM25 here: https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md#bm25-tuning

Which is kinda stupid.

We should use skopt: https://scikit-optimize.github.io/stable/

@alexlimh can you please contribute this after EMNLP?
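For reference, the tuning loop that skopt would drive can be sketched as follows. This is a self-contained stand-in, not the eventual implementation: the `mock_mrr_at_10` surface and its numbers are invented for illustration, plain random search stands in for skopt's Gaussian-process sampler, and the parameter ranges are assumptions.

```python
import random

# Hypothetical stand-in for the real objective, which would run retrieval
# over the dev queries at (k1, b) and score the run. This smooth synthetic
# surface peaks near (0.6, 0.8); its numbers are illustrative only.
def mock_mrr_at_10(k1, b):
    return 0.19 - 0.02 * ((k1 - 0.6) ** 2 + (b - 0.8) ** 2)

def tune_bm25(objective, n_calls=50, seed=42):
    """Random search over (k1, b). skopt's gp_minimize would replace this
    uniform sampler with a Gaussian-process model of the objective."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_calls):
        k1 = rng.uniform(0.1, 2.0)  # assumed plausible BM25 ranges
        b = rng.uniform(0.0, 1.0)
        score = objective(k1, b)
        if best is None or score > best[0]:
            best = (score, k1, b)
    return best

score, k1, b = tune_bm25(mock_mrr_at_10)
print(f"best score {score:.4f} at k1={k1:.2f}, b={b:.2f}")
```

Note that `skopt.gp_minimize` minimizes, so in the real setting one would pass the negated metric as the objective.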

alexlimh commented 3 years ago

no problem, will look into that after EMNLP.

alexlimh commented 3 years ago

Hyperparameter tuning results on MS MARCO using skopt (Gaussian process) for 50 iterations. The original results using grid search can be found here: https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md

| Setting | MRR@10 | MAP | Recall@1000 |
|---|---|---|---|
| Default (k1=0.9, b=0.4) | 0.1840 | 0.1926 | 0.8526 |
| Grid Search, Optimized for Recall@1000 (k1=0.82, b=0.68) | 0.1874 | 0.1957 | 0.8573 |
| Skopt, Optimized for Recall@1000 (k1=0.75, b=0.87) | 0.1885 | 0.1966 | 0.8596 |
| Grid Search, Optimized for MRR@10/MAP (k1=0.60, b=0.62) | 0.1892 | 0.1972 | 0.8555 |
| Skopt, Optimized for MRR@10 (k1=0.61, b=0.78) | 0.1907 | 0.1987 | 0.8578 |
| Skopt, Optimized for MAP (k1=0.60, b=0.81) | 0.1908 | 0.1989 | 0.8581 |

lintool commented 3 years ago

Nice, so there's still a bit more to be gained!

A few questions:

alexlimh commented 3 years ago
lintool commented 3 years ago

But then you're training and testing on the same dev queries?

You should probably use the queries here for a fair comparison w/ grid search? https://github.com/castorini/Anserini-data/tree/master/MSMARCO
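One way to keep the comparison fair is to draw tuning subsets that are disjoint from the queries used for the final evaluation. A minimal sketch of such sampling; the helper name, subset count, and sizes here are hypothetical, not taken from the linked repository:

```python
import random

def sample_tuning_subsets(query_ids, n_subsets=5, size=100, seed=0):
    """Draw disjoint subsets of query ids for parameter tuning; the
    remaining ids stay untouched for the final evaluation."""
    rng = random.Random(seed)
    pool = list(query_ids)
    rng.shuffle(pool)
    assert n_subsets * size <= len(pool), "not enough queries to split"
    return [pool[i * size:(i + 1) * size] for i in range(n_subsets)]

# Illustrative usage with 1000 fake query ids: 5 tuning subsets of 100
# queries each, leaving 500 held out for evaluation.
subsets = sample_tuning_subsets(range(1000), n_subsets=5, size=100)
```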

alexlimh commented 3 years ago

I just followed tune_bm25.py and didn't change the code except for the grid-search part.

As for the training queries, do you mean this one:

```python
# Evaluate with official scoring script
results = subprocess.check_output(['python', 'tools/scripts/msmarco/msmarco_passage_eval.py',
                                   'collections/msmarco-passage/qrels.train.tsv',
```
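The official scoring script reports MRR@10; for context, the metric itself can be sketched as follows. The `run`/`qrels` structures and the toy data are invented for illustration.

```python
def mrr_at_10(run, qrels):
    """run: qid -> ranked list of doc ids; qrels: qid -> set of relevant ids."""
    total = 0.0
    for qid, ranking in run.items():
        relevant = qrels.get(qid, set())
        for rank, docid in enumerate(ranking[:10], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run)

# Toy example: q1 finds its relevant doc at rank 2, q2 never does.
run = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8"]}
qrels = {"q1": {"d1"}, "q2": {"d7"}}
print(mrr_at_10(run, qrels))  # (1/2 + 0) / 2 = 0.25
```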

Here's the script I used:

```shell
python tools/scripts/msmarco/tune_bm25_skopt.py --base-directory runs_$metric \
        --index indexes/msmarco-passage/lucene-index-msmarco \
        --queries collections/msmarco-passage/queries.dev.small.tsv \
        --qrels-tsv collections/msmarco-passage/qrels.dev.small.tsv \
        --qrels-trec collections/msmarco-passage/qrels.dev.small.trec \
        --skopt-iters $iters \
        --hits $hits \
        --metric $metric \
        --seed $seed \
        --threads 16
```
alexlimh commented 3 years ago

I see the mistakes. Will take care of this today.

alexlimh commented 3 years ago

New results using 5 training subsets for tuning k1 and b:

| Setting | MRR@10 | MAP | Recall@1000 |
|---|---|---|---|
| Default (k1=0.9, b=0.4) | 0.1840 | 0.1926 | 0.8526 |
| Grid Search, Optimized for Recall@1000 (k1=0.82, b=0.68) | 0.1874 | 0.1957 | 0.8573 |
| Skopt, Optimized for Recall@1000 (k1=0.68, b=0.72) | 0.1890 | 0.1971 | 0.8575 |
| Grid Search, Optimized for MRR@10/MAP (k1=0.60, b=0.62) | 0.1892 | 0.1972 | 0.8555 |
| Skopt, Optimized for MAP (k1=0.63, b=0.62) | 0.1892 | 0.1972 | 0.8564 |
lintool commented 3 years ago

Closing issue. Skopt seems to be overkill for tuning BM25, since grid search suffices.