castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

MS MARCO passage regression errors: BM25prf gives non-deterministic results #774

Closed: lintool closed this issue 5 years ago

lintool commented 5 years ago

Hi @emmileaf, I'm getting these MS MARCO passage regression errors:

This is on tuna:

2019-08-11 03:55:02,107 - regression_test - ERROR - !!!!!{"actual": 0.1518, "collection": "msmarco-passage", "expected": 0.152, "metric": "map", "model": "bm25-default+prf", "topic": "[MS MARCO Passage Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Passage-Ranking)"}!!!!!
...
2019-08-11 03:56:41,141 - regression_test - ERROR - !!!!!{"actual": 0.1579, "collection": "msmarco-passage", "expected": 0.1582, "metric": "map", "model": "bm25-tuned+prf", "topic": "[MS MARCO Passage Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Passage-Ranking)"}!!!!!

This is on another machine:

2019-08-11 03:38:45,575 - regression_test - ERROR - !!!!!{"actual": 0.1519, "collection": "msmarco-passage", "expected": 0.152, "metric": "map", "model": "bm25-default+prf", "topic": "[MS MARCO Passage Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Passage-Ranking)"}!!!!!
...
2019-08-11 03:39:51,630 - regression_test - ERROR - !!!!!{"actual": 0.158, "collection": "msmarco-passage", "expected": 0.1582, "metric": "map", "model": "bm25-tuned+prf", "topic": "[MS MARCO Passage Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Passage-Ranking)"}!!!!!

It seems like BM25prf gives non-deterministic results?
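
To compare the discrepancies across machines, the error lines above can be parsed mechanically. The following is a minimal sketch, assuming the log format shown in this comment (a JSON payload wrapped in `!!!!!` markers); the function name `parse_regression_errors` and the added `delta` field are hypothetical and not part of Anserini.

```python
import json
import re

# Assumption: each regression error line wraps a JSON object in "!!!!!" markers,
# as in the log excerpts above.
ERROR_PATTERN = re.compile(r"!!!!!(\{.*\})!!!!!")


def parse_regression_errors(log_lines):
    """Extract the JSON payloads and attach actual - expected as 'delta'."""
    records = []
    for line in log_lines:
        match = ERROR_PATTERN.search(line)
        if match is None:
            continue  # not a regression error line
        record = json.loads(match.group(1))
        record["delta"] = round(record["actual"] - record["expected"], 4)
        records.append(record)
    return records


sample_log = [
    '2019-08-11 03:55:02,107 - regression_test - ERROR - !!!!!{"actual": 0.1518, '
    '"collection": "msmarco-passage", "expected": 0.152, "metric": "map", '
    '"model": "bm25-default+prf", "topic": "[MS MARCO Passage Ranking: Dev Queries]'
    '(https://github.com/microsoft/MSMARCO-Passage-Ranking)"}!!!!!',
    "...",
]

records = parse_regression_errors(sample_log)
for record in records:
    print(record["model"], record["metric"], record["delta"])
```

Applied to both machines' logs, this makes it easy to see that the deltas are small (a few units in the fourth decimal place) and differ between machines.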

@matthew-z any ideas?

matthew-z commented 5 years ago

BM25PRF is expected to give deterministic results.

Could you please run BM25PRF again with the -bm25prf.outputQuery argument to log the query expansion? I will also try to run it on my server, but I don't have the collection right now.

emmileaf commented 5 years ago

From a quick look at the runs, this might be a score tie handling issue?

https://github.com/castorini/anserini/blob/5b29d1654abc5e8a014c2230da990ab2f91fb340/src/main/java/io/anserini/rerank/lib/BM25PrfReranker.java#L113

https://github.com/castorini/anserini/blob/5b29d1654abc5e8a014c2230da990ab2f91fb340/src/main/java/io/anserini/rerank/lib/Rm3Reranker.java#L111-L118

I'll add in tiebreak handling similar to the other rerankers, and update regression numbers for PRF.
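
The idea behind tie-break handling can be sketched as follows. This is an illustration in Python rather than the Java code linked above, and it assumes ties are broken by ascending docid; the actual rule used by the Anserini rerankers may differ. The function name and the third docid are made up for the example.

```python
from typing import List, Tuple


def rank_with_tiebreak(scored: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Sort by score descending; break exact score ties by docid ascending.

    Note: docids are compared as strings here, so the order is lexicographic.
    """
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Two runs that return the same tied documents in a different internal order.
# The docid "7187158" and all scores are illustrative values only.
hits_run_a = [("1469568", 11.7268), ("3427981", 11.7268), ("7187158", 12.5)]
hits_run_b = [("3427981", 11.7268), ("1469568", 11.7268), ("7187158", 12.5)]

# With an explicit secondary key, both inputs produce the same ranking.
assert rank_with_tiebreak(hits_run_a) == rank_with_tiebreak(hits_run_b)
```

Without the secondary key, the relative order of tied documents depends on the order in which they arrive, which is one way a ranking can vary between runs.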

emmileaf commented 5 years ago

An update on this issue: I tried changing the line mentioned above and re-ran retrieval a couple of times, but the results are still sometimes inconsistent.

Discrepancies look like they're mostly from closely scoring documents having different order, though there's probably something else in the code that I missed in the comment above…

A small example diff:

< 26664 Q0 1469568 841 11.726800 Anserini
< 26664 Q0 3427981 842 11.726799 Anserini
---
> 26664 Q0 3427981 841 11.726800 Anserini
> 26664 Q0 1469568 842 11.726799 Anserini

emmileaf commented 5 years ago

Turns out I had made a different mistake while verifying/testing the changes earlier 🤦‍♀ Tie-breaking seems to have fixed it with no regression number changes. I will follow up with a PR.

lintool commented 5 years ago

Regression error has cropped up again when running:

python src/main/python/run_regression.py --index --collection msmarco-doc >& log.msmarco-doc

Results on damiano:

2019-09-04 03:02:24,322 - regression_test - ERROR - !!!!!{"actual": 0.1357, "collection": "msmarco-doc", "expected": 0.1359, "metric": "map", "model": "bm25-default+prf", "topic": "[MS MARCO Document Ranking: Dev Queries](https://github.com/microsoft/TREC-2019-Deep-Learning)"}!!!!!
2019-09-04 03:03:14,602 - regression_test - ERROR - !!!!!{"actual": 0.1559, "collection": "msmarco-doc", "expected": 0.1562, "metric": "map", "model": "bm25-tuned+prf", "topic": "[MS MARCO Document Ranking: Dev Queries](https://github.com/microsoft/TREC-2019-Deep-Learning)"}!!!!!

lintool commented 5 years ago

Two trials on tuna (Java 8) give the same result. Two trials on my iMac Pro (Java 8) give the same result.

I think it's just the case that we forgot to update the regression values.

See PR #788