castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0
1.62k stars 354 forks source link

Resolve tiny differences between Anserini and Pyserini on MS MARCO: query iteration order #308

Closed lintool closed 3 years ago

lintool commented 3 years ago

If we look at the Python replications: https://github.com/castorini/pyserini/blob/master/docs/pypi-replication.md Compared against Anserini replications: e.g., https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc-leaderboard.md

We'll note tiny differences - e.g., for MS MARCO doc, baselines - pyserini:

#####################
MRR @100: 0.2770296928568709
QueriesRanked: 5193
#####################

Compared to anserini:

#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

Previously, we tracked it down issue #257

I'd like to fix it so get identical results moving forward - my proposed fix is a bit janky, but it'll work: Let's just store, in Python code, an array of integers corresponding to ids of the queries in the original queries file. When we're iterating over the dataset in pyserini.search, we just follow the order of the integers.

Slightly better, we introduce a new query iterator abstraction and hide this implementation detail in there. So the query iterator would take in the current dictionary, and an optional array holding the iteration order.

Thoughts @MXueguang? I was thinking you could work on this?

MXueguang commented 3 years ago

e.g. msmarco-doc

174249  does xpress bet charge to deposit money in your account
320792  how much is a cost to run disneyland
1090270 botulinum definition
1101279 do physicians pay for insurance from their salaries?
201376  here there be dragons comic
54544   blood diseases that are sexually transmitted

we want to search with id sorted?

MXueguang commented 3 years ago

oh I see, the pyserini is following the increasing order of query id. but we want it follow the order of https://github.com/castorini/anserini/blob/master/src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt?

lintool commented 3 years ago

Correct. If you take the pyserini and anserini output, sort both, you'll see that they're identical... so the issue must come from query ordering....

MXueguang commented 3 years ago

but if we want to get the array of id, we have to manually load the id from topics.msmarco-doc.dev.txt?

lintool commented 3 years ago

My suggestion is to just to take the ids, stuff in a Python array, and treat as a global variable, exactly like this: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L1

That way, you'll never "lose" the file. Space is cheap?

MXueguang commented 3 years ago

and then for loop that global variable here? https://github.com/castorini/pyserini/blob/9c7c3b1df0d66b46db5585dcbfc88a2bdd238242/pyserini/search/__main__.py#L134

lintool commented 3 years ago

Yea, except this:

Slightly better, we introduce a new query iterator abstraction and hide this implementation detail in there. So the query iterator would take in the current dictionary, and an optional array holding the iteration order.

Thoughts?

MXueguang commented 3 years ago

I prefer the first one actually.

My suggestion is to just to take the ids, stuff in a Python array, and treat as a global variable, exactly like this: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L1

we just fix that for experiment replications right? so it is straightforward.

I had a try on msmarco-doc now we can get result identical to anserini:

#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

the only shortage is the global variable took 40k columns but that is fine.

lintool commented 3 years ago

I prefer the first one actually.

Now that I think about it, I think introducing a query iterator is cleaner? If you just fix here: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L1

It'll only be fixed for pyserini.search - if we introduce query iterator, it'll be more future proof... i.e., for dense retrieval, hybrid, etc. and direct library usage...

MXueguang commented 3 years ago

i see. so query iterator takes cur_dict and order_array and yield (id, text) pairs

MXueguang commented 3 years ago

ill draft pr

lintool commented 3 years ago

BTW, please make this work for both MS MARCO {doc, passage} x {dev, eval}.

MXueguang commented 3 years ago

I was thinking about making the query_order keyed by the topics name, but seems eval topics are not in here https://github.com/castorini/pyserini/blob/9c7c3b1df0d66b46db5585dcbfc88a2bdd238242/pyserini/search/_base.py#L75

lintool commented 3 years ago

They're here in anserini: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/topicreader/TopicReader.java#L67

Part of the latest release... they just need to be exposed - via the hook you linked to above.

lintool commented 3 years ago

Closed by #309