castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

How are you handling duplicate entries for the corpus and qrels? #1902

Closed steven-channel closed 1 month ago

steven-channel commented 1 month ago

While running an evaluation on the MIRACL-Korean benchmark, I noticed that the qrels and corpus files contain duplicate IDs, which is causing errors. Is this being handled somewhere?
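
For reference, here is a minimal sketch of how duplicates like these could be detected. It is not how Pyserini handles them. It assumes the corpus is JSONL with the document ID in a `docid` field and that the qrels use the standard four-column TREC layout (`qid iteration docid relevance`). The function names `duplicate_doc_ids` and `duplicate_qrels` are hypothetical helpers written for this illustration.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Tuple


def duplicate_doc_ids(jsonl_lines: Iterable[str], id_field: str = "docid") -> Dict[str, int]:
    """Return {doc_id: count} for every document ID that occurs more than once.

    Assumes one JSON object per line; `id_field` is an assumption about the schema.
    """
    counts = Counter(
        json.loads(line)[id_field] for line in jsonl_lines if line.strip()
    )
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


def duplicate_qrels(qrels_lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Return {(query_id, doc_id): count} for judgments that appear more than once.

    Assumes whitespace-separated TREC qrels: qid, iteration, docid, relevance.
    """
    counts: Counter = Counter()
    for line in qrels_lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docid, _relevance = parts[:4]
        counts[(qid, docid)] += 1
    return {pair: n for pair, n in counts.items() if n > 1}
```

Both helpers accept any iterable of lines, so an open file handle can be passed directly. An empty result means no duplicates were found under these format assumptions.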

steven-channel commented 1 month ago

It looks like a similar issue was reported in Anserini?

https://github.com/castorini/anserini/issues/720