castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0
1.57k stars 349 forks source link

Add bindings for MS MARCO V2.1 doc: prebuilt indexes, topics, qrels #1917

Closed lintool closed 1 month ago

lintool commented 1 month ago

This works:

export OUTPUT_DIR="runs"

# doc condition
TOPICS=(msmarco-v2-doc.dev msmarco-v2-doc.dev2 dl21-doc dl22-doc dl23-doc rag24.raggy-dev); for t in "${TOPICS[@]}"
do
    python -m pyserini.search.lucene --threads 16 --batch-size 128 --index msmarco-v2.1-doc --topics $t --output $OUTPUT_DIR/run.msmarco-v2.1.doc.${t}.txt --bm25
done

# doc-segmented condition
TOPICS=(msmarco-v2-doc.dev msmarco-v2-doc.dev2 dl21-doc dl22-doc dl23-doc rag24.raggy-dev); for t in "${TOPICS[@]}"
do
    python -m pyserini.search.lucene --threads 16 --batch-size 128 --index msmarco-v2.1-doc-segmented --topics $t --output $OUTPUT_DIR/run.msmarco-v2.1.doc-segmented.${t}.txt --bm25 --hits 10000 --max-passage-hits 1000 --max-passage
done