Hey @egzhbdt, cool, thanks for this!
After chatting with the team, we think this doc might be best in pygaggle itself? Maybe you could start a docs/
directory for us and drop this in there? Please send a PR. We could have mutual pointers between the docs in anserini and pygaggle?
With the recent addition of bindings for the indexer in Python, I think this is resolved.
Hi, I am trying to find rerank_custom_collection.py. It seems the page is not available. Please advise.
@ghost?
I would appreciate it if you could follow up on my request above.
Hi @Fatima-200159617, apologies for the confusion, and thanks for flagging this. It was rerank_custom_collection.py.
Great, thanks.
Hi,
After seeing this issue, it might be good to round out the custom-collection doc with light instructions on passage reranking.
pygaggle has been a well-bundled and transparent resource for CORD-19, and should serve other text ranking tasks in the future. Below is a bottom-up snapshot of what it provides; it is free-style, though, and therefore less formalized.
## Rerank with monoBERT
Optionally, you can rerank the above retrieval results. We provide a minimal working example, `rerank_custom_collection.py`, for this. The example follows pygaggle and duoBERT (up to monoBERT in the figure below; figure source here), minus the part that evaluates against ground truth. It calls a reranker to score (query, passage) pairs. The reranker is a transformer model pre-trained on a general passage retrieval task such as MS MARCO.
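For orientation, here is a minimal sketch of the kind of scoring the script performs, using pygaggle's reranking API (assuming a pygaggle version that exposes `MonoBERT` in `pygaggle.rerank.transformer`; the query and passages are toy data, not taken from the script itself):

```python
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoBERT

# Load a monoBERT reranker; by default pygaggle pulls a model
# pre-trained on MS MARCO passage ranking.
reranker = MonoBERT()

# A toy query and candidate (docid, raw text) passages.
query = Query('incubation period of covid-19')
passages = [
    ('p1', 'The incubation period is around 5 days on average.'),
    ('p2', 'Vaccines are distributed through regional centers.'),
]
texts = [Text(text, metadata={'docid': docid}) for docid, text in passages]

# Score each (query, passage) pair and sort by descending score.
ranked = reranker.rerank(query, texts)
ranked.sort(key=lambda x: x.score, reverse=True)
for result in ranked:
    print(f"{result.metadata['docid']}\t{result.score:.4f}")
```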
### Prepare Input Files from the QuickStart Step

From the QuickStart step, you will need:

- `[OUTPUT_PATH]`
- `[QUERY_FILE_PATH]`
- `[PASSAGE_ID2TEXT_PATH]`: one `docid[\t]passage_raw_text[\n]` entry per line, with no header. We don't differentiate between a document and a passage in this use case, so `docid` refers to the passage id. A sketch for producing this file is shown after the list.
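For illustration, a minimal sketch that writes `[PASSAGE_ID2TEXT_PATH]` from an in-memory mapping (the output file name and the passages here are hypothetical):

```python
# Hypothetical passages; in practice these come from your collection.
passages = {
    'p1': 'The incubation period is around 5 days on average.',
    'p2': 'Vaccines are distributed through regional centers.',
}

# Write docid[\t]passage_raw_text[\n] lines with no header.
with open('passage_id2text.tsv', 'w', encoding='utf-8') as f:
    for docid, text in passages.items():
        # Collapse whitespace so stray tabs/newlines don't break the format.
        clean = ' '.join(text.split())
        f.write(f'{docid}\t{clean}\n')
```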
### Install Requirements
Download `requirements.txt` from pygaggle, then run `pip install -r requirements.txt`.
### Download the Pre-trained Reranker
Download `BERT_Base_trained_on_MSMARCO.zip` (roughly 1.1 GB) from nyu-dl/dl4marco-bert. Unzip it and save it to a path, referred to below as `[BERT_BASE_PSG_RETRIEVAL_MODEL_PATH]`.

Note: different transformers versions tend to read pre-trained model file names slightly differently. You might need to tweak file names a bit if you see an error such as `file not found`. For example, lowercase `[BERT_BASE_PSG_RETRIEVAL_MODEL_PATH]`, rename `bert_config.json` to `config.json`, and rename `model.ckpt-100000.index` to `model.ckpt.index`.
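If you do hit that error, here is a small sketch of those renames (the directory path is the placeholder from above; apply only the renames your transformers version complains about):

```python
from pathlib import Path

# Placeholder for your unzipped model directory.
model_dir = Path('[BERT_BASE_PSG_RETRIEVAL_MODEL_PATH]')

# File names that some transformers versions expect differently.
renames = {
    'bert_config.json': 'config.json',
    'model.ckpt-100000.index': 'model.ckpt.index',
}
for old, new in renames.items():
    src = model_dir / old
    if src.exists():
        src.rename(model_dir / new)
```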
### Run the Reranker with `rerank_custom_collection.py`
`[RERANKER_OUTPUT_PATH]` is the reranked output file. Each line has the format `qid[\t]query_text[\t]docid[\t]passage_text[\t]score[\n]`.
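As an illustration, a small sketch that parses this output and prints the top few passages per query (the path is the placeholder from above, and `top_k` is arbitrary):

```python
import csv
from collections import defaultdict

top_k = 3
by_query = defaultdict(list)

# Parse qid \t query_text \t docid \t passage_text \t score lines.
with open('[RERANKER_OUTPUT_PATH]', encoding='utf-8') as f:
    for qid, query, docid, passage, score in csv.reader(f, delimiter='\t'):
        by_query[qid].append((float(score), docid, passage))

for qid, hits in by_query.items():
    hits.sort(reverse=True)  # highest score first
    for score, docid, passage in hits[:top_k]:
        print(f'{qid}\t{docid}\t{score:.4f}\t{passage[:80]}')
```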
Screen the results again, and iterate on this workflow!