Passage reranking for custom collection

ghost commented 4 years ago

Hi,

It might be good to complete the custom collection doc with light instructions on passage reranking, after seeing this issue.

pygaggle has been a well-bundled and transparent resource for CORD-19, and should be useful for other text ranking tasks in the future. Below is a bottom-up snapshot of what it provides. It is, however, free-style and therefore less formalized.

Rerank with monoBERT

Optionally, you can rerank the retrieval results from above. We provide a minimal working example, rerank_custom_collection.py, for this.

The example follows pygaggle and duoBERT (up to the monoBERT stage in the figure below, figure source here), minus the part that evaluates against ground truth.

It calls a reranker to score (query, passage) pairs. The reranker is a transformer model pre-trained on a general passage retrieval task such as MS MARCO.
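To make the scoring step concrete, here is a minimal sketch of pointwise reranking. It is not the code in rerank_custom_collection.py: `toy_score` is a hypothetical token-overlap scorer standing in for the monoBERT model, and `rerank` is an illustrative helper name.

```python
from typing import Callable, List, Tuple


def toy_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for monoBERT: fraction of query tokens found in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(
    query: str,
    passages: List[Tuple[str, str]],
    score_fn: Callable[[str, str], float] = toy_score,
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair independently, then sort by score, best first."""
    scored = [(docid, score_fn(query, text)) for docid, text in passages]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    candidates = [
        ("d1", "cats sleep all day"),
        ("d2", "covid vaccine trial results"),
        ("d3", "vaccine storage"),
    ]
    for docid, score in rerank("covid vaccine", candidates):
        print(docid, score)
```

A real reranker replaces `toy_score` with a model forward pass; the surrounding loop and sort stay the same.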

[Figure: monoBERT]

Prepare Input Files from the QuickStart Step

The initial retrieval result file, named above as [OUTPUT_PATH].
The query file, named above as [QUERY_FILE_PATH].
A mapping file from passage id to raw passage text, named [PASSAGE_ID2TEXT_PATH].
- This mapping file is not produced by the steps above. Write a simple script to convert the initial collection into it.
- Each line has the format docid[\t]passage_raw_text[\n], with no header. We don't distinguish between a document and a passage in this use case, so docid refers to the passage id.
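The conversion script could look like the sketch below. It assumes the initial collection is a JSON Lines file whose records carry `id` and `contents` fields; adjust the field names if your collection differs. The function name `build_id2text` is illustrative only.

```python
import json


def build_id2text(collection_path: str, output_path: str) -> int:
    """Write one docid<TAB>passage_raw_text line per record; return the number of lines written.

    Assumes each input line is a JSON object with "id" and "contents" keys.
    """
    written = 0
    with open(collection_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            doc = json.loads(line)
            # Collapse tabs and newlines inside the text so each record stays on one line.
            text = " ".join(str(doc["contents"]).split())
            dst.write(f"{doc['id']}\t{text}\n")
            written += 1
    return written


if __name__ == "__main__":
    import sys

    # Usage: python build_id2text.py collection.jsonl passage_id2text.tsv
    print(build_id2text(sys.argv[1], sys.argv[2]))
```

Collapsing internal whitespace matters because a stray tab or newline in the passage text would break the two-column format.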

Install Requirements

Download the requirements.txt from pygaggle, then do pip install -r requirements.txt.

Download the Pre-trained Reranker

Download BERT_Base_trained_on_MSMARCO.zip (roughly 1.1 GB) from nyu-dl/dl4marco-bert.

Unzip and save to a path called [BERT_BASE_PSG_RETRIEVAL_MODEL_PATH].

Note: Different versions of the transformers library tend to expect slightly different file names for pre-trained models. If you hit errors such as file not found, you might need to tweak the file names a bit. For example, rename the [BERT_BASE_PSG_RETRIEVAL_MODEL_PATH] directory to lowercase, bert_config.json to config.json, and model.ckpt-100000.index to model.ckpt.index.
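The file renames in the note can be scripted. The sketch below covers only the two renames mentioned above; whether other checkpoint files also need renaming depends on your library version and is not handled here. The helper name `normalize_model_files` is hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Only the renames mentioned in the note above; extend as your setup requires.
RENAMES: Dict[str, str] = {
    "bert_config.json": "config.json",
    "model.ckpt-100000.index": "model.ckpt.index",
}


def normalize_model_files(model_dir: str, renames: Dict[str, str] = RENAMES) -> List[Tuple[str, str]]:
    """Rename files inside model_dir when the old name exists and the new one does not."""
    root = Path(model_dir)
    renamed = []
    for old_name, new_name in renames.items():
        src, dst = root / old_name, root / new_name
        if src.exists() and not dst.exists():
            src.rename(dst)
            renamed.append((old_name, new_name))
    return renamed


if __name__ == "__main__":
    import sys

    # Usage: python normalize_model_files.py <model directory>
    for old_name, new_name in normalize_model_files(sys.argv[1]):
        print(f"{old_name} -> {new_name}")
```

The function is safe to run twice: on the second run nothing matches, so it returns an empty list.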

Run Reranker with rerank_custom_collection.py

python rerank_custom_collection.py --search_output_file [OUTPUT_PATH] \
            --qid2query_file [QUERY_FILE_PATH] \
            --passage_text_file [PASSAGE_ID2TEXT_PATH] \
            --model_name_or_path [BERT_BASE_PSG_RETRIEVAL_MODEL_PATH] \
            --device [your_device_setting] --output_path [RERANKER_OUTPUT_PATH]

[RERANKER_OUTPUT_PATH] is the reranker output file.

Each line has the format qid[\t]query_text[\t]docid[\t]passage_text[\t]score[\n].
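For screening results, a small reader for the five-column output format described above may help. This is a sketch; `top_passages` is an illustrative name, and it assumes higher scores are better and that no field contains a tab.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def top_passages(rerank_path: str, k: int = 3) -> Dict[str, List[Tuple[str, float, str]]]:
    """Group output lines by qid and keep the k highest-scoring (docid, score, passage) rows."""
    by_query = defaultdict(list)
    with open(rerank_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Columns: qid, query_text, docid, passage_text, score
            qid, _query, docid, passage, score = line.split("\t")
            by_query[qid].append((docid, float(score), passage))
    return {
        qid: sorted(rows, key=lambda row: row[1], reverse=True)[:k]
        for qid, rows in by_query.items()
    }


if __name__ == "__main__":
    import sys

    # Usage: python top_passages.py <reranker output file>
    for qid, rows in top_passages(sys.argv[1]).items():
        for docid, score, passage in rows:
            print(qid, docid, f"{score:.4f}", passage[:80])
```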

Screen results again, and iterate on this workflow!

lintool commented 4 years ago

hey @egzhbdt cool, thanks for this!

After chatting with the team, we think this doc might be best in pygaggle itself. Maybe you could start a docs/ directory for us and drop this in there? Please send a PR. We could have mutual pointers between the docs in anserini and pygaggle.

lintool commented 4 years ago

ref: https://github.com/castorini/pygaggle/issues/21

lintool commented 4 years ago

With the recent addition of bindings for the indexer in Python, I think this is resolved.

Fatima-200159617 commented 4 years ago

Hi, I am trying to find rerank_custom_collection.py. It seems the page is not available. Please advise.

lintool commented 4 years ago

@ghost ?

Fatima-200159617 commented 4 years ago

I would appreciate an update on my request above.

Fatima-200159617 commented 4 years ago

Hi @Fatima-200159617, apologies for the confusion, and thanks for this. It was rerank_custom_collection.py.

Great, thanks.

castorini / anserini

Passage reranking for custom collection #1209
