THUIR / T2Ranking

T2Ranking: A large-scale Chinese benchmark for passage ranking.
https://huggingface.co/datasets/THUIR/T2Ranking

T2Ranking

Introduction

T2Ranking is a large-scale Chinese benchmark for passage ranking. The details about T2Ranking are elaborated in this paper.

Passage ranking is an important and challenging topic for both academia and industry in the area of Information Retrieval (IR). The goal of passage ranking is to compile a search result list, ordered by relevance to the query, from a large passage collection. Typically, passage ranking involves two stages: passage retrieval and passage re-ranking.
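
The two-stage setup can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the character-overlap and character-Jaccard scorers merely stand in for a real retriever (such as BM25 or a dual-encoder) and a real re-ranker (such as a cross-encoder), and the passages are invented.

```python
def char_overlap(query, passage):
    """Number of distinct query characters that also occur in the passage."""
    return len(set(query) & set(passage))


def char_jaccard(query, passage):
    """Jaccard similarity between the character sets of query and passage."""
    q, p = set(query), set(passage)
    return len(q & p) / len(q | p) if (q | p) else 0.0


def retrieve(query, collection, top_k):
    """Stage 1: score every passage cheaply and keep the top_k candidates."""
    ordered = sorted(
        collection,
        key=lambda pid: (-char_overlap(query, collection[pid]), pid),
    )
    return ordered[:top_k]


def rerank(query, candidates, collection, score_fn=char_jaccard):
    """Stage 2: re-order only the candidates with a (usually costlier) scorer."""
    return sorted(
        candidates,
        key=lambda pid: score_fn(query, collection[pid]),
        reverse=True,
    )


# Invented toy data, not taken from T2Ranking.
collection = {
    "p1": "中国的首都是哪里 北京",
    "p2": "北京是中国的首都",
    "p3": "今天天气很好",
}
query = "中国的首都"
candidates = retrieve(query, collection, top_k=2)
final = rerank(query, candidates, collection)
print(candidates, final)
```

The first stage scans the whole collection, so it has to be cheap; the second stage only sees the short candidate list, so it can afford a more expensive scoring function.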

To support passage ranking research, various benchmark datasets have been constructed. However, the commonly used datasets for passage ranking usually focus on English. For non-English scenarios, such as Chinese, the existing datasets are limited in data scale, lack fine-grained relevance annotation, and suffer from false negative issues.

To address this problem, we introduce T2Ranking, a large-scale Chinese benchmark for passage ranking. T2Ranking comprises more than 300K queries and over 2M unique passages from real-world search engines. Specifically, we sample question-based search queries from user logs of the Sogou search engine, a popular search system in China. For each query, we extract the content of corresponding documents from different search engines. After model-based passage segmentation and clustering-based passage de-duplication, a large-scale passage corpus is obtained. For a given query and its corresponding passages, we hire expert annotators to provide 4-level relevance judgments for each query-passage pair.

Table 1: Data statistics of datasets commonly used in passage ranking. FR (SR): first (second) stage of passage ranking, i.e., passage retrieval (re-ranking).

Compared with existing datasets, T2Ranking dataset has the following characteristics and advantages:

Data Download

The whole dataset is hosted on Hugging Face, and the data formats are presented in the following table.

| Description | Filename | Num Records | Format |
|----------------------------|---------------------------|------------:|----------------------------:|
| Collection | collection.tsv | 2,303,643 | tsv: pid, passage |
| Queries Train | queries.train.tsv | 258,042 | tsv: qid, query |
| Queries Dev | queries.dev.tsv | 24,832 | tsv: qid, query |
| Queries Test | queries.test.tsv | 24,832 | tsv: qid, query |
| Qrels Train for re-ranking | qrels.train.tsv | 1,613,421 | TREC qrels format |
| Qrels Dev for re-ranking | qrels.dev.tsv | 400,536 | TREC qrels format |
| Qrels Retrieval Train | qrels.retrieval.train.tsv | 744,663 | tsv: qid, pid |
| Qrels Retrieval Dev | qrels.retrieval.dev.tsv | 118,933 | tsv: qid, pid |
| BM25 Negatives | train.bm25.tsv | 200,359,731 | tsv: qid, pid, index |
| Hard Negatives | train.mined.tsv | 200,376,001 | tsv: qid, pid, index, score |
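
As a rough illustration of the formats listed above, the following sketch reads a two-column TSV (qid/query or pid/passage) and a TREC-style qrels file. The helper names `read_tsv_pairs` and `read_qrels` are hypothetical and not part of this repository. Whether the files start with a header row is an assumption controlled by `has_header`, and the four-column qrels layout (qid, iteration, pid, relevance) is the conventional TREC one; check a few lines of the real files before relying on either.

```python
import csv
import io
from collections import defaultdict


def read_tsv_pairs(handle, has_header=True):
    """Read a two-column TSV (qid/query or pid/passage) into a dict."""
    # QUOTE_NONE keeps quotation marks inside passages as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # assumption: the first row holds column names
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def read_qrels(handle, has_header=True):
    """Read TREC-style qrels rows (qid, iteration, pid, relevance)."""
    qrels = defaultdict(dict)
    for line_no, line in enumerate(handle):
        parts = line.split()
        if (has_header and line_no == 0) or len(parts) != 4:
            continue  # skip the assumed header row and malformed lines
        qid, _iteration, pid, rel = parts
        qrels[qid][pid] = int(rel)
    return dict(qrels)


# In-memory samples with invented content, shaped like the files above.
queries = read_tsv_pairs(io.StringIO("qid\tquery\n1\t什么是信息检索\n"))
qrels = read_qrels(io.StringIO("qid\titer\tpid\trel\n1\t0\t10\t3\n1\t0\t11\t0\n"))
print(queries, qrels)
```

For the real files, pass a handle opened with `open(path, encoding="utf-8", newline="")` instead of `io.StringIO`. Loading `collection.tsv` this way keeps all 2,303,643 passages in memory at once.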

You can download the dataset by running the following command:

git lfs install
git clone https://huggingface.co/datasets/THUIR/T2Ranking

After downloading, you can find the following files in the folder:

├── data
│   ├── collection.tsv
│   ├── qrels.dev.tsv
│   ├── qrels.retrieval.dev.tsv
│   ├── qrels.retrieval.train.tsv
│   ├── qrels.train.tsv
│   ├── queries.dev.tsv
│   ├── queries.test.tsv
│   ├── queries.train.tsv
│   ├── train.bm25.tsv
│   └── train.mined.tsv
├── script
│   ├── train_cross_encoder.sh
│   └── train_dual_encoder.sh
└── src
    ├── convert2trec.py
    ├── dataset_factory.py
    ├── modeling.py
    ├── msmarco_eval.py
    ├── train_cross_encoder.py
    ├── train_dual_encoder.py
    └── utils.py
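
A quick way to confirm that the clone is complete is to compare the `data` folder against the listing above. This is a minimal sketch; `missing_data_files` is a hypothetical helper, not a script shipped with the repository.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_DATA_FILES = [
    "collection.tsv",
    "qrels.dev.tsv",
    "qrels.retrieval.dev.tsv",
    "qrels.retrieval.train.tsv",
    "qrels.train.tsv",
    "queries.dev.tsv",
    "queries.test.tsv",
    "queries.train.tsv",
    "train.bm25.tsv",
    "train.mined.tsv",
]


def missing_data_files(repo_root):
    """Return the expected data files that are absent under repo_root/data."""
    data_dir = Path(repo_root) / "data"
    return [name for name in EXPECTED_DATA_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the root of the cloned repository.
    print(missing_data_files("."))
```

An empty list means every file is present. A file that exists but is only a few hundred bytes is probably a Git LFS pointer, which usually means `git lfs install` was not run before cloning.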

Training and Evaluation

The dual-encoder can be trained by running the following command:

sh script/train_dual_encoder.sh

After training the model, you can evaluate the model by running the following command:

python src/msmarco_eval.py data/qrels.retrieval.dev.tsv output/res.top1000.step20
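
For reference, the retrieval metrics reported further down (MRR@10 and recall@k) can be sketched as follows. This uses the conventional definitions and is not the code of `src/msmarco_eval.py`, which may differ in details such as which queries it averages over. The function names and the toy data are assumptions for illustration.

```python
def mrr_at_k(ranked, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    ranked:   dict mapping qid -> list of pids in rank order
    relevant: dict mapping qid -> set of relevant pids
    """
    if not relevant:
        return 0.0
    total = 0.0
    for qid, rel_pids in relevant.items():
        for rank, pid in enumerate(ranked.get(qid, [])[:k], start=1):
            if pid in rel_pids:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(relevant)


def recall_at_k(ranked, relevant, k):
    """Average fraction of each query's relevant passages found in the top k."""
    if not relevant:
        return 0.0
    total = 0.0
    for qid, rel_pids in relevant.items():
        if rel_pids:
            retrieved = set(ranked.get(qid, [])[:k])
            total += len(retrieved & rel_pids) / len(rel_pids)
    return total / len(relevant)


# Invented toy run: two queries with two and one relevant passages.
relevant = {"q1": {"a", "b"}, "q2": {"c"}}
ranked = {"q1": ["x", "a", "b"], "q2": ["c", "y"]}
print(mrr_at_k(ranked, relevant), recall_at_k(ranked, relevant, 1))
```

Under this definition a query with several relevant passages can contribute at most 1/n to recall@1, which is why recall@1 can be much lower than MRR@10 on the same run.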

The cross-encoder can be trained by running the following command:

sh script/train_cross_encoder.sh

After training the model, you can evaluate the model by running the following command:

python src/convert2trec.py output/res.step-20 && python src/msmarco_eval.py data/qrels.retrieval.dev.tsv output/res.step-20.trec && path_to/trec_eval -m ndcg_cut.5 data/qrels.dev.tsv output/res.step-20.trec
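
The `ndcg_cut` measure requested from `trec_eval` uses the graded judgments in `data/qrels.dev.tsv`. Below is a minimal sketch of NDCG@k for a single query, assuming linear gains (the relevance label itself) and a log2 rank discount. This is a common formulation; `trec_eval` remains the reference implementation, and its handling of ties or unjudged passages may differ. The function names and the toy judgments are illustrative only.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains given in rank order."""
    # Rank 1 has discount log2(2) = 1, rank 2 has log2(3), and so on.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_pids, judgments, k):
    """NDCG@k for one query.

    ranked_pids: list of pids in rank order
    judgments:   dict mapping pid -> graded relevance (e.g. 0..3)
    Unjudged passages are treated as non-relevant (gain 0).
    """
    gains = [judgments.get(pid, 0) for pid in ranked_pids[:k]]
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Invented judgments on the 4-level scale.
judgments = {"a": 3, "b": 2, "c": 0}
print(ndcg_at_k(["a", "b", "c"], judgments, k=3))
print(ndcg_at_k(["c", "b", "a"], judgments, k=3))
```

A run-level score is the mean of the per-query values over all judged queries.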

We have uploaded some checkpoints to the Hugging Face Hub.

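| Model | Description | Link |
|----------------|-----------------------------------------------------|------|
| dual-encoder 1 | dual-encoder trained with bm25 negatives | DE1 |
| dual-encoder 2 | dual-encoder trained with self-mined hard negatives | DE2 |
| cross-encoder | cross-encoder trained with self-mined hard negatives | CE |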

BM25 on DEV set

#####################
MRR @10: 0.35894801237316354
QueriesRanked: 24831
recall@1: 0.05098711868967141
recall@1000: 0.7464097131133757
recall@50: 0.4942572226146033
#####################

DPR trained with BM25 negatives on DEV set

#####################
MRR @10: 0.4856112079562753
QueriesRanked: 24831
recall@1: 0.07367235058688999
recall@1000: 0.9082753169878586
recall@50: 0.7099350889583964
#####################

DPR trained with self-mined hard negatives on DEV set

#####################
MRR @10: 0.5166915171959451
QueriesRanked: 24831
recall@1: 0.08047455688965123
recall@1000: 0.9135220125786163
recall@50: 0.7327044025157232
#####################

BM25 retrieved+CE reranked on DEV set

The reranked run file is available here.

#####################
MRR @10: 0.5188107959009376
QueriesRanked: 24831
recall@1: 0.08545219116806242
recall@1000: 0.7464097131133757
recall@50: 0.595298153566744
#####################
ndcg_cut_20             all     0.4405
ndcg_cut_100            all     0.4705
#####################

DPR retrieved+CE reranked on DEV set

The reranked run file is available here.

#####################
MRR @10: 0.5508822816845231
QueriesRanked: 24831
recall@1: 0.08903406988867588
recall@1000: 0.9135220125786163
recall@50: 0.7393720781623112
#####################
ndcg_cut_20             all     0.5131
ndcg_cut_100            all     0.5564
#####################

License

The dataset is licensed under the Apache License 2.0.

Citation

If you use this dataset in your research, please cite our paper:

@misc{xie2023t2ranking,
      title={T2Ranking: A large-scale Chinese Benchmark for Passage Ranking}, 
      author={Xiaohui Xie and Qian Dong and Bingning Wang and Feiyang Lv and Ting Yao and Weinan Gan and Zhijing Wu and Xiangsheng Li and Haitao Li and Yiqun Liu and Jin Ma},
      year={2023},
      eprint={2304.03679},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}