Dense Hybrid Retrieval

In this repo, we introduce two approaches to training transformers that capture both semantic and lexical text representations for robust dense passage retrieval.

Aggretriever: A Simple Approach to Aggregate Textual Representation for Robust Dense Passage Retrieval. Sheng-Chieh Lin, Minghan Li, and Jimmy Lin. (TACL, just accepted)
A Dense Representation Framework for Lexical and Semantic Matching. Sheng-Chieh Lin and Jimmy Lin. (TOIS 2021, in press)

This repo contains three parts: (1) densify, (2) training (tevatron), and (3) retrieval. Our training code is taken mainly from Tevatron, with minor revisions.

Requirements

pip install "torch>=1.7.0"
pip install transformers==4.15.0
pip install pyserini
pip install beir

Huggingface Checkpoints

Model	Initialization	MARCO Dev	BEIR (13 public datasets)	Huggingface Path	Document
DeLADE+[CLS] plus	distilbert-base-uncased	37.1	49.8	jacklin/DeLADE-CLS-P	Read Me
DeLADE+[CLS]	distilbert-base-uncased	35.7	48.5	jacklin/DeLADE-CLS	Read Me
Aggretriever	distilbert-base-uncased	34.1	46.0	jacklin/DistilBERT-AGG	Read Me

Aggretriever

In this paper, we introduce a simple approach to aggregating token-level information into a single-vector dense representation. The document provides instructions for model training and evaluation on the MS MARCO passage ranking dataset, as well as for evaluation on the BEIR datasets.
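To make "aggregating token-level information" concrete, here is a minimal sketch of one common aggregation, element-wise max pooling over token vectors. This is an illustration only: the function name `max_pool_tokens` and the toy vectors are hypothetical, and the exact aggregation used by Aggretriever is described in the paper, not here.

```python
from typing import List, Sequence


def max_pool_tokens(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise max over token vectors, giving one vector per text."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    # zip(*rows) walks the matrix column by column (one column per dimension).
    return [max(column) for column in zip(*token_vectors)]


# Three hypothetical token vectors with three dimensions each.
tokens = [
    [0.1, 0.9, 0.0],
    [0.5, 0.2, 0.0],
    [0.3, 0.4, 0.7],
]
pooled = max_pool_tokens(tokens)
print(pooled)
```

Whatever the number of tokens, the pooled vector has the dimensionality of a single token vector, which is what makes it usable as a single-vector representation.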

A Dense Representation Framework for Lexical and Semantic Matching

In this paper, we introduce a unified representation framework for lexical and semantic matching. We first show how to use the framework to conduct retrieval with high-dimensional (lexical) representations, and then how to combine them with single-vector dense (semantic) representations for hybrid search.
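As a rough illustration of hybrid search, the sketch below fuses a precomputed lexical score with a semantic dot product through a weighted sum. The weight `lam`, the function `hybrid_rank`, and the toy scores are all assumptions made for this example; they are not the repo's API, and the fusion actually used is defined in the paper.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def hybrid_rank(
    query_sem: Sequence[float],
    lexical_scores: Dict[str, float],
    passage_sems: Dict[str, Sequence[float]],
    lam: float = 1.0,
) -> List[Tuple[str, float]]:
    """Rank passages by lexical score + lam * semantic score (assumed fusion)."""
    fused = {
        pid: lexical_scores.get(pid, 0.0) + lam * dot(query_sem, vec)
        for pid, vec in passage_sems.items()
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data: p1 matches lexically, p2 matches semantically.
query_sem = [1.0, 1.0]
lexical_scores = {"p1": 6.0, "p2": 0.0}
passage_sems = {"p1": [1.0, 0.0], "p2": [0.0, 5.0]}

ranking_balanced = hybrid_rank(query_sem, lexical_scores, passage_sems, lam=1.0)
ranking_semantic = hybrid_rank(query_sem, lexical_scores, passage_sems, lam=2.0)
print(ranking_balanced)
print(ranking_semantic)
```

Raising `lam` shifts the ranking toward the semantic signal, so in this toy case the two settings put different passages first.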

Dense Lexical Retrieval

We can densify any existing lexical matching model and conduct lexical matching on GPU. In the document, we demonstrate how to conduct end-to-end BM25 and uniCOIL retrieval under our framework. A detailed description can be found in our paper.
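The idea of densifying a lexical vector can be sketched as follows, assuming the vocabulary-sized vector is cut into contiguous slices, each slice is max-pooled, and the winning position is kept so that two vectors only interact where those positions agree. The names `densify` and `gated_inner_product` and the contiguous slicing are assumptions for this example rather than the repo's implementation.

```python
from typing import List, Sequence, Tuple

Densified = Tuple[List[float], List[int]]


def densify(sparse_vec: Sequence[float], num_slices: int) -> Densified:
    """Max-pool a vocabulary-sized vector into num_slices (value, index) pairs."""
    if len(sparse_vec) % num_slices != 0:
        raise ValueError("vector length must be divisible by num_slices")
    width = len(sparse_vec) // num_slices
    values, indices = [], []
    for s in range(num_slices):
        chunk = sparse_vec[s * width:(s + 1) * width]
        # Position of the largest weight inside this slice (first one on ties).
        best = max(range(width), key=lambda i: chunk[i])
        values.append(chunk[best])
        indices.append(best)
    return values, indices


def gated_inner_product(q: Densified, p: Densified) -> float:
    """Sum value products only for slices whose kept positions match."""
    q_val, q_idx = q
    p_val, p_idx = p
    return sum(
        qv * pv
        for qv, qi, pv, pi in zip(q_val, q_idx, p_val, p_idx)
        if qi == pi
    )


# Toy vocabulary of 8 terms, densified into 4 slices of width 2.
query = [0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
passage = [0.0, 3.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0]
score = gated_inner_product(densify(query, 4), densify(passage, 4))
print(score)
```

In this toy case the gated score equals the exact sparse dot product, because the only shared term survives pooling in both vectors. With wider slices, several terms compete for one slot and the score becomes an approximation.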