castorini / dhr

Dense hybrid representations for text retrieval
61 stars 5 forks source link

Dense Hybrid Retrieval

In this repo, we introduce two approaches to training transformers to capture semantic and lexical text representations for robust dense passage retrieval.

  1. Aggretriever: A Simple Approach to Aggregate Textual Representation for Robust Dense Passage Retrieval Sheng-Chieh Lin, Minghan Li and Jimmy Lin. (TACL just accepted)
  2. A Dense Representation Framework for Lexical and Semantic Matching Sheng-Chieh Lin and Jimmy Lin. (TOIS 2021 in press)

This repo contains three parts: (1) densify (2) training (tevatron) (3) retrieval. Our training code is mainly from Tevatron with a minor revision.

Requirements

pip install torch>=1.7.0
pip install transformers==4.15.0
pip install pyserini
pip install beir

Huggingface Checkpoints

Model Initialization MARCO Dev BEIR (13 public datasets) Huggingface Path Document
DeLADE+[CLS] plus distilbert-base-uncased 37.1 49.8 jacklin/DeLADE-CLS-P Read Me
DeLADE+[CLS] distilbert-base-uncased 35.7 48.5 jacklin/DeLADE-CLS Read Me
Aggretriever distilbert-base-uncased 34.1 46.0 jacklin/DistilBERT-AGG Read Me

Aggretriever

In this paper, we introduce a simple approach to aggregating token-level information into a single-vector dense representation. We provide instruction for model training and evaluation on MS MARCO passage ranking dataset in the document. We also provide instruction for the evaluation on BEIR datasets in the document.

A Dense Representation Framework for Lexical and Semantic Matching

In this paper, we introduce a unified representation framework for Lexical and Semantic Matching. We first introduce how to use our framework to conduct retrieval for high-dimensional (lexcial) representations and combine with single-vector dense (semantic) representations for hybrid search.

Dense Lexical Retrieval

We can densify any existing lexical matching models and conduct lexical matching on GPU. In the document, we demonstrate how to conduct BM25 and uniCOIL end-to-end retrieval under our framework. Detailed description can be found in our paper.

Dense Hybrid Retrieval

With the densified lexical representations, we can easily conduct lexical and semantic hybrid retrieval using independent neural models. A document for hybrid retrieval will be coming soon.

Dense Hybrid Representation Model

In our paper, we propose a single model fusion approach by training the lexical and semantic components of a transformer while inference, we combine the densified lexical representations and dense representations as dense hybrid representations. Instead of training by yourself, you can also download our trained DeLADE-CLS-P, DeLADE-CLS and DeLADE and directly peform inference on MSMARCO Passage dataset (see document) or BEIR datasets (see document).