This repo provides a simple implementation of the ColBERT-v1 model.
The official GitHub repo: [stanford-futuredata/ColBERT](https://github.com/stanford-futuredata/ColBERT) (v1 branch)
ColBERT is a powerful late-interaction model that can perform both retrieval and reranking.
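Concretely, the late-interaction ("MaxSim") score is the sum, over query tokens, of each query token's maximum similarity to any document token. A minimal sketch of this operator, assuming the token embeddings are already L2-normalized as in the paper:

```python
import torch

def maxsim_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one document.

    Q: (num_query_tokens, dim) query token embeddings
    D: (num_doc_tokens, dim) document token embeddings
    Both are assumed L2-normalized, so the dot product is cosine similarity.
    """
    # (num_query_tokens, num_doc_tokens) token-level similarity matrix
    sim = Q @ D.T
    # for each query token, keep its best-matching document token, then sum
    return sim.max(dim=1).values.sum()
```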
```bash
conda create -n nanoColBERT python=3.8 && conda activate nanoColBERT
# install torch and faiss according to your CUDA version
pip install -r requirements.txt
```
Configure wandb and accelerate:

```bash
wandb login
accelerate config
```
Once everything is set up, launch the whole process with:

(If the download link has expired, please refer to https://github.com/Hannibal046/nanoColBERT/issues/5 and https://github.com/Hannibal046/nanoColBERT/issues/2.)

```bash
bash scripts/download.sh
bash scripts/run_colbert.sh
```

This will first download and preprocess the data, then train the model, build the index with FAISS, run retrieval, and compute the evaluation scores.
These are our reproduced results:

| | MRR@10 | Recall@50 | Recall@200 | Recall@1000 |
|---|---|---|---|---|
| Reported | 36.0 | 82.9 | 92.3 | 96.8 |
| nanoColBERT | 36.0 | 83.3 | 91.9 | 96.3 |
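For context, MRR@10 is the mean over queries of the reciprocal rank of the first relevant passage within the top 10, reported above as a percentage. A minimal sketch of the metric (the variable names are illustrative, not from this repo's evaluation code):

```python
def mrr_at_10(ranked_pids, relevant_pids):
    """Reciprocal rank of the first relevant passage in the top 10, else 0."""
    for rank, pid in enumerate(ranked_pids[:10], start=1):
        if pid in relevant_pids:
            return 1.0 / rank
    return 0.0

# averaged over all queries; multiply by 100 to match the table above:
# mrr = 100 * sum(mrr_at_10(run[q], qrels[q]) for q in qrels) / len(qrels)
```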
Please be aware that this repository serves solely as a conceptual guide and has not been heavily optimized for efficiency.
The following table shows the duration of each step:

| Step | Duration | Remark |
|---|---|---|
| tsv2mmap | 3h5min | |
| train | 8h54min | 400k steps on 1*A100 |
| doc2embedding | 56min | 8*A100 |
| build_index | 21min | 30% of training data with IVFPQ on 1*A100 |
| retrieve | 17min | 6980 samples on 1*A100 |
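To give a sense of what the build_index step involves, below is a minimal FAISS IVFPQ sketch. The dimensionality matches ColBERT-v1's 128-dim token embeddings, but `nlist`, the subquantizer settings, and `nprobe` are illustrative assumptions rather than this repo's actual configuration:

```python
import faiss
import numpy as np

dim = 128  # ColBERT-v1 uses 128-dim token embeddings
embeddings = np.random.rand(100_000, dim).astype("float32")  # stand-in for real doc embeddings

nlist, m, nbits = 256, 16, 8        # illustrative IVFPQ hyperparameters
quantizer = faiss.IndexFlatL2(dim)  # coarse quantizer for the inverted lists
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)

# train on a subsample (this repo uses ~30% of the embeddings), then add everything
index.train(embeddings[: len(embeddings) // 3])
index.add(embeddings)

index.nprobe = 32  # inverted lists probed per query
dists, ids = index.search(embeddings[:5], 10)
```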
We also provide our trained model on the Hugging Face Hub, and you can use it as follows:
```python
from model import ColBERT  # ColBERT class defined in this repo's model.py
from transformers import BertTokenizer

pretrained_model = "nanoColBERT/ColBERTv1"

model = ColBERT.from_pretrained(pretrained_model)
tokenizer = BertTokenizer.from_pretrained(pretrained_model)
```
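From there, scoring a query against a passage looks roughly like the sketch below. This is a hedged sketch, not the repo's exact API: it assumes the model's forward pass returns per-token embeddings, and the `[Q]`/`[D]` text markers only approximate the paper's special-token convention (see model.py for the real interface):

```python
import torch

query = "[Q] what is late interaction in neural IR?"       # marker usage is an assumption;
passage = "[D] ColBERT encodes queries and documents ..."   # check model.py for the real convention

q_inputs = tokenizer(query, return_tensors="pt")
d_inputs = tokenizer(passage, return_tensors="pt")

with torch.no_grad():
    Q = model(**q_inputs)  # assumed shape: (1, query_len, dim) token embeddings
    D = model(**d_inputs)  # assumed shape: (1, doc_len, dim) token embeddings

# MaxSim late interaction: best document match per query token, summed
score = (Q[0] @ D[0].T).max(dim=1).values.sum()
```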