This repo provides a simple implementation of the ColBERT-v1 model.
The official GitHub repo: [stanford-futuredata/ColBERT](https://github.com/stanford-futuredata/ColBERT) (v1 branch)
ColBERT is a powerful late-interaction model that can perform both retrieval and reranking.
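Concretely, the late-interaction ("MaxSim") score is the sum, over query tokens, of each query token's maximum similarity to any document token. A minimal sketch of this operator, assuming the token embeddings are already L2-normalized as in the paper:

```python
import torch

def maxsim_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one document.

    Q: (num_query_tokens, dim) query token embeddings
    D: (num_doc_tokens, dim) document token embeddings
    Both are assumed L2-normalized, so the dot product is cosine similarity.
    """
    # (num_query_tokens, num_doc_tokens) token-level similarity matrix
    sim = Q @ D.T
    # for each query token, keep its best-matching document token, then sum
    return sim.max(dim=1).values.sum()
```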
```bash
conda create -n nanoColBERT python=3.8 && conda activate nanoColBERT
# install torch and faiss according to your CUDA version
pip install -r requirements.txt
```
Configure wandb and accelerate:

```bash
wandb login
accelerate config
```
Once everything is set up, launch the whole process with:

(If the download link has expired, please refer to https://github.com/Hannibal046/nanoColBERT/issues/5 and https://github.com/Hannibal046/nanoColBERT/issues/2.)

```bash
bash scripts/download.sh
bash scripts/run_colbert.sh
```

This will first download and preprocess the data, then train the model, build the index with FAISS, run retrieval, and compute the evaluation scores.
These are our reproduced results:

| | MRR@10 | Recall@50 | Recall@200 | Recall@1000 |
|---|---|---|---|---|
| Reported | 36.0 | 82.9 | 92.3 | 96.8 |
| nanoColBERT | 36.0 | 83.3 | 91.9 | 96.3 |
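For context, MRR@10 is the mean over queries of the reciprocal rank of the first relevant passage within the top 10, reported above as a percentage. A minimal sketch of the metric (the variable names are illustrative, not from this repo's evaluation code):

```python
def mrr_at_10(ranked_pids, relevant_pids):
    """Reciprocal rank of the first relevant passage in the top 10, else 0."""
    for rank, pid in enumerate(ranked_pids[:10], start=1):
        if pid in relevant_pids:
            return 1.0 / rank
    return 0.0

# averaged over all queries; multiply by 100 to match the table above:
# mrr = 100 * sum(mrr_at_10(run[q], qrels[q]) for q in qrels) / len(qrels)
```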
Please be aware that this repository serves solely as a conceptual guide and has not been heavily optimized for efficiency.
The following table shows the duration of each step:

| Step | Duration | Remark |
|---|---|---|
| tsv2mmap | 3h5min | |
| train | 8h54min | 400k steps on 1*A100 |
| doc2embedding | 56min | 8*A100 |
| build_index | 21min | 30% of training data with IVFPQ on 1*A100 |
| retrieve | 17min | 6980 samples on 1*A100 |
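To give a sense of what the build_index step involves, below is a minimal FAISS IVFPQ sketch. The dimensionality matches ColBERT-v1's 128-dim token embeddings, but `nlist`, the subquantizer settings, and `nprobe` are illustrative assumptions rather than this repo's actual configuration:

```python
import faiss
import numpy as np

dim = 128  # ColBERT-v1 uses 128-dim token embeddings
embeddings = np.random.rand(100_000, dim).astype("float32")  # stand-in for real doc embeddings

nlist, m, nbits = 256, 16, 8        # illustrative IVFPQ hyperparameters
quantizer = faiss.IndexFlatL2(dim)  # coarse quantizer for the inverted lists
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)

# train on a subsample (this repo uses ~30% of the embeddings), then add everything
index.train(embeddings[: len(embeddings) // 3])
index.add(embeddings)

index.nprobe = 32  # inverted lists probed per query
dists, ids = index.search(embeddings[:5], 10)
```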
We also provide our trained model on the Hugging Face Hub, and you can use it as follows:
```python
from model import ColBERT  # ColBERT class defined in this repo's model.py
from transformers import BertTokenizer

pretrained_model = "nanoColBERT/ColBERTv1"

model = ColBERT.from_pretrained(pretrained_model)
tokenizer = BertTokenizer.from_pretrained(pretrained_model)
```
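From there, scoring a query against a passage looks roughly like the sketch below. This is a hedged sketch, not the repo's exact API: it assumes the model's forward pass returns per-token embeddings, and the `[Q]`/`[D]` text markers only approximate the paper's special-token convention (see model.py for the real interface):

```python
import torch

query = "[Q] what is late interaction in neural IR?"       # marker usage is an assumption;
passage = "[D] ColBERT encodes queries and documents ..."   # check model.py for the real convention

q_inputs = tokenizer(query, return_tensors="pt")
d_inputs = tokenizer(passage, return_tensors="pt")

with torch.no_grad():
    Q = model(**q_inputs)  # assumed shape: (1, query_len, dim) token embeddings
    D = model(**d_inputs)  # assumed shape: (1, doc_len, dim) token embeddings

# MaxSim late interaction: best document match per query token, summed
score = (Q[0] @ D[0].T).max(dim=1).values.sum()
```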