beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Training configuration for ColBERT #68

Closed jihyukkim-nlp closed 2 years ago

jihyukkim-nlp commented 2 years ago

Thank you for sharing this work!

Could you share the training configuration for ColBERT retriever?

Thanks in advance

thakur-nandan commented 2 years ago

Hi @jihyukkim-nlp,

For training the ColBERT retriever, we used the same training configuration as the original default training code described here: https://github.com/stanford-futuredata/ColBERT#training, with just one change: --doc_maxlen 300 instead of 180.

Our training triplets were the official MSMARCO train triplets.
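For reference, a sketch of what the training invocation would look like, based on the command in the ColBERT README with the single change noted above (paths, GPU IDs, and experiment/run names are placeholders, not values from the BEIR setup):

```shell
# ColBERT training as in the original README, with --doc_maxlen raised to 300.
# Adjust CUDA_VISIBLE_DEVICES, --nproc_per_node, and all paths for your setup.
CUDA_VISIBLE_DEVICES="0,1,2,3" \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.train --amp --doc_maxlen 300 --mask-punctuation --bsize 32 --accum 1 \
--triples /path/to/MSMARCO/triples.train.small.tsv \
--root /path/to/experiments/ --experiment MSMARCO-psg \
--similarity l2 --run msmarco.psg.l2
```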

Kind Regards, Nandan Thakur

jihyukkim-nlp commented 2 years ago

Thanks for the heads-up :)

thakur-nandan commented 2 years ago

If it helps, you can find my ColBERT model checkpoint here: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/models/ColBERT/msmarco.psg.l2.zip

Kind Regards, Nandan Thakur

jihyukkim-nlp commented 2 years ago

Thank you for sharing. I had already found the URL in the paper, and it helped me a lot.

But I also wanted to further analyze the training process of ColBERT under the same training configuration, and now I can do this. Thanks!

Best regards, Jihyuk Kim

cramraj8 commented 9 months ago

Hi @thakur-nandan, is there a reference for how many partitions (NUM_PARTITIONS) are used in the ColBERT FAISS search for each BEIR dataset? The default is set to 32768 in the original repo, but 96 is used in your evaluation script (https://github.com/thakur-nandan/beir-ColBERT/blob/91190882deac1792c78b3c33d51be9edaa9c6805/evaluate_beir.sh#L26). I wonder if that value was changed per dataset.
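For context, the number of partitions here is the IVF `nlist` of the FAISS index over all token embeddings, and a common rule of thumb is to pick it on the order of a few times the square root of the number of embeddings (so it naturally varies with corpus size). A minimal sketch of such a heuristic; the function name and the multiplier are illustrative assumptions, not taken from the BEIR or ColBERT code:

```python
import math

def suggest_num_partitions(num_embeddings: int, multiplier: int = 8) -> int:
    """Heuristic: round multiplier * sqrt(num_embeddings) down to a power of two.

    Small corpora (e.g. some BEIR datasets) thus get far fewer partitions
    than the 32768 default used for MSMARCO-scale token-embedding counts.
    """
    raw = multiplier * math.sqrt(num_embeddings)
    return 1 << int(math.floor(math.log2(raw)))

# Example: 1M token embeddings -> 4096 partitions; 100 embeddings -> 64.
print(suggest_num_partitions(1_000_000))  # 4096
print(suggest_num_partitions(100))        # 64
```

Under this kind of heuristic, a small dataset could plausibly end up with a value near 96 rather than 32768, which may explain the discrepancy between the default and the evaluation script.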