beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Training configuration for ColBERT #68

Closed jihyukkim-nlp closed 2 years ago

jihyukkim-nlp commented 2 years ago

Thank you for sharing this work!

Could you share the training configuration for ColBERT retriever?

Thanks in advance

thakur-nandan commented 2 years ago

Hi @jihyukkim-nlp,

For training the ColBERT retriever, we used the same training configuration as the original default training code described here: https://github.com/stanford-futuredata/ColBERT#training, with just one change: --doc_maxlen 300 instead of 180.

Our training triplets were the official MSMARCO train triplets.
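For reference, a sketch of what the training invocation would look like, based on the command in the ColBERT README with the single change noted above (paths, GPU IDs, and experiment/run names are placeholders, not values from the BEIR setup):

```shell
# ColBERT training as in the original README, with --doc_maxlen raised to 300.
# Adjust CUDA_VISIBLE_DEVICES, --nproc_per_node, and all paths for your setup.
CUDA_VISIBLE_DEVICES="0,1,2,3" \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.train --amp --doc_maxlen 300 --mask-punctuation --bsize 32 --accum 1 \
--triples /path/to/MSMARCO/triples.train.small.tsv \
--root /path/to/experiments/ --experiment MSMARCO-psg \
--similarity l2 --run msmarco.psg.l2
```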

Kind Regards, Nandan Thakur

jihyukkim-nlp commented 2 years ago

Thanks for the heads-up :)

thakur-nandan commented 2 years ago

If it helps, you can find my ColBERT model checkpoint here: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/models/ColBERT/msmarco.psg.l2.zip

Kind Regards, Nandan Thakur

jihyukkim-nlp commented 2 years ago

Thank you for sharing. I had already found the URL in the paper, and it helped me a lot.

But I also wanted to further analyze the training process of ColBERT under the same training configuration, and now I can do this. Thanks!

Best regards, Jihyuk Kim

cramraj8 commented 9 months ago

Hi @thakur-nandan, is there a reference for how many partitions (NUM_PARTITIONS) are used in the ColBERT FAISS search for each BEIR dataset? The default is set to 32768 in the original repo, but 96 is used in your evaluation script (https://github.com/thakur-nandan/beir-ColBERT/blob/91190882deac1792c78b3c33d51be9edaa9c6805/evaluate_beir.sh#L26). I wonder if that value was changed per dataset.
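For context, the number of partitions here is the IVF `nlist` of the FAISS index over all token embeddings, and a common rule of thumb is to pick it on the order of a few times the square root of the number of embeddings (so it naturally varies with corpus size). A minimal sketch of such a heuristic; the function name and the multiplier are illustrative assumptions, not taken from the BEIR or ColBERT code:

```python
import math

def suggest_num_partitions(num_embeddings: int, multiplier: int = 8) -> int:
    """Heuristic: round multiplier * sqrt(num_embeddings) down to a power of two.

    Small corpora (e.g. some BEIR datasets) thus get far fewer partitions
    than the 32768 default used for MSMARCO-scale token-embedding counts.
    """
    raw = multiplier * math.sqrt(num_embeddings)
    return 1 << int(math.floor(math.log2(raw)))

# Example: 1M token embeddings -> 4096 partitions; 100 embeddings -> 64.
print(suggest_num_partitions(1_000_000))  # 4096
print(suggest_num_partitions(100))        # 64
```

Under this kind of heuristic, a small dataset could plausibly end up with a value near 96 rather than 32768, which may explain the discrepancy between the default and the evaluation script.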