UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

train bi-encoder with MS MARCO #1060

Open zbrnwpu opened 3 years ago

zbrnwpu commented 3 years ago

Hello @nreimers, I would like to ask how long it took you to train the model on the MS MARCO data. I used a Tesla V100, and training takes 40 h for one epoch. I am a beginner, thank you for your answers!

nreimers commented 3 years ago

Which version of the training do you use?

In the latest version, 1 epoch is one iteration over the 500k train queries. On a V100, it takes less than an hour for 1 epoch.

In version 2, 1 epoch had nearly 400 million train triplets. There you trained just for some time and then stopped the process.
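
For orientation, here is a minimal sketch of how a pass over (query, positive, negative) triplets can be set up with sentence-transformers and `MultipleNegativesRankingLoss`. The model name, file name, and hyperparameters are placeholders, not the exact values used in the training scripts:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("distilbert-base-uncased")  # placeholder base model

# Hypothetical TSV with one "query \t positive_passage \t negative_passage" triplet per line.
train_examples = []
with open("msmarco_triplets.tsv", encoding="utf8") as fIn:
    for line in fIn:
        query, positive, negative = line.rstrip("\n").split("\t")
        train_examples.append(InputExample(texts=[query, positive, negative]))

# One epoch below = one pass over these examples.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1000,
    use_amp=True,  # mixed precision speeds up training on a V100
)
```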

zbrnwpu commented 3 years ago

I am using version 3, and I found that the training data of version 3 is also triplets. I only changed the data-loading code; the rest of the code is the same as yours. My training data has 2,456,171 triplets. Due to GPU memory limitations, the batch size can only be set to 10. @nreimers

nreimers commented 3 years ago

With batch size 10 - not sure if your model will be good. Larger batch size => better model.
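
If the loss is `MultipleNegativesRankingLoss` (as in the MS MARCO MNRL example script), every other positive in a batch acts as an additional negative for each query, so batch size 10 yields only 9 in-batch negatives. Below is a rough sketch of two knobs that can free enough memory for a larger batch; the values are illustrative, not taken from the training script:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distilbert-base-uncased")  # placeholder base model

# MS MARCO passages are short; truncating earlier saves activation memory and
# usually allows a noticeably larger batch on the same GPU.
model.max_seq_length = 256  # illustrative value

# Mixed precision roughly halves activation memory on a V100; it is a fit() flag:
# model.fit(..., use_amp=True)
```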

zbrnwpu commented 3 years ago

> With batch size 10 - not sure if your model will be good. Larger batch size => better model.

Thank you very much. I also want to ask: what is the difference between your train_bi_encoder_V2 and train_bi_encoder_V3 codes? Thank you for your reply.

nreimers commented 3 years ago

v2 uses hard negatives that are provided by the task organizers, which have been sourced using BM25.

v3 uses different systems to mine passages that are close to a query. A cross-encoder then scores whether they are relevant to the query or not.
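
A minimal sketch of that scoring step, using one of the publicly released MS MARCO cross-encoders; the query, passages, and the way the scores are then filtered are placeholders for illustration, not the exact logic of the v3 script:

```python
from sentence_transformers import CrossEncoder

# A publicly released MS MARCO cross-encoder; any similar re-ranker would work here.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what is the capital of france"  # placeholder query
mined_passages = [
    "Paris is the capital and most populous city of France.",  # likely relevant
    "The Eiffel Tower was completed in 1889.",                 # likely not relevant
]

# Score each (query, passage) pair; higher score = more likely relevant.
scores = cross_encoder.predict([(query, p) for p in mined_passages])

# Rank the mined passages by cross-encoder score; passages judged non-relevant
# can then be kept as hard negatives for bi-encoder training.
for passage, score in sorted(zip(mined_passages, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:8.3f}  {passage}")
```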