UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Training process hyperparameters suggestion #1290

Open · ewayuan opened this issue 2 years ago

ewayuan commented 2 years ago

Hi, I'm planning to continue training a bi-encoder based on the all-mpnet-base-v2 model on my own training dataset, which contains over one million training examples.

Could you give me any suggestions on the number of epochs (is higher better? How many epochs are appropriate for one million training examples?), the learning rate, and the batch size? Also, which loss function would be better?

nreimers commented 2 years ago

If you use MultipleNegativesRankingLoss, the batch size is important: the larger, the better.

All other hyperparameters are rather unimportant.
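
A minimal sketch of what that setup can look like with the library's fit() API; the example pairs, the batch size of 128, and the warmup steps below are placeholders, not values from this thread:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Continue training from the pretrained checkpoint mentioned above
model = SentenceTransformer("all-mpnet-base-v2")

# MultipleNegativesRankingLoss expects (anchor, positive) pairs; all other
# examples in the batch serve as negatives, so a larger batch size gives
# more (and harder) in-batch negatives per step.
train_examples = [
    InputExample(texts=["query 1", "relevant passage 1"]),
    InputExample(texts=["query 2", "relevant passage 2"]),
    # ... in practice, the ~1M pairs of the training set
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)  # as large as memory allows
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```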

ewayuan commented 2 years ago

> If you use MultipleNegativesRankingLoss, the batch size is important: the larger, the better.
>
> All other hyperparameters are rather unimportant.

What about CosineSimilarityLoss? Is the number of epochs important there? For some datasets, 1 epoch works better than 10, but for others, 10 epochs work better than 1. How can I find the proper number of epochs?

nreimers commented 2 years ago

There, a batch size of 32 is sufficient. Check the scores on the dev set to see how long you need to train.
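
A sketch of how that dev-set check might be wired in with CosineSimilarityLoss; the sentences, gold scores, evaluation interval, and output path are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("all-mpnet-base-v2")

# CosineSimilarityLoss expects sentence pairs labeled with a similarity score
train_examples = [
    InputExample(texts=["sentence A", "sentence B"], label=0.8),
    InputExample(texts=["sentence C", "sentence D"], label=0.3),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

# Held-out dev set, evaluated periodically during training; if the score
# plateaus or drops between epochs, more epochs will not help.
dev_evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["dev sentence A"],
    sentences2=["dev sentence B"],
    scores=[0.9],
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=dev_evaluator,
    evaluation_steps=1000,
    epochs=10,
    warmup_steps=100,
    output_path="output/my-model",  # the best checkpoint on the dev set is saved here
)
```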

ewayuan commented 2 years ago

After I changed CosineSimilarityLoss to MultipleNegativesRankingLoss, with NoDuplicatesDataLoader, the performance dropped significantly.
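
For reference, a sketch of how NoDuplicatesDataLoader is typically combined with MultipleNegativesRankingLoss. One detail worth checking when switching losses: this loss expects unlabeled (anchor, positive) pairs, so score-labeled pairs from a CosineSimilarityLoss setup cannot be reused as-is. The example data and batch size below are placeholders:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.datasets import NoDuplicatesDataLoader

model = SentenceTransformer("all-mpnet-base-v2")

# Unlabeled (anchor, positive) pairs; the other examples in each batch
# act as negatives, so no similarity labels are used.
train_examples = [
    InputExample(texts=["query 1", "relevant passage 1"]),
    InputExample(texts=["query 2", "relevant passage 2"]),
    # ... in practice, far more pairs than the batch size
]

# NoDuplicatesDataLoader keeps each batch free of repeated texts, which
# would otherwise show up as false in-batch negatives.
train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```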