UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Fine-tuning drops accuracy #1700

Open cmark opened 1 year ago

cmark commented 1 year ago

Hi All,

I'm trying to fine-tune an existing sentence-transformer model (all-MiniLM-L6-v2) to get better scores in my sentence similarity problem. Test data shows ~70% accuracy and I'd like to improve that slightly.

I was able to generate four different datasets, but fine-tuning on any of them drops accuracy.

I'm using a training script very similar to this (it just loads the training data from CSV, configures the model and loss, and then calls fit): https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py
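For context, the CSV-loading step before the examples reach model.fit looks roughly like this. The column names (anchor, positive, negative) and the sample rows are made up for illustration; the actual layout of my data may differ:

```python
import csv
import io

# Hypothetical CSV with (anchor, positive, negative) columns, matching the
# triples that TripletLoss expects. In the real script this would be read
# from a file and wrapped into InputExample objects for a DataLoader.
raw = """anchor,positive,negative
how to reset my password,password reset steps,shipping costs
order status,where is my order,change email address
"""

triples = [
    (row["anchor"], row["positive"], row["negative"])
    for row in csv.DictReader(io.StringIO(raw))
]
print(len(triples))
```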

I've tried varying the number of epochs (from 1 to 10), the learning_rate, and the weight_decay, but so far without any luck.

Am I doing this right? Is this an overfitting issue? Do I have to continue training the pretrained base model instead of fine-tuning?

UPDATE: by freezing all layers except the last two and the pooling layer, I was able to train further without losing too much accuracy. But once I select more than 25k training samples, the accuracy starts to deteriorate again and ends up at the same 40-50% final result, or even lower if I increase the number of epochs.
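The freezing described above amounts to setting requires_grad=False on all but the last layers. A minimal sketch on a generic torch layer stack (the Linear modules here are stand-ins, not the actual MiniLM internals; with sentence-transformers you would iterate over the underlying transformer's encoder layers instead):

```python
import torch.nn as nn

# Stand-in for a 6-layer encoder plus a pooling head (dims are MiniLM-like).
encoder = nn.ModuleList([nn.Linear(384, 384) for _ in range(6)])
pooling = nn.Linear(384, 384)

# Freeze everything first...
for layer in encoder:
    for p in layer.parameters():
        p.requires_grad = False

# ...then unfreeze only the last two layers (pooling stays trainable).
for layer in encoder[-2:]:
    for p in layer.parameters():
        p.requires_grad = True

all_layers = list(encoder) + [pooling]
trainable = sum(p.numel() for l in all_layers
                for p in l.parameters() if p.requires_grad)
total = sum(p.numel() for l in all_layers for p in l.parameters())
print(f"trainable: {trainable} / {total}")
```

Only 3 of the 7 stand-in modules (last two encoder layers plus pooling) contribute trainable parameters after this.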

Thanks, Mark

ddofer commented 1 year ago

What LR range did you use? (OOM). And poolings? My experience with fine-tuning the sentence transformers hasn't been great, honestly; I got the best results by keeping the default embeddings and fine-tuning a classifier on top. YMMV

cmark commented 1 year ago

Hi @ddofer,

We used the default LR configured in the model.fit method, 2e-05, and tried other values as well (2e-04, 1e-05, 1e-04).

The pooling layer was the default one included in the all-MiniLM-L6-v2 model. We tried a couple of loss functions, and the one that worked best for our kind of task was TripletLoss. Our use case is mostly to get better scores for certain phrases/words than the base scores given by the all-MiniLM-L6-v2 model. The model itself is really good and gives way better scores than our previous BM25 keyword search implementation, but we would like to improve it a bit in certain scenarios.
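For reference, the core objective behind a triplet loss is the standard triplet margin formulation: pull the anchor towards the positive and push it away from the negative by at least a margin. A minimal torch illustration with toy embeddings (the margin and Euclidean distance here are torch's TripletMarginLoss defaults, not necessarily the exact settings sentence-transformers' TripletLoss uses):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy (anchor, positive, negative) embeddings, batch of 4, dim 384,
# standing in for encoder outputs on (query, similar, dissimilar) triples.
anchor = torch.randn(4, 384)
positive = anchor + 0.1 * torch.randn(4, 384)  # close to the anchor
negative = torch.randn(4, 384)                 # unrelated

# loss = mean(max(0, d(a, p) - d(a, n) + margin))
loss_fn = nn.TripletMarginLoss(margin=1.0, p=2)
loss = loss_fn(anchor, positive, negative)
print(float(loss))
```

Here the negatives are already far from the anchors, so the margin is satisfied and the loss is zero; during real training, hard triples (where the negative is closer than positive + margin) are what actually produce gradients.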

Also, we followed this HF guide, which basically describes fine-tuning the same model with some example sentences.

Thanks!