UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Fine tuning using ALBERT #263

Open · Akshayextreme opened 4 years ago

Akshayextreme commented 4 years ago

I have gone through older issues, and @nreimers has pointed out many times that the ALBERT model does not perform very well with sentence-transformers. I am perfectly fine with ~5-10 points lower performance than BERT, but after training ALBERT for 1 epoch on the AllNLI dataset I got awful results.

ALBERT-large-V1
2020-06-08 18:20:28 - Cosine-Similarity :     Pearson: 0.1973 Spearman: 0.2404
2020-06-08 18:20:28 - Manhattan-Distance:     Pearson: 0.2318 Spearman: 0.2411
2020-06-08 18:20:28 - Euclidean-Distance:     Pearson: 0.2313 Spearman: 0.2408
2020-06-08 18:20:28 - Dot-Product-Similarity: Pearson: 0.1437 Spearman: 0.1551

ALBERT-large-V2
2020-06-09 03:58:27 - Cosine-Similarity :     Pearson: 0.0722 Spearman: 0.0633
2020-06-09 03:58:27 - Manhattan-Distance:     Pearson: 0.1236 Spearman: 0.1089
2020-06-09 03:58:27 - Euclidean-Distance:     Pearson: 0.1237 Spearman: 0.1090
2020-06-09 03:58:27 - Dot-Product-Similarity: Pearson: 0.1047 Spearman: 0.0900

I am using all the default parameters from the training script:

python /content/sentence-transformers/examples/training_transformers/training_nli.py 'albert-large-v1'
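For reference, a minimal sketch of roughly what that script does, using the sentence-transformers training API of that era (models.Transformer, models.Pooling, losses.SoftmaxLoss); the toy data, batch size, and warmup steps here are illustrative, not the script's exact values:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# ALBERT as the word embedding model, with mean pooling on top
word_embedding_model = models.Transformer('albert-large-v1', max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# AllNLI is a 3-way classification task (entailment / neutral / contradiction)
train_examples = [InputExample(texts=['A man eats.', 'A person eats.'], label=0)]  # toy data
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```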

I checked the similarity_evaluation_results file after fine-tuning. For ALBERT-large-V2, all cosine_pearson values are NaN; for ALBERT-large-V1, after an initial increase to 0.24 the score stagnates.
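For anyone reproducing this, a quick way to inspect that evaluator output; the output path below is an assumption, and the column names follow the CSV that EmbeddingSimilarityEvaluator writes:

```python
import pandas as pd

# Path is illustrative; the evaluator writes its results CSV into the model's output directory
df = pd.read_csv('output/training_nli_albert-large-v2/similarity_evaluation_results.csv')

print(df[['epoch', 'steps', 'cosine_pearson', 'cosine_spearman']].tail())
print('NaN cosine_pearson rows:', df['cosine_pearson'].isna().sum())
```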

It takes ~8 hrs on Google Colab to fine-tune ALBERT on the AllNLI dataset. Any pointers for getting at least respectable results? Am I doing anything wrong here?

nreimers commented 4 years ago

Hi @Akshayextreme

I recommend starting with the smaller models.

Large models (regardless of which one) are known to fail to converge from time to time. This is not limited to fine-tuning sentence embeddings; it happens for other classification tasks as well. Training these models again with another random seed will then yield good results.

Out of 10 random seeds, sometimes 1 run fails, sometimes 4 runs fail to learn anything.

With the smaller models, the issue occurs less often.
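A minimal sketch of that seed-restart strategy; the seed list and the build_and_train_model / evaluate_dev_spearman stubs are placeholders for your own training run, not anything from the repo:

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed Python, NumPy, and PyTorch so each restart is reproducible
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

best_score, best_seed = float('-inf'), None
for seed in [1, 2, 3]:  # try a few seeds; large models occasionally diverge
    set_seed(seed)
    model = build_and_train_model()        # placeholder: your training run
    score = evaluate_dev_spearman(model)   # placeholder: dev-set Spearman correlation
    if not np.isnan(score) and score > best_score:
        best_score, best_seed = score, seed

print(f'Best seed: {best_seed} (Spearman {best_score:.4f})')
```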

Akshayextreme commented 4 years ago

Thanks a lot @nreimers for the great advice.

I tried ALBERT-base-V2 and the logs are below. This is after training on AllNLI for 1 epoch.

2020-06-09 13:36:53 - Cosine-Similarity :     Pearson: 0.7272 Spearman: 0.7489
2020-06-09 13:36:53 - Manhattan-Distance:     Pearson: 0.7409 Spearman: 0.7341
2020-06-09 13:36:53 - Euclidean-Distance:     Pearson: 0.7466 Spearman: 0.7403
2020-06-09 13:36:53 - Dot-Product-Similarity: Pearson: 0.6769 Spearman: 0.6548

Akshayextreme commented 4 years ago

Over the coming 2 days, I will run some experiments on training with the ALBERT model. I will post here if I find anything useful, and then close this thread.

Akshayextreme commented 4 years ago

ALBERT-base-V2 Fine-tuned on STSb for 4 epochs

2020-06-09 15:15:07 - Cosine-Similarity :     Pearson: 0.7880 Spearman: 0.7861
2020-06-09 15:15:07 - Manhattan-Distance:     Pearson: 0.7558 Spearman: 0.7592
2020-06-09 15:15:07 - Euclidean-Distance:     Pearson: 0.7634 Spearman: 0.7657
2020-06-09 15:15:07 - Dot-Product-Similarity: Pearson: 0.7393 Spearman: 0.7338
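For context, a minimal sketch of that second fine-tuning stage (continuing the NLI-trained model on STSb with a regression objective); the model path and toy data are illustrative, and losses.CosineSimilarityLoss is the loss the repo's STSb examples use:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses, InputExample

# Start from the NLI-trained ALBERT model saved earlier (path is illustrative)
model = SentenceTransformer('output/training_nli_albert-base-v2')

# STSb pairs carry a similarity score normalized to [0, 1]
train_examples = [
    InputExample(texts=['A plane is taking off.', 'An air plane is taking off.'], label=1.0),  # toy data
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=100)
```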

knok commented 3 years ago

Just FYI: [2101.10642v1] Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks. According to the paper, a CNN-based structure over the token embeddings, instead of average pooling, gives better performance with ALBERT.
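A minimal sketch of what such an architecture could look like with the models.CNN module that ships with sentence-transformers; the channel count and kernel sizes below are illustrative assumptions, not the paper's exact configuration:

```python
from sentence_transformers import SentenceTransformer, models

# ALBERT token embeddings -> CNN over the token sequence -> mean pooling
word_embedding_model = models.Transformer('albert-base-v2', max_seq_length=128)
cnn = models.CNN(
    in_word_embedding_dimension=word_embedding_model.get_word_embedding_dimension(),
    out_channels=256,        # illustrative
    kernel_sizes=[1, 3, 5],  # illustrative
)
pooling_model = models.Pooling(cnn.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, cnn, pooling_model])
```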