UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Fine tuning using ALBERT #263

Open · Akshayextreme opened 4 years ago

Akshayextreme commented 4 years ago

I have gone through older issues, and @nreimers has pointed out many times that the ALBERT model does not perform very well with sentence-transformers. I am perfectly fine with ~5-10 points lower performance than BERT, but after training ALBERT for 1 epoch on the AllNLI dataset I got awful results.

ALBERT-large-V1
2020-06-08 18:20:28 - Cosine-Similarity :     Pearson: 0.1973 Spearman: 0.2404
2020-06-08 18:20:28 - Manhattan-Distance:     Pearson: 0.2318 Spearman: 0.2411
2020-06-08 18:20:28 - Euclidean-Distance:     Pearson: 0.2313 Spearman: 0.2408
2020-06-08 18:20:28 - Dot-Product-Similarity: Pearson: 0.1437 Spearman: 0.1551

ALBERT-large-V2
2020-06-09 03:58:27 - Cosine-Similarity :     Pearson: 0.0722 Spearman: 0.0633
2020-06-09 03:58:27 - Manhattan-Distance:     Pearson: 0.1236 Spearman: 0.1089
2020-06-09 03:58:27 - Euclidean-Distance:     Pearson: 0.1237 Spearman: 0.1090
2020-06-09 03:58:27 - Dot-Product-Similarity: Pearson: 0.1047 Spearman: 0.0900

I am using all the default parameters from the training script:

python /content/sentence-transformers/examples/training_transformers/training_nli.py 'albert-large-v1'
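For reference, a minimal sketch of roughly what that script does, using the sentence-transformers training API of that era (models.Transformer, models.Pooling, losses.SoftmaxLoss); the toy data, batch size, and warmup steps here are illustrative, not the script's exact values:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# ALBERT as the word embedding model, with mean pooling on top
word_embedding_model = models.Transformer('albert-large-v1', max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# AllNLI is a 3-way classification task (entailment / neutral / contradiction)
train_examples = [InputExample(texts=['A man eats.', 'A person eats.'], label=0)]  # toy data
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```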

I checked the similarity_evaluation_results file after fine-tuning. For ALBERT-large-V2, all cosine_pearson values are NaN; for ALBERT-large-V1, after an initial increase to 0.24 the score stagnates.
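For anyone reproducing this, a quick way to inspect that evaluator output; the output path below is an assumption, and the column names follow the CSV that EmbeddingSimilarityEvaluator writes:

```python
import pandas as pd

# Path is illustrative; the evaluator writes its results CSV into the model's output directory
df = pd.read_csv('output/training_nli_albert-large-v2/similarity_evaluation_results.csv')

print(df[['epoch', 'steps', 'cosine_pearson', 'cosine_spearman']].tail())
print('NaN cosine_pearson rows:', df['cosine_pearson'].isna().sum())
```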

It takes ~8 hrs on Google Colab to fine-tune ALBERT on the AllNLI dataset. Any pointers for getting at least respectable results? Am I doing anything wrong here?

nreimers commented 4 years ago

Hi @Akshayextreme

I recommend starting with the smaller models.

Large models (regardless of which one) are known to fail to converge from time to time. This is not limited to fine-tuning sentence embeddings; it happens for other classification tasks as well. Training these models again with another random seed will then yield good results.

Out of 10 random seeds, sometimes 1 run fails, sometimes 4 runs fail to learn anything.

With the smaller models, the issue occurs less often.
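A minimal sketch of that seed-restart strategy; the seed list and the build_and_train_model / evaluate_dev_spearman stubs are placeholders for your own training run, not anything from the repo:

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed Python, NumPy, and PyTorch so each restart is reproducible
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

best_score, best_seed = float('-inf'), None
for seed in [1, 2, 3]:  # try a few seeds; large models occasionally diverge
    set_seed(seed)
    model = build_and_train_model()        # placeholder: your training run
    score = evaluate_dev_spearman(model)   # placeholder: dev-set Spearman correlation
    if not np.isnan(score) and score > best_score:
        best_score, best_seed = score, seed

print(f'Best seed: {best_seed} (Spearman {best_score:.4f})')
```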

Akshayextreme commented 4 years ago

Thanks a lot @nreimers for the great advice.

I tried ALBERT-base-V2 and the logs are below. This is after training on AllNLI for 1 epoch.

2020-06-09 13:36:53 - Cosine-Similarity :     Pearson: 0.7272 Spearman: 0.7489
2020-06-09 13:36:53 - Manhattan-Distance:     Pearson: 0.7409 Spearman: 0.7341
2020-06-09 13:36:53 - Euclidean-Distance:     Pearson: 0.7466 Spearman: 0.7403
2020-06-09 13:36:53 - Dot-Product-Similarity: Pearson: 0.6769 Spearman: 0.6548

Akshayextreme commented 4 years ago

Over the coming 2 days, I will run some experiments on training with the ALBERT model. I will post here if I find anything useful, and then close this thread.

Akshayextreme commented 4 years ago

ALBERT-base-V2 Fine-tuned on STSb for 4 epochs

2020-06-09 15:15:07 - Cosine-Similarity :     Pearson: 0.7880 Spearman: 0.7861
2020-06-09 15:15:07 - Manhattan-Distance:     Pearson: 0.7558 Spearman: 0.7592
2020-06-09 15:15:07 - Euclidean-Distance:     Pearson: 0.7634 Spearman: 0.7657
2020-06-09 15:15:07 - Dot-Product-Similarity: Pearson: 0.7393 Spearman: 0.7338
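For context, a minimal sketch of that second fine-tuning stage (continuing the NLI-trained model on STSb with a regression objective); the model path and toy data are illustrative, and losses.CosineSimilarityLoss is the loss the repo's STSb examples use:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses, InputExample

# Start from the NLI-trained ALBERT model saved earlier (path is illustrative)
model = SentenceTransformer('output/training_nli_albert-base-v2')

# STSb pairs carry a similarity score normalized to [0, 1]
train_examples = [
    InputExample(texts=['A plane is taking off.', 'An air plane is taking off.'], label=1.0),  # toy data
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=100)
```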

knok commented 3 years ago

Just FYI: [2101.10642v1] Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks. According to the paper, a CNN-based structure over the token embeddings, instead of average pooling, gives better performance with ALBERT.
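A minimal sketch of what such an architecture could look like with the models.CNN module that ships with sentence-transformers; the channel count and kernel sizes below are illustrative assumptions, not the paper's exact configuration:

```python
from sentence_transformers import SentenceTransformer, models

# ALBERT token embeddings -> CNN over the token sequence -> mean pooling
word_embedding_model = models.Transformer('albert-base-v2', max_seq_length=128)
cnn = models.CNN(
    in_word_embedding_dimension=word_embedding_model.get_word_embedding_dimension(),
    out_channels=256,        # illustrative
    kernel_sizes=[1, 3, 5],  # illustrative
)
pooling_model = models.Pooling(cnn.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, cnn, pooling_model])
```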