Open Akshayextreme opened 4 years ago
Hi @Akshayextreme
I can recommend starting with the smaller models.
Large models (regardless of which) are known to fail to converge from time to time. This is not limited to fine-tuning sentence embeddings; it happens for other classification tasks as well. Training these models again with a different random seed will then yield good results.
Out of 10 random seeds, sometimes 1 run fails, sometimes 4 runs fail to learn anything.
With the smaller models, the issue occurs less often.
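The retry-over-seeds advice above can be sketched as a small loop. `train_and_evaluate` here is a hypothetical stand-in for one fine-tuning run (a real script would set `torch.manual_seed(seed)` before `model.fit(...)` and return the dev-set Spearman score); only the retry pattern itself is the point.

```python
import random

def train_and_evaluate(seed):
    # Hypothetical stand-in for one fine-tuning run. A real version would
    # seed torch/numpy, call model.fit(...), and return the dev Spearman.
    random.seed(seed)
    return random.uniform(0.0, 0.8)

def retry_over_seeds(train_fn, seeds, min_score=0.5):
    """Re-run training with each seed until one run clears min_score."""
    for seed in seeds:
        score = train_fn(seed)
        if score >= min_score:
            return seed, score  # this run actually learned something
    raise RuntimeError("no seed converged; consider a smaller model")
```

The threshold and the seed list are up to you; the idea is just that a single diverged run is not informative on its own.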
Thanks a lot @nreimers for the great advice.
I tried ALBERT-base-V2, and below are the logs. This is after training on AllNLI for 1 epoch.
2020-06-09 13:36:53 - Cosine-Similarity : Pearson: 0.7272 Spearman: 0.7489
2020-06-09 13:36:53 - Manhattan-Distance: Pearson: 0.7409 Spearman: 0.7341
2020-06-09 13:36:53 - Euclidean-Distance: Pearson: 0.7466 Spearman: 0.7403
2020-06-09 13:36:53 - Dot-Product-Similarity: Pearson: 0.6769 Spearman: 0.6548
In the coming 2 days, I am doing some experiments on training with the ALBERT model. I will post here if I find anything useful, and then I will close this thread.
ALBERT-base-V2, fine-tuned on STSb for 4 epochs:
2020-06-09 15:15:07 - Cosine-Similarity : Pearson: 0.7880 Spearman: 0.7861
2020-06-09 15:15:07 - Manhattan-Distance: Pearson: 0.7558 Spearman: 0.7592
2020-06-09 15:15:07 - Euclidean-Distance: Pearson: 0.7634 Spearman: 0.7657
2020-06-09 15:15:07 - Dot-Product-Similarity: Pearson: 0.7393 Spearman: 0.7338
Just FYI: [2101.10642v1] Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks. According to the paper, a CNN-based pooling structure instead of average pooling gives better performance with ALBERT.
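To make the paper's idea concrete, here is a toy, pure-Python illustration (no real model, no trained weights): instead of averaging raw token embeddings, slide a single 1-D convolution filter of width `k` over the token sequence and mean-pool the resulting features. A real setup would use many filters and learned weights; this only shows the shape of the computation.

```python
def conv1d_tokens(embeddings, kernel, k):
    # embeddings: list of per-token vectors (all the same dimension)
    # kernel: flat filter weights of length k * dim (one toy filter)
    out = []
    for i in range(len(embeddings) - k + 1):
        # flatten a window of k consecutive token vectors
        window = [x for tok in embeddings[i:i + k] for x in tok]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out  # one feature value per window position

def cnn_pool(embeddings, kernel, k):
    # CNN-based pooling: convolve over tokens, then mean-pool the features
    feats = conv1d_tokens(embeddings, kernel, k)
    return sum(feats) / len(feats)
```

With `embeddings = [[1, 0], [0, 1], [1, 1]]`, `kernel = [1, 1, 1, 1]`, and `k = 2`, the two window features are 2 and 3, so `cnn_pool` returns 2.5.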
I have gone through older issues, and @nreimers has pointed out many times that the ALBERT models do not perform very well with sentence-transformers. I am absolutely fine with ~5-10 points lower performance than BERT, but after training ALBERT for 1 epoch on the AllNLI dataset I got awful results.
ALBERT-large-V1
2020-06-08 18:20:28 - Cosine-Similarity : Pearson: 0.1973 Spearman: 0.2404
2020-06-08 18:20:28 - Manhattan-Distance: Pearson: 0.2318 Spearman: 0.2411
2020-06-08 18:20:28 - Euclidean-Distance: Pearson: 0.2313 Spearman: 0.2408
2020-06-08 18:20:28 - Dot-Product-Similarity: Pearson: 0.1437 Spearman: 0.1551
ALBERT-large-V2
2020-06-09 03:58:27 - Cosine-Similarity : Pearson: 0.0722 Spearman: 0.0633
2020-06-09 03:58:27 - Manhattan-Distance: Pearson: 0.1236 Spearman: 0.1089
2020-06-09 03:58:27 - Euclidean-Distance: Pearson: 0.1237 Spearman: 0.1090
2020-06-09 03:58:27 - Dot-Product-Similarity: Pearson: 0.1047 Spearman: 0.0900
I am using all the default parameters from the training script.
python /content/sentence-transformers/examples/training_transformers/training_nli.py 'albert-large-v1'
I checked the `similarity_evaluation_results` file after fine-tuning. For ALBERT-large-V2, all values for `cosine_pearson` are `nan`, and for ALBERT-large-V1, after an initial increase to 0.24, the value stagnates. It takes ~8 hrs on Google Colab to fine-tune ALBERT on the AllNLI dataset. Any pointers to get at least respectable results? Am I doing anything wrong here?