UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

STS Albert Base low performance #269

Open Deseram opened 4 years ago

Deseram commented 4 years ago

Hi,

I've been implementing your paper in Keras, using ALBERT instead of BERT. I'm fairly certain I've followed almost exactly the setup presented in the paper and in the code in this repo, but I'm having difficulty reproducing STS-Benchmark results comparable to BERT's.

It would be great if you could point out any issues with the configuration.

Model: TensorFlow Hub Keras layer, ALBERT v2

Setup Summary:

Input -> ALBERT -> token embeddings -> dropout, rate=0.1 (also tried without) -> global average pooling with mask (input attention mask) -> sentence embedding -> cosine similarity -> MSE

Metric -> Pearson correlation
Optimizer -> ALBERT optimizer with 10% warmup and a learning rate of 2e-5
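For concreteness, here is a minimal Keras sketch of the pipeline described above. It is illustrative only: the TF Hub URL and the `(pooled_output, sequence_output)` call signature are assumptions based on the `albert_en_base/2` SavedModel, and the toy functional model stands in for the real STSb training loop.

```python
# Minimal sketch of the Siamese setup described above. Assumption: the hub
# layer is albert_en_base/2, which returns (pooled_output, sequence_output)
# when called on [input_word_ids, input_mask, segment_ids].
import tensorflow as tf
import tensorflow_hub as hub

MAX_LEN = 128
albert = hub.KerasLayer("https://tfhub.dev/tensorflow/albert_en_base/2",
                        trainable=True)

def build_encoder():
    """One sentence -> masked mean of ALBERT token embeddings."""
    ids = tf.keras.Input((MAX_LEN,), dtype=tf.int32)
    mask = tf.keras.Input((MAX_LEN,), dtype=tf.int32)
    seg = tf.keras.Input((MAX_LEN,), dtype=tf.int32)
    _, seq = albert([ids, mask, seg])            # (batch, MAX_LEN, hidden)
    seq = tf.keras.layers.Dropout(0.1)(seq)
    m = tf.cast(mask, tf.float32)[:, :, None]    # exclude padding tokens
    emb = tf.reduce_sum(seq * m, 1) / tf.reduce_sum(m, 1)
    return tf.keras.Model([ids, mask, seg], emb)

encoder = build_encoder()  # one encoder instance -> shared weights for both sides
in_a = [tf.keras.Input((MAX_LEN,), dtype=tf.int32) for _ in range(3)]
in_b = [tf.keras.Input((MAX_LEN,), dtype=tf.int32) for _ in range(3)]
u, v = encoder(in_a), encoder(in_b)
cos = tf.reduce_sum(tf.nn.l2_normalize(u, 1) * tf.nn.l2_normalize(v, 1),
                    axis=1, keepdims=True)       # cosine similarity in [-1, 1]
model = tf.keras.Model(in_a + in_b, cos)
# MSE regression against gold STSb scores rescaled to [0, 1].
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5), loss="mse")
```

The key detail is that the attention mask enters the mean pooling, so padded positions do not dilute the sentence embedding.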

Epoch 1/20
359/359 [==============================] - 20s 55ms/step - loss: 0.0567 - pearson_correlation_metric_fn: 0.4561 - val_loss: 0.0546 - val_pearson_correlation_metric_fn: 0.5484
Epoch 2/20
359/359 [==============================] - 15s 41ms/step - loss: 0.0700 - pearson_correlation_metric_fn: 0.4495 - val_loss: 0.3668 - val_pearson_correlation_metric_fn: 0.0918
Epoch 3/20
359/359 [==============================] - 15s 41ms/step - loss: 0.1521 - pearson_correlation_metric_fn: 0.1316 - val_loss: 0.0979 - val_pearson_correlation_metric_fn: 0.2823
Epoch 4/20
359/359 [==============================] - 15s 42ms/step - loss: 0.0699 - pearson_correlation_metric_fn: 0.3001 - val_loss: 0.0828 - val_pearson_correlation_metric_fn: 0.1667
Epoch 5/20
359/359 [==============================] - 15s 42ms/step - loss: 0.0620 - pearson_correlation_metric_fn: 0.3538 - val_loss: 0.0736 - val_pearson_correlation_metric_fn: 0.2661
Epoch 6/20
359/359 [==============================] - 15s 41ms/step - loss: 0.0553 - pearson_correlation_metric_fn: 0.3983 - val_loss: 0.0657 - val_pearson_correlation_metric_fn: 0.3763
Epoch 7/20
359/359 [==============================] - 15s 42ms/step - loss: 0.0522 - pearson_correlation_metric_fn: 0.4136 - val_loss: 0.0634 - val_pearson_correlation_metric_fn: 0.3844
Epoch 8/20
359/359 [==============================] - 15s 42ms/step - loss: 0.0453 - pearson_correlation_metric_fn: 0.4614 - val_loss: 0.0635 - val_pearson_correlation_metric_fn: 0.3737
Epoch 9/20
359/359 [==============================] - 15s 42ms/step - loss: 0.0421 - pearson_correlation_metric_fn: 0.4875 - val_loss: 0.0759 - val_pearson_correlation_metric_fn: 0.3548
Epoch 10/20
359/359 [==============================] - 15s 41ms/step - loss: 0.0385 - pearson_correlation_metric_fn: 0.4934 - val_loss: 0.0611 - val_pearson_correlation_metric_fn: 0.3495
Epoch 11/20
359/359 [==============================] - 15s 41ms/step - loss: 0.0301 - pearson_correlation_metric_fn: 0.5850 - val_loss: 0.0677 - val_pearson_correlation_metric_fn: 0.3871
Epoch 12/20
359/359 [==============================] - 15s 42ms/step - loss: 0.0245 - pearson_correlation_metric_fn: 0.6180 - val_loss: 0.0685 - val_pearson_correlation_metric_fn: 0.3952
Epoch 13/20
359/359 [==============================] - 15s 41ms/step - loss: 0.0178 - pearson_correlation_metric_fn: 0.7040 - val_loss: 0.0651 - val_pearson_correlation_metric_fn: 0.4220
Epoch 14/20
359/359 [==============================] - 15s 43ms/step - loss: 0.0125 - pearson_correlation_metric_fn: 0.7476 - val_loss: 0.0636 - val_pearson_correlation_metric_fn: 0.4597
Epoch 15/20
359/359 [==============================] - 15s 41ms/step - loss: 0.0093 - pearson_correlation_metric_fn: 0.7813 - val_loss: 0.0635 - val_pearson_correlation_metric_fn: 0.4489
Epoch 16/20
359/359 [==============================] - 15s 42ms/step - loss: 0.0071 - pearson_correlation_metric_fn: 0.8102 - val_loss: 0.0638 - val_pearson_correlation_metric_fn: 0.4570
Epoch 17/20
359/359 [==============================] - 15s 42ms/step - loss: 0.0057 - pearson_correlation_metric_fn: 0.8283 - val_loss: 0.0658 - val_pearson_correlation_metric_fn: 0.4462
Epoch 18/20
359/359 [==============================] - 15s 42ms/step - loss: 0.0049 - pearson_correlation_metric_fn: 0.8370 - val_loss: 0.0658 - val_pearson_correlation_metric_fn: 0.4435
Epoch 19/20
359/359 [==============================] - 15s 41ms/step - loss: 0.0046 - pearson_correlation_metric_fn: 0.8437 - val_loss: 0.0633 - val_pearson_correlation_metric_fn: 0.4435
Epoch 20/20
359/359 [==============================] - 15s 41ms/step - loss: 0.0049 - pearson_correlation_metric_fn: 0.8524 - val_loss: 0.0646 - val_pearson_correlation_metric_fn: 0.4409


I will attempt to use the Hugging Face model to see if there's any other issue. Thanks again for any advice you can give.

nreimers commented 4 years ago

So far it sounds correct.

I also did not achieve the best results with ALBERT. So yes, please try BERT (or RoBERTa); in my experience they give the best sentence representations.

I can also recommend first fine-tuning on NLI, and then tuning on the STS data.
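For reference, the two-stage recipe looks roughly like this with this repo's API. This is a sketch: `nli_examples` and `sts_examples` are toy stand-ins for the real NLI and STSb training sets, and the epoch/warmup numbers are illustrative.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# Transformer backbone + mean pooling (swap in "bert-base-uncased" or a
# RoBERTa checkpoint to compare backbones).
word_emb = models.Transformer("albert-base-v2")
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# Toy stand-ins for the real datasets.
nli_examples = [InputExample(texts=["A man eats.", "Someone is eating."], label=0)]
sts_examples = [InputExample(texts=["A man eats.", "A man is eating."], label=0.9)]

# Stage 1: fine-tune on NLI with the softmax classification loss.
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=16)
nli_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3)  # entailment / neutral / contradiction
model.fit(train_objectives=[(nli_loader, nli_loss)],
          epochs=1, warmup_steps=1000)

# Stage 2: continue on STSb with the cosine-similarity regression loss
# (gold scores normalized to [0, 1]).
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=16)
sts_loss = losses.CosineSimilarityLoss(model=model)
model.fit(train_objectives=[(sts_loader, sts_loss)],
          epochs=4, warmup_steps=100)
```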

Deseram commented 4 years ago

Wow, I appreciate the very quick response. When you tried ALBERT, did you get similar results or better? I'm developing a proof of concept using ALBERT since it's a lighter model than BERT, as the name suggests, and it reportedly performs slightly better than BERT. Do you recall the results you achieved with ALBERT?

nreimers commented 4 years ago

Hi @Deseram, you can search the issues here; I think there are some concrete numbers posted.

I don't recall the exact numbers, but when trained on NLI, I got worse results than with BERT. I think the performance drop was quite large (~10 points), but I might be remembering wrongly.

If you need a light model, I can recommend DistilBERT, which performs nearly on par with BERT for sentence embeddings but has only half the layers.
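For example, one of the released DistilBERT checkpoints can be used directly; the model name below is one of the NLI+STSb fine-tuned checkpoints published at the time, and the two sentences are just a quick sanity check:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Pretrained DistilBERT sentence-embedding model (NLI + STSb fine-tuned).
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
emb = model.encode(["A man is eating food.", "A man is having a meal."])
cos = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(f"cosine similarity: {cos:.3f}")
```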