Unbabel / OpenKiwi

Open-Source Machine Translation Quality Estimation in PyTorch
https://unbabel.github.io/OpenKiwi/
GNU Affero General Public License v3.0

Poor results when training Estimator with parallel data and TER scores #92

Open | lluisg opened this issue 3 years ago

lluisg commented 3 years ago

Hi!

We are trying to train an EN-FR sentence-level QE model using a Predictor-Estimator model with parallel data.

We are using OpenKiwi 0.1.3 to train it.

The procedure was as follows:

  1. Train the Predictor using parallel data (EN-FR)
  2. Train the Estimator on top of the Predictor from step 1, using the following data (as discussed in https://github.com/Unbabel/OpenKiwi/issues/46); a sketch of one way to produce the scores in (c) follows this list:
     a. the English source sentences
     b. the FR sentences translated by a pretrained MT model
     c. the TER score of each translated FR sentence
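
For (c), the scores file is plain text with one score per line, aligned line by line with the target file. Below is a minimal sketch of producing it with sacrebleu's TER implementation; using sacrebleu here is an assumption made for illustration, and only the file names come from the configs further down.

# compute_ter.py -- illustrative sketch; sacrebleu is an assumed dependency
from sacrebleu.metrics import TER

ter = TER()

# MT hypotheses (what the Estimator sees as train-enfr.pred)
with open("custom_data/train-enfr.pred") as f:
    hyps = [line.strip() for line in f]

# Reference FR translations from the parallel corpus
with open("custom_data/train-enfr.tgt") as f:
    refs = [line.strip() for line in f]

assert len(hyps) == len(refs), "hypothesis and reference files must be line-aligned"

with open("custom_data/train-enfr.ter", "w") as out:
    for hyp, ref in zip(hyps, refs):
        score = ter.sentence_score(hyp, [ref]).score
        # sacrebleu may report TER on a 0-100 scale; OpenKiwi expects
        # sentence scores in [0, 1], so rescale if your version does.
        out.write(f"{score / 100:.4f}\n")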

The results obtained were a Pearson correlation of 0.32 and a Spearman correlation of 0.36, which are below the 0.5018 and 0.5566 reported in the OpenKiwi paper (https://www.aclweb.org/anthology/P19-3020.pdf).

My question is: is it possible to obtain a similar result using only parallel data? If so, is there something wrong with our procedure?

The configuration files used to train are the following:

#predictor_config-enfr.yml
checkpoint-early-stop-patience: 0
checkpoint-keep-only-best: 2
checkpoint-save: true
checkpoint-validation-steps: 50000
dropout-pred: 0.5
embedding-sizes: 200
epochs: 5
experiment-name: Pretrain Predictor
gpu-id: 0
hidden-pred: 400
learning-rate: 2e-3
learning-rate-decay: 0.6
learning-rate-decay-start: 2
log-interval: 100
model: predictor
optimizer: adam
out-embeddings-size: 200
output-dir: runs/predictor-enfr
predict-inverse: false
rnn-layers-pred: 2
source-embeddings-size: 200
source-max-length: 50
source-min-length: 1
source-vocab-min-frequency: 1
source-vocab-size: 45000
split: 0.9
target-embeddings-size: 200
target-max-length: 50
target-min-length: 1
target-vocab-min-frequency: 1
target-vocab-size: 45000
train-batch-size: 16
train-source: custom_data/train-enfr.src
train-target: custom_data/train-enfr.tgt
valid-batch-size: 16
valid-source: custom_data/dev-enfr.src
valid-target: custom_data/dev-enfr.tgt
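
With the predictor config above saved to disk, training is launched through OpenKiwi itself. A minimal sketch, assuming the top-level kiwi.train(config_path) entry point of the 0.1.x Python API (the CLI form `kiwi train --config predictor_config-enfr.yml` should be equivalent):

# train_predictor.py -- sketch; kiwi.train(config_path) is assumed to be the
# 0.1.x Python entry point, mirroring `kiwi train --config <file>` on the CLI.
import kiwi

kiwi.train("predictor_config-enfr.yml")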
#estimator_config-enfr.yml
binary-level: false
checkpoint-early-stop-patience: 0
checkpoint-keep-only-best: 2
checkpoint-save: true
checkpoint-validation-steps: 0
dropout-est: 0.0
epochs: 5
experiment-name: Train Estimator
gpu-id: 0
hidden-est: 125
learning-rate: 2e-3
load-pred-target: runs/predictor-enfr/best_model.torch
log-interval: 100
mlp-est: true
model: estimator
output-dir: runs/estimator-enfr
predict-gaps: false
predict-source: false
predict-target: false
rnn-layers-est: 1
sentence-level: true
sentence-ll: false
source-bad-weight: 2.5
target-bad-weight: 2.5
token-level: false
train-batch-size: 16
train-sentence-scores: custom_data/train-enfr.ter
train-source: custom_data/train-enfr.src
train-target: custom_data/train-enfr.pred
valid-batch-size: 16
valid-sentence-scores: custom_data/dev-enfr.ter
valid-source: custom_data/dev-enfr.src
valid-target: custom_data/dev-enfr.pred
wmt18-format: false
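
Since OpenKiwi pairs the source, MT output, and sentence-score files line by line, it is worth checking that they all have the same number of lines before training the Estimator. A quick sketch using the training file names from the config above:

# check_alignment.py -- sanity check that the Estimator inputs are line-aligned
paths = [
    "custom_data/train-enfr.src",
    "custom_data/train-enfr.pred",
    "custom_data/train-enfr.ter",
]

counts = {}
for path in paths:
    with open(path) as f:
        counts[path] = sum(1 for _ in f)

print(counts)
assert len(set(counts.values())) == 1, "files are not line-aligned"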
#predictions_config-enfr.yml
gpu-id: 0
load-model: runs/estimator-enfr/best_model.torch
model: estimator
output-dir: predictions/predest-enfr
seed: 42
test-source: custom_data/test-enfr.src
test-target: custom_data/test-enfr.pred
valid-batch-size: 64
wmt18-format: false
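
After running prediction with the config above, the Pearson and Spearman correlations reported earlier can be computed with scipy against the gold TER scores of the test set. A minimal sketch; the prediction file name and the gold file custom_data/test-enfr.ter are placeholders, not files from the original setup.

# evaluate_sentence_scores.py -- sketch; file paths are illustrative placeholders
from scipy.stats import pearsonr, spearmanr

def read_scores(path):
    with open(path) as f:
        return [float(line.strip()) for line in f]

# Predicted sentence scores written by the prediction run (check the contents
# of predictions/predest-enfr for the actual output file name) and the gold
# TER scores for the test set.
pred = read_scores("predictions/predest-enfr/sentence_scores")
gold = read_scores("custom_data/test-enfr.ter")

assert len(pred) == len(gold), "prediction and gold files must be line-aligned"
print("Pearson: ", pearsonr(pred, gold)[0])
print("Spearman:", spearmanr(pred, gold)[0])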
captainvera commented 3 years ago

Hello @lluisg, sorry for the (very) late response!

Everything seems alright in your settings and proposed setup. You could play a bit more with the hyperparameters, but nothing jumps out at me as obviously wrong.

As for your original question, "is it possible to obtain a similar result using only parallel data?": I do not know! It is definitely an interesting research question!

Traditionally, the community has believed that the multi-task nature of the standard QE setup helps with both tasks, since HTER and the word-level tags are inherently correlated. But who knows, maybe it is possible to get equally good results with just parallel data? I would be interested in hearing about your results!

P.S. Is there any specific reason why you are using OpenKiwi 0.1.3 instead of OpenKiwi >2.0?