Unbabel / OpenKiwi

Open-Source Machine Translation Quality Estimation in PyTorch
https://unbabel.github.io/OpenKiwi/
GNU Affero General Public License v3.0

Word level QE reproduction #40

Closed · zouharvi closed this issue 5 years ago

zouharvi commented 5 years ago

As mentioned in the email we sent to all of the paper authors, we (@zouharvi, @obo) trained the predictor and then the estimator on our custom data, but the results were almost random.

Describe the bug

While trying to find the problem, we tried to reproduce the WMT result using your pre-trained models, as described in the documentation. There must be some systematic mistake we're making, because the pre-trained estimator produces almost random results.

To Reproduce

Run the following in an empty directory. The script downloads the model and then tries to estimate the quality of the first sentence from the WMT18 training set.

# download and unpack the pre-trained en-de NMT models (release 0.1.1)
wget https://github.com/Unbabel/OpenKiwi/releases/download/0.1.1/en_de.nmt_models.zip
unzip -n en_de.nmt_models.zip

# first sentence pair (source and MT output) from the WMT18 training data
mkdir output input
echo "the part of the regular expression within the forward slashes defines the pattern ." > ./input/test.src
echo "der Teil des regulären Ausdrucks innerhalb der umgekehrten Schrägstrich definiert das Muster ." > ./input/test.trg

# run the pre-trained target estimator on this single sentence pair (no GPU)
kiwi predict \
    --config ./en_de.nmt_models/estimator/target_1/predict.yaml \
    --load-model ./en_de.nmt_models/estimator/target_1/model.torch \
    --experiment-name "Single line test" \
    --output-dir output \
    --gpu-id -1 \
    --test-source ./input/test.src \
    --test-target ./input/test.trg

# word-level predictions for the target sentence
cat output/tags

Expected result

OK OK OK OK OK OK OK OK OK OK OK OK OK BAD OK OK OK BAD OK OK OK OK OK OK OK OK OK

Of course, the gold annotation contains the extra gap tags, but even with those set aside, most of the sentence is tagged OK, which seems to contradict the model output below (lots of values close to zero).
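For reference, the WMT18 target-side gold tags interleave gap and word tags (gap, word, gap, ..., word, gap), so the 27 tags above correspond to 13 words. Assuming that interleaved layout, the word tags can be pulled out with something like the following (gold.tags is just a placeholder filename):

# keep only the word tags (the even positions) from an interleaved gap/word tag line
awk '{ for (i = 2; i <= NF; i += 2) printf "%s%s", $i, (i + 2 <= NF ? OFS : ORS) }' gold.tags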

Actual result

0.04104529693722725 0.013736072927713394 0.011828877963125706 0.014644734561443329 0.022598857060074806 0.10979203879833221 0.8875276446342468 0.711827278137207 0.9585599303245544 0.20660772919654846 0.22217749059200287  0.1782749891281128 0.012791415676474571


captainvera commented 5 years ago

Hey @zouharvi, thanks for your interest in OpenKiwi and the detailed issue!

I believe it is not an error that you're making but a misinterpretation of the results. What we model is the probability of a word being BAD and not the probability of a word being OK. With that in mind, the results you're getting are completely expected 🙂

See below:

Gold tags (gaps removed)

 OK  OK  OK  OK  OK  OK  BAD  OK  BAD  OK  OK  OK  OK 

Your results

 OK  OK  OK  OK  OK  OK BAD BAD BAD  OK  OK  OK  OK

The model is actually getting only one tag wrong, so it's at around 92% accuracy (12/13), not bad at all!
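In case it's useful, a quick way to turn the probabilities in output/tags into discrete tags is to threshold P(BAD), e.g. at 0.5 (the threshold is a post-processing choice on your side, not something kiwi predict emits):

# map each P(BAD) value in output/tags to a discrete tag, using a 0.5 threshold
awk '{ for (i = 1; i <= NF; i++) printf "%s%s", ($i > 0.5 ? "BAD" : "OK"), (i < NF ? OFS : ORS) }' output/tags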

zouharvi commented 5 years ago

Thank you for your reply,

What we model is the probability of a word being BAD

This is somewhat counterintuitive compared to what we expected from QuEst++ and deepQuest, but we're glad it has been resolved. The data now makes sense. 🙂

We'll rerun our experiments on the WMT17 en_de data and on our custom cs_de data and let you know the results (cs_de didn't work out last time).

captainvera commented 5 years ago

Great, glad I could help you guys.

I'll close the issue for now, but feel free to reopen it if you have any doubts about your results with the cs_de data!