Closed zouharvi closed 5 years ago
Hey @zouharvi, thanks for your interest in OpenKiwi and the detailed issue!
I believe it is not an error that you're making but a misinterpretation of the results.
What we model is the probability of a word being BAD, not the probability of a word being OK. With that in mind, the results you're getting are completely expected 🙂 See below:
Golden Tags (removed gaps)
OK OK OK OK OK OK BAD OK BAD OK OK OK OK
Your results
OK OK OK OK OK OK BAD BAD BAD OK OK OK OK
The model is actually getting only one tag wrong, so its accuracy is about 92% (12/13). Not too bad!
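To make the interpretation concrete, here is a minimal sketch of turning per-word P(BAD) scores into OK/BAD tags and scoring them against the gold tags above. The probability values and the 0.5 threshold are illustrative assumptions, not actual OpenKiwi output:

```python
# Convert per-word P(BAD) scores into tags and compare against gold tags.
# NOTE: the probabilities below are made up for illustration; a real run
# would use the scores produced by the pre-trained estimator.

def probs_to_tags(bad_probs, threshold=0.5):
    """Scores are P(BAD), so HIGH values mean BAD (near-zero means OK)."""
    return ["BAD" if p > threshold else "OK" for p in bad_probs]

def accuracy(pred, gold):
    """Fraction of positions where predicted and gold tags agree."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

# Hypothetical model scores matching the predicted tags in this thread.
bad_probs = [0.02, 0.01, 0.03, 0.04, 0.02, 0.05,
             0.91, 0.88, 0.77, 0.03, 0.02, 0.01, 0.04]
gold = "OK OK OK OK OK OK BAD OK BAD OK OK OK OK".split()

pred = probs_to_tags(bad_probs)
print(pred)                            # one spurious BAD at position 8
print(round(accuracy(pred, gold), 3))  # 12/13 ≈ 0.923
```

The key point is the direction of the scores: "lots of almost zeroes" means the model considers almost every word OK, not BAD.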
Thank you for your reply,
What we model is the probability of a word being BAD
This is somewhat counterintuitive given what we expected from QuEst++ and deepQuest, but we're glad it has been resolved. The data now makes sense. 🙂
We'll rerun our experiments on WMT17 en_de and on our custom cs_de data and let you know the results (cs_de didn't work out last time).
Great, glad I could help you guys.
I'll close the issue for now, but feel free to reopen it if you have any doubts about your results with the cs_de data!
As per the email we sent to all of the paper authors, we (@zouharvi, @obo) trained the predictor and then the estimator on our custom data, but the results were almost random.
Describe the bug
While trying to find the problem, we tried to reproduce the WMT results using your pre-trained models, as mentioned in the documentation. There must be some systematic mistake we're making, because the pre-trained estimator produces almost random results.
To Reproduce
Run in an empty directory. The script downloads the model and then tries to estimate the quality of the first sentence from the WMT18 training dataset.
Expected result
Of course, the gold annotation contains the extra gap tags, but despite that, most of the sentence is classified as OK, which contradicts the model output (lots of near-zero values).
Actual result
Environment (please complete the following information):