Unbabel / OpenKiwi

Open-Source Machine Translation Quality Estimation in PyTorch
https://unbabel.github.io/OpenKiwi/
GNU Affero General Public License v3.0

array must not contain infs or NaNs #47

Closed · BigBorg closed this 4 years ago

BigBorg commented 4 years ago

I am training the estimator on an en-zh dataset. At first everything runs well, but after epoch 8 it says "array must not contain infs or NaNs" and exits. I don't know why this happens.

captainvera commented 4 years ago

Hello @BigBorg!

Can you provide a reproducible example using public data? Ideally, a config and a small dataset that trigger this error would be amazing.

I would love to help, but it's hard to diagnose if I can't reproduce the issue :)

BigBorg commented 4 years ago

Sorry, the dataset is private. I might try a public dataset to see if this happens again.

trenous commented 4 years ago

If you share the full stack trace and the config file you used, we might also be able to help.

BigBorg commented 4 years ago

Sending code out of the company I work for is restricted. Turning off sentence-ll solves the problem. Is it possible that the error is 0 and then becomes inf after taking the log? Also, the sentence scores produced by the model can be larger than 1; how do I interpret them?
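
The inf is easy to reproduce in isolation, for example (a generic PyTorch sketch, not OpenKiwi's actual loss code):

```python
import torch

# log of a zero likelihood is -inf, and -inf * 0 is NaN, so either value
# can later surface as "array must not contain infs or NaNs".
zero = torch.tensor(0.0)
print(torch.log(zero))        # tensor(-inf)
print(torch.log(zero) * 0.0)  # tensor(nan)
```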

captainvera commented 4 years ago

Understandable. It could indeed be the case, but it seems weird that we never encountered this error while training with our own data or with publicly available datasets... If the reason becomes clearer, please let us know.

On the second question: sentence scores are an attempt to predict TER (Translation Error Rate), i.e. the distance that separates the current translation from a "perfect" translation, with 1 meaning the whole sentence needs to be changed and 0 meaning the sentence is already correct.

The model shouldn't produce scores above 1. What kind of scores are you seeing? Are you sure all the TER values in your training data lie in the range [0, 1]?
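
A quick way to check is something like this, assuming the usual one-score-per-line scores file (the path here is just a placeholder):

```python
# Flag any sentence-level score outside [0, 1] in a one-float-per-line file.
with open("train.hter") as f:  # placeholder path
    for i, line in enumerate(f, start=1):
        score = float(line)
        if not 0.0 <= score <= 1.0:
            print(f"line {i}: score {score} is outside [0, 1]")
```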

BigBorg commented 4 years ago

Thanks for reminding me to inspect the training data. It does contain HTER values larger than 1. I don't know why tercom is producing such results. I might try the Python package pyter to generate HTER instead.
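
What I had in mind with pyter is roughly this (based on its README; the exact signature may vary by version):

```python
import pyter  # pip install pyter

# HTER of a machine translation against its post-edit: pyter takes token
# lists and divides the edit distance by the reference (post-edit) length.
mt = "the cat cat sat on on the mat".split()
pe = "the cat sat on the mat".split()
print(pyter.ter(mt, pe))  # 2 deletions / 6 tokens ~ 0.33
```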

trenous commented 4 years ago

Tercom computes HTER as edit_distance(mt, pe) / len(pe).

Thus if the MT is longer than the post-edit, you can get an HTER greater than 1 (this is typically a case of MT repetitions / hallucinations). In the QE shared task, the scores are truncated to be at most 1.
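
To make that concrete, here is a small sketch with invented numbers:

```python
def hter(edit_distance: int, pe_length: int) -> float:
    # Tercom-style HTER: edits needed to turn the MT into the
    # post-edit, divided by the post-edit length.
    return edit_distance / pe_length

# A hallucinating MT: 12 output tokens against a 5-token post-edit needs
# at least 7 deletions, so the edit distance can exceed len(pe).
raw = hter(edit_distance=10, pe_length=5)
capped = min(raw, 1.0)  # the shared-task truncation
print(raw, capped)      # 2.0 1.0
```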

The sentence scores output by the model can be greater than 1 if you turn off sentence-ll: as you can see in the code, the sentence-score prediction module has no squashing function in its last layer. If you enable sentence-ll, the model instead outputs a Gaussian distribution truncated to the interval [0, 1]. In that case, the model's score is the mean of that distribution, which always lies within the interval itself.
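
If you want to see the truncation numerically, here is an illustrative sketch with scipy (made-up parameters, not the module's actual predictive head):

```python
from scipy.stats import truncnorm

# Gaussian with unconstrained mean 1.4, truncated to [0, 1]: the mean of
# the truncated distribution is pulled back inside the interval.
mu, sigma = 1.4, 0.5
a, b = (0.0 - mu) / sigma, (1.0 - mu) / sigma  # bounds in standard units
print(truncnorm.mean(a, b, loc=mu, scale=sigma))  # ~0.73, inside [0, 1]
```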