Unbabel / OpenKiwi

Open-Source Machine Translation Quality Estimation in PyTorch
https://unbabel.github.io/OpenKiwi/
GNU Affero General Public License v3.0

`inf` value for RMSE metric #34

Closed edufierro closed 5 years ago

edufierro commented 5 years ago

Describe the bug

Hi!

I'm training an estimator and I'm getting the following error:

Traceback (most recent call last):
  File "/path/to/venv/bin/kiwi", line 10, in <module>
    sys.exit(main())
  File "/path/to/venv/lib/python3.5/site-packages/kiwi/__main__.py", line 22, in main
    return kiwi.cli.main.cli()
  File "/path/to/venv/lib/python3.5/site-packages/kiwi/cli/main.py", line 71, in cli
    train.main(extra_args)
  File "/path/to/venv/lib/python3.5/site-packages/kiwi/cli/pipelines/train.py", line 141, in main
    train.train_from_options(options)
  File "/path/to/venv/lib/python3.5/site-packages/kiwi/lib/train.py", line 123, in train_from_options
    trainer = run(ModelClass, output_dir, pipeline_options, model_options)
  File "/path/to/venv/lib/python3.5/site-packages/kiwi/lib/train.py", line 204, in run
    trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)
  File "/path/to/venv/lib/python3.5/site-packages/kiwi/trainers/trainer.py", line 75, in run
    self.train_epoch(train_iterator, valid_iterator)
  File "/path/to/venv/lib/python3.5/site-packages/kiwi/trainers/trainer.py", line 97, in train_epoch
    self.stats.log(step=self._step)
  File "/path/to/venv/lib/python3.5/site-packages/kiwi/metrics/stats.py", line 167, in log
    stats_summary.log()
  File "/path/to/venv/lib/python3.5/site-packages/kiwi/metrics/stats.py", line 63, in log
    tracking_logger.log_metric(k, v)
  File "/path/to/venv/lib/python3.5/site-packages/kiwi/loggers.py", line 181, in log_metric
    mlflow.log_metric(key, value)
  File "//path/to/venv/lib/python3.5/site-packages/mlflow/tracking/fluent.py", line 199, in log_metric
    MlflowClient().log_metric(run_id, key, value, int(time.time()))
  File "/path/to/venv/lib/python3.5/site-packages/mlflow/tracking/client.py", line 170, in log_metric
    _validate_metric(key, value, timestamp)
  File "/path/to/venv/lib/python3.5/site-packages/mlflow/utils/validation.py", line 67, in _validate_metric
    INVALID_PARAMETER_VALUE)
mlflow.exceptions.MlflowException: Got invalid value inf for metric 'RMSE' (timestamp=1561492769). Please specify value as a valid double (64-bit floating point)
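
The traceback shows the run aborting because a non-finite value reached the metric logger. One way to keep training alive while investigating is to skip such values before they are logged. The sketch below is only an illustration: `log_metric_if_finite` is a hypothetical helper, not part of OpenKiwi or MLflow, and `log_fn` stands in for whatever logging call is in use.

```python
import math


def log_metric_if_finite(log_fn, key, value):
    """Call log_fn(key, value) only when value is finite.

    Hypothetical helper: log_fn is a stand-in for the real logging call.
    Returns True if the value was logged, False if it was skipped.
    """
    if not math.isfinite(value):
        return False
    log_fn(key, value)
    return True


# Usage with a simple in-memory recorder instead of a real tracking backend.
logged = []


def record(key, value):
    logged.append((key, value))


log_metric_if_finite(record, "RMSE", 0.42)          # logged
log_metric_if_finite(record, "RMSE", float("inf"))  # skipped
```

Skipping the value only hides the symptom; the non-finite number still has to be traced back to where it was produced.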

This comes after a warning:

RuntimeWarning: invalid value encountered in subtract
  xm, ym = x - mx, y - my
/path/to/venv/lib/python3.5/site-packages/scipy/stats/stats.py:3036: RuntimeWarning: invalid value encountered in reduce
  r_num = np.add.reduce(xm * ym)
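
The line quoted in the warning, `xm, ym = x - mx, y - my`, appears to be a mean-centering step, and NumPy reports "invalid value" when an operation such as `inf - inf` yields NaN. The plain-Python sketch below (with a hypothetical `center` helper, not SciPy's code) shows how a single infinite entry is enough to produce that NaN, which suggests the predictions or targets already contained non-finite values at this point.

```python
import math


def center(values):
    """Subtract the mean from every value, as in a correlation computation."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# One infinite entry makes the mean infinite, and inf - inf is NaN.
centered = center([0.5, float("inf")])
has_nan = any(math.isnan(v) for v in centered)
```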

It happened during training, after 16 epochs and after 500 batches of epoch 16.

To Reproduce

Both YAML files and the data are not public. Let me know and I can share them with you.

Expected behavior

Training finishes for the 20 epochs I specified.

Environment (please complete the following information):

captainvera commented 5 years ago

Hello @edufierro,

I have been trying to reproduce your error to no avail. Can you try to reproduce this on your system and let me know if it happens again?

This error is happening in the RMSE metric code:

import math


class RMSEMetric(Metric):
    def __init__(self, **kwargs):
        super().__init__(metric_name='RMSE', **kwargs)

    def update(self, batch, model_out, **kwargs):
        predictions = self.get_predictions_flat(model_out, batch)
        target = self.get_target_flat(batch)
        # Running sum of squared errors; one non-finite batch makes it non-finite.
        self.squared_error += ((predictions - target) ** 2).sum().item()
        self.tokens += self.get_tokens(batch)

    def summarize(self):
        # math.sqrt(inf) is inf, so an infinite sum yields an infinite RMSE.
        rmse = math.sqrt(self.squared_error / self.tokens)
        summary = {self.metric_name: rmse}
        return self._prefix_keys(summary)

For the RMSE to be infinite, the quotient self.squared_error / self.tokens must be infinite. Since self.tokens is accumulated from get_tokens and is always a whole number of tokens, it cannot be a tiny fraction that inflates the quotient, so the infinity must come from self.squared_error, most likely from a numerical error when computing the squared error.
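
The reasoning above can be checked with a stripped-down version of the `summarize` arithmetic. This is a minimal sketch using plain Python floats rather than the real metric class; the standalone `rmse` function and the sample numbers are assumptions made for illustration.

```python
import math


def rmse(squared_error, tokens):
    """Same arithmetic as summarize(): sqrt of the mean squared error."""
    return math.sqrt(squared_error / tokens)


# Finite sum and a positive integer token count: finite result.
finite_case = rmse(8.0, 2)

# An infinite running sum propagates straight through the square root.
running_sum = 0.0
running_sum += 8.0
running_sum += float("inf")   # a single bad batch
running_sum += 8.0            # later batches cannot undo it
inf_case = rmse(running_sum, 2)

# A zero token count does not give inf; Python raises instead.
try:
    rmse(8.0, 0)
    zero_tokens_raises = False
except ZeroDivisionError:
    zero_tokens_raises = True
```

Under these assumptions, only a non-finite `squared_error` produces an `inf` RMSE, which matches the diagnosis that the problem lies in the predictions or targets rather than in the token count.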

Let me know if you can reproduce this error so we can get to the bottom of it!

edufierro commented 5 years ago

Will do. Thanks @captainvera & team!

captainvera commented 5 years ago

Hey @edufierro, since we're not able to reproduce this issue I'll be closing it for now. If you get this error again or find any way to reproduce it, please feel free to re-open this!