When resuming training via the `--continue_path` argument, first the following error is logged, but training continues:

```
Traceback (most recent call last):
  File ".../lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File ".../lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File ".../lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File ".../lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: must be real number, not dict
Call stack:
  File ".../train_vits.py", line 120, in <module>
    trainer.fit()
  File ".../trainer/trainer.py", line 1826, in fit
    self._fit()
  File ".../trainer/trainer.py", line 1764, in _fit
    self._restore_best_loss()
  File ".../trainer/trainer.py", line 1728, in _restore_best_loss
    logger.info(" > Starting with loaded last best loss %f", self.best_loss)
Message: ' > Starting with loaded last best loss %f'
Arguments: {'train_loss': 18.22130616851475, 'eval_loss': None}
```
Then the following error occurs at the end of the epoch and training stops:

```
  File ".../trainer/io.py", line 183, in save_best_model
    if current_loss < best_loss:
TypeError: '<' not supported between instances of 'float' and 'dict'
```
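Both failures come down to a dict being used where a float is expected. A minimal standalone reproduction, independent of the Trainer code:

```python
# The loaded best loss is a dict (new format) instead of a float (old format).
best_loss = {"train_loss": 18.22130616851475, "eval_loss": None}

# First error: logging formats the message lazily via `msg % args`,
# and "%f" cannot format a dict.
try:
    " > Starting with loaded last best loss %f" % best_loss
except TypeError as e:
    print(e)  # must be real number, not dict

# Second error: comparing a float against a dict is undefined,
# which is what save_best_model() ends up doing.
try:
    17.5 < best_loss
except TypeError as e:
    print(e)  # '<' not supported between instances of 'float' and 'dict'
```

The first error is non-fatal because the logging module only logs formatting failures, while the second raises inside the training loop itself.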
TLDR: fixes loading of models via `--continue_path`, which currently fails with the errors shown above.
This has been observed multiple times; there are multiple open PRs to fix some aspects of the issue, and others have fixed it in their Trainer forks.
## The reason
This error occurs because #121 changed https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/trainer.py#L1924 to save the `model_loss` as a dict instead of just a float. https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/io.py#L195 still saves a float in `model_loss`, so loading the best model would still work fine. Loading a model via `--restore-path` also works fine because in that case the best loss is reset rather than initialised from the saved model.

## This fix

- changes `save_best_model()` to also save a dict with train and eval loss, so that this is consistent everywhere
- still supports loading a float `model_loss` for backwards compatibility
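The backwards-compatibility part can be sketched as follows. This is an illustrative helper under assumed names (`normalize_model_loss` and `is_improvement` are not the actual Trainer API), showing how a legacy float checkpoint and a new dict checkpoint can be made comparable:

```python
def normalize_model_loss(loss):
    """Return `model_loss` in the new dict form.

    Older checkpoints store a plain float, newer ones a dict with
    "train_loss" and "eval_loss" keys. (Illustrative sketch, not the
    actual Trainer implementation.)
    """
    if isinstance(loss, dict):
        return loss
    # Backwards compatibility: wrap the legacy float format.
    return {"train_loss": loss, "eval_loss": None}


def is_improvement(current, best, use_eval=False):
    """Compare two normalized losses on a single, comparable key."""
    key = "eval_loss" if use_eval and best["eval_loss"] is not None else "train_loss"
    return current[key] < best[key]


old_style = normalize_model_loss(18.22)  # legacy float checkpoint
new_style = normalize_model_loss({"train_loss": 17.9, "eval_loss": None})
print(is_improvement(new_style, old_style))  # True: 17.9 < 18.22
```

Normalising at load time keeps the float-vs-dict decision in one place, so comparison and logging code only ever see the dict form.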