When resuming training via the `--continue_path` argument, first the following error is logged, but training continues:

```
Traceback (most recent call last):
  File ".../lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File ".../lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File ".../lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File ".../lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: must be real number, not dict
Call stack:
  File ".../train_vits.py", line 120, in <module>
    trainer.fit()
  File ".../trainer/trainer.py", line 1826, in fit
    self._fit()
  File ".../trainer/trainer.py", line 1764, in _fit
    self._restore_best_loss()
  File ".../trainer/trainer.py", line 1728, in _restore_best_loss
    logger.info(" > Starting with loaded last best loss %f", self.best_loss)
Message: ' > Starting with loaded last best loss %f'
Arguments: {'train_loss': 18.22130616851475, 'eval_loss': None}
```
Then the following error occurs at the end of the epoch and training stops:

```
  File ".../trainer/io.py", line 183, in save_best_model
    if current_loss < best_loss:
TypeError: '<' not supported between instances of 'float' and 'dict'
```
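Both failures come down to a dict being used where a float is expected. A minimal standalone reproduction, independent of the Trainer code:

```python
# The loaded best loss is a dict (new format) instead of a float (old format).
best_loss = {"train_loss": 18.22130616851475, "eval_loss": None}

# First error: logging formats the message lazily via `msg % args`,
# and "%f" cannot format a dict.
try:
    " > Starting with loaded last best loss %f" % best_loss
except TypeError as e:
    print(e)  # must be real number, not dict

# Second error: comparing a float against a dict is undefined,
# which is what save_best_model() ends up doing.
try:
    17.5 < best_loss
except TypeError as e:
    print(e)  # '<' not supported between instances of 'float' and 'dict'
```

The first error is non-fatal because the logging module only logs formatting failures, while the second raises inside the training loop itself.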
TLDR: fixes loading of models via `--continue_path`, which currently fails with the errors shown above.
This has been observed multiple times; there are multiple open PRs to fix some aspects of the issue, and others have fixed it in their Trainer forks.
## The reason
This error occurs because #121 changed https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/trainer.py#L1924 to save the `model_loss` as a dict instead of just a float. https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/io.py#L195 still saves a float in `model_loss`, so loading the best model would still work fine. Loading a model via `--restore-path` also works fine because in that case the best loss is reset rather than initialised from the saved model.

## This fix

- changes `save_best_model()` to also save a dict with train and eval loss, so that this is consistent everywhere
- still supports loading a float `model_loss` for backwards compatibility
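The backwards-compatibility part can be sketched as follows. This is an illustrative helper under assumed names (`normalize_model_loss` and `is_improvement` are not the actual Trainer API), showing how a legacy float checkpoint and a new dict checkpoint can be made comparable:

```python
def normalize_model_loss(loss):
    """Return `model_loss` in the new dict form.

    Older checkpoints store a plain float, newer ones a dict with
    "train_loss" and "eval_loss" keys. (Illustrative sketch, not the
    actual Trainer implementation.)
    """
    if isinstance(loss, dict):
        return loss
    # Backwards compatibility: wrap the legacy float format.
    return {"train_loss": loss, "eval_loss": None}


def is_improvement(current, best, use_eval=False):
    """Compare two normalized losses on a single, comparable key."""
    key = "eval_loss" if use_eval and best["eval_loss"] is not None else "train_loss"
    return current[key] < best[key]


old_style = normalize_model_loss(18.22)  # legacy float checkpoint
new_style = normalize_model_loss({"train_loss": 17.9, "eval_loss": None})
print(is_improvement(new_style, old_style))  # True: 17.9 < 18.22
```

Normalising at load time keeps the float-vs-dict decision in one place, so comparison and logging code only ever see the dict form.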