bricksdont closed this issue 2 years ago
Hi Mathias! The warning during inference shouldn't be an issue; it is expected because we trace a module that returns a list of tensors. We can look into suppressing this warning in the future.
Regarding the error, can you provide the commit/minor version of Sockeye you are running translation on, and the commit/minor version the model was trained with?
Thanks Felix, both training and translation are done with this version:
[INFO:sockeye.utils] Sockeye: 3.1.7, commit b24b2c1352e71659fd61e49f9384f255e4161e5a
[INFO:sockeye.utils] PyTorch: 1.11.0+cu102
Hi @bricksdont,
There was an issue with 3.1.7 where extra copies of the parameters were saved (prefixed with "traced").
If you update to the current main branch (backward compatible), these extra parameters will be filtered out when models load, and newly trained models won't save the "traced" parameters.
Best, Michael
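For readers wondering what "filtered out when models load" could look like, here is a minimal sketch. It is not Sockeye's actual implementation: the function name `drop_prefixed_keys`, the parameter names, and the rule of matching any dotted name component that starts with "traced" are all assumptions made for illustration. A plain dictionary stands in for a model's parameter dictionary.

```python
from collections import OrderedDict


def drop_prefixed_keys(state_dict, prefix="traced"):
    """Split a parameter dictionary into kept entries and dropped key names.

    A key is dropped if any dot-separated component of its name starts
    with `prefix`. This matching rule is an assumption for illustration.
    """
    kept = OrderedDict()
    dropped = []
    for name, value in state_dict.items():
        if any(part.startswith(prefix) for part in name.split(".")):
            dropped.append(name)
        else:
            kept[name] = value
    return kept, dropped


# Hypothetical parameter names; real Sockeye checkpoints may look different.
params = OrderedDict([
    ("encoder.layers.0.weight", 1.0),
    ("traced_encoder.layers.0.weight", 1.0),
    ("decoder.traced_step.bias", 0.5),
    ("output_layer.bias", 0.0),
])

filtered, removed = drop_prefixed_keys(params)
print(list(filtered))  # names that would be loaded
print(removed)         # names that would be ignored
```

Returning the dropped names alongside the filtered dictionary makes it easy to log what was skipped, which helps when a model still fails to load afterwards.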
Thanks @mjdenkowski! Is there a chance you could make a new release 3.1.8 with this fix? If people just install with pip and 3.1.7 is the newest release, I believe this could cause a lot of confusion.
Of course, the 3.1.9 release should be on PyPI in a few minutes: https://github.com/awslabs/sockeye/releases/tag/3.1.9
Thanks Felix! Closing this issue, assuming this will solve my problem. Have a nice day, everyone.
I am not sure the problem is solved yet. I have now installed 3.1.9, and similar errors occur during training (when creating or loading a checkpoint) instead of during translation.
I'm attaching a full log file.
@mjdenkowski I would be grateful if you could have another look, since I am not familiar enough with this code.
This is a different but related issue. Thanks for your report.
As a temporary workaround, disabling the checkpoint decoder with --decode-and-evaluate 0 should prevent any traced layers from being created during training.
@bricksdont #1042 is merged and v3.1.10 has been released to pypi.
@fhieber @mjdenkowski Thanks for your swift replies & help! I will close this issue once I can confirm that the issue is resolved.
Having now finally switched to torch-based Sockeye, I get this warning during training:
and the following error during translation:
All model parameters are prefixed with "traced_" and the inference code thinks those are superfluous keys.
Is this something you immediately recognize and would know how to fix? Of course I can prepare a self-contained test case to reproduce this, but thought perhaps there is a simple explanation. Thanks!
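A quick way to confirm the symptom described above is to count how many parameter names carry the "traced_" prefix. The sketch below is only an illustration under stated assumptions: `summarize_key_prefixes` is a hypothetical helper, and the parameter names are made up rather than taken from a real Sockeye model.

```python
from collections import Counter


def summarize_key_prefixes(names, marker="traced_"):
    """Return (number of names starting with marker, number of other names)."""
    counts = Counter(
        "traced" if name.startswith(marker) else "regular" for name in names
    )
    return counts["traced"], counts["regular"]


# Hypothetical parameter names standing in for the keys of a saved model.
names = [
    "traced_encoder.layers.0.weight",
    "traced_decoder.layers.0.weight",
    "output_layer.bias",
]

traced_count, regular_count = summarize_key_prefixes(names)
print(f"traced: {traced_count}, regular: {regular_count}")
```

With a real model, the names would come from the keys of the loaded parameter dictionary; how that file is laid out is an assumption not verified here.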