awslabs / sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
https://awslabs.github.io/sockeye/
Apache License 2.0

Model parameter keys are prefixed with "traced" #1040

Closed bricksdont closed 2 years ago

bricksdont commented 2 years ago

Having finally switched to the torch-based Sockeye, I get this warning during training:

/net/cephfs/shares/volk.cl.uzh/mathmu/easier-gloss-translation/venvs/sockeye3/lib/python3.7/site-packages/torch/jit/_trace.py:965: TracerWarning: Encountering a list at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for list, use a tuple instead. for dict, use a NamedTuple instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.

and the following error during translation:

[ERROR:root] Uncaught exception
Traceback (most recent call last):
  File "/net/cephfs/shares/volk.cl.uzh/mathmu/easier-gloss-translation/venvs/sockeye3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/net/cephfs/shares/volk.cl.uzh/mathmu/easier-gloss-translation/venvs/sockeye3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/net/cephfs/shares/volk.cl.uzh/mathmu/easier-gloss-translation/venvs/sockeye3/lib/python3.7/site-packages/sockeye/translate.py", line 264, in <module>
    main()
  File "/net/cephfs/shares/volk.cl.uzh/mathmu/easier-gloss-translation/venvs/sockeye3/lib/python3.7/site-packages/sockeye/translate.py", line 43, in main
    run_translate(args)
  File "/net/cephfs/shares/volk.cl.uzh/mathmu/easier-gloss-translation/venvs/sockeye3/lib/python3.7/site-packages/sockeye/translate.py", line 79, in run_translate
    inference_only=True)
  File "/net/cephfs/shares/volk.cl.uzh/mathmu/easier-gloss-translation/venvs/sockeye3/lib/python3.7/site-packages/sockeye/model.py", line 685, in load_models
    forward_pass_cache_size=forward_pass_cache_size)
  File "/net/cephfs/shares/volk.cl.uzh/mathmu/easier-gloss-translation/venvs/sockeye3/lib/python3.7/site-packages/sockeye/model.py", line 616, in load_model
    ignore_extra=False)
  File "/net/cephfs/shares/volk.cl.uzh/mathmu/easier-gloss-translation/venvs/sockeye3/lib/python3.7/site-packages/sockeye/model.py", line 357, in load_parameters
    utils.check_condition(not unexpected, f"extra keys: {unexpected}")
  File "/net/cephfs/shares/volk.cl.uzh/mathmu/easier-gloss-translation/venvs/sockeye3/lib/python3.7/site-packages/sockeye/utils.py", line 129, in check_condition
    raise SockeyeError(error_message)
sockeye.utils.SockeyeError: extra keys: ['traced_embedding_source.embedding.weight', 'traced_encoder.pos_embedding.weight', 'traced_encoder.layers.0.pre_self_attention.layer_norm.weight', 'traced_encoder.layers.0.pre_self_attention.layer_norm.bias', 'traced_encoder.layers.0.self_attention.ff_out.weight'
[... all model parameters I think]

All model parameters are prefixed with "traced_" and the inference code thinks those are superfluous keys.
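To illustrate how an "extra keys" check of this kind behaves, here is a minimal sketch in plain Python. It is not Sockeye's actual `load_parameters` code: the helper `find_key_mismatches` is hypothetical, and the example assumes the checkpoint holds both the regular parameter names and `traced_`-prefixed copies of them.

```python
from typing import Iterable, List, Tuple


def find_key_mismatches(saved_keys: Iterable[str],
                        expected_keys: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Compare checkpoint keys against the keys the model expects.

    Hypothetical helper for illustration only, not part of Sockeye.
    Returns (missing, unexpected), each sorted alphabetically.
    """
    saved, expected = set(saved_keys), set(expected_keys)
    missing = sorted(expected - saved)      # model needs them, checkpoint lacks them
    unexpected = sorted(saved - expected)   # checkpoint has them, model does not know them
    return missing, unexpected


# Names the model defines (two taken from the error message, prefix removed).
expected = [
    "embedding_source.embedding.weight",
    "encoder.pos_embedding.weight",
]
# Assumption: the checkpoint contains the regular keys plus "traced_" copies.
saved = expected + ["traced_" + name for name in expected]

missing, unexpected = find_key_mismatches(saved, expected)
print("missing:", missing)
print("unexpected:", unexpected)
```

Under that assumption nothing is missing, but every `traced_` key is reported as unexpected, which is the situation a strict loader rejects.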

Is this something you immediately recognize and would know how to fix? Of course I can prepare a self-contained test case to reproduce this, but thought perhaps there is a simple explanation. Thanks!

fhieber commented 2 years ago

Hi Mathias! The warning during inference shouldn't be an issue; it is expected because we trace a module that returns a list of tensors. We can look into suppressing this warning in the future.

Regarding the error, can you provide the commit/minor version of Sockeye you are running translation on, and the commit/minor version the model was trained with?

bricksdont commented 2 years ago

Thanks Felix, both training and translation are done with this version:

[INFO:sockeye.utils] Sockeye: 3.1.7, commit b24b2c1352e71659fd61e49f9384f255e4161e5a
[INFO:sockeye.utils] PyTorch: 1.11.0+cu102

mjdenkowski commented 2 years ago

Hi @bricksdont,

There was an issue with 3.1.7 where extra copies of the parameters were saved (prefixed with "traced").

If you update to the current main branch (backward compatible), these extra parameters will be filtered out when models load and newly trained models won't save the "traced" parameters.

Best, Michael
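For readers stuck with a checkpoint saved by 3.1.7, the filtering described above can be sketched roughly as follows. This is an illustration in plain Python, not the code on the main branch: `drop_traced_params` and `TRACED_PREFIX` are hypothetical names, and the parameter values are stand-in lists rather than tensors.

```python
from typing import Dict, TypeVar

T = TypeVar("T")

# Assumption: the extra copies are recognizable by this name prefix,
# as in the keys listed in the error message above.
TRACED_PREFIX = "traced"


def drop_traced_params(params: Dict[str, T], prefix: str = TRACED_PREFIX) -> Dict[str, T]:
    """Return a copy of `params` without entries whose name starts with `prefix`.

    Hypothetical helper for illustration only, not part of Sockeye.
    """
    return {name: value for name, value in params.items()
            if not name.startswith(prefix)}


# Stand-in values; in a real checkpoint these would be tensors.
checkpoint = {
    "embedding_source.embedding.weight": [0.1, 0.2],
    "traced_embedding_source.embedding.weight": [0.1, 0.2],
    "encoder.pos_embedding.weight": [0.3],
    "traced_encoder.pos_embedding.weight": [0.3],
}

filtered = drop_traced_params(checkpoint)
print(sorted(filtered))
```

Filtering on load leaves old checkpoints usable without rewriting them, which matches the "backward compatible" note above.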

bricksdont commented 2 years ago

Thanks @mjdenkowski! Is there a chance you could make a new release 3.1.8 with this fix? If people just install with pip and 3.1.7 is the newest release, I believe this could cause a lot of confusion.

fhieber commented 2 years ago

Of course, the 3.1.9 release should be on PyPI in a few minutes: https://github.com/awslabs/sockeye/releases/tag/3.1.9

bricksdont commented 2 years ago

Thanks Felix! Closing this issue, assuming this will solve my problem. Have a nice day, everyone.

bricksdont commented 2 years ago

I am not sure the problem is solved yet. I have now installed 3.1.9, and similar errors occur during training (when creating or loading a checkpoint) instead of during translation.

I'm attaching a full log file.

@mjdenkowski I would be grateful if you could have another look, since I am not familiar enough with this code.

mjdenkowski commented 2 years ago

This is a different but related issue. Thanks for your report.

As a temporary workaround, disabling the checkpoint decoder should prevent any traced layers from being created during training: --decode-and-evaluate 0.

fhieber commented 2 years ago

@bricksdont #1042 is merged and v3.1.10 has been released to PyPI.

bricksdont commented 2 years ago

@fhieber @mjdenkowski Thanks for your swift replies & help! I will close this issue once I can confirm that the issue is resolved.