Open vprecup opened 2 years ago
I've got something similar. My code:
    compiled_sm_model = PyTorchModel(
        model_data=traced_model_url,
        predictor_cls=Predictor,
        framework_version="1.5",
        role=role,
        sagemaker_session=sess,
        entry_point="inference_inf1.py",
        source_dir="aux",
        py_version="py3",
        env={"MMS_DEFAULT_RESPONSE_TIMEOUT": "500"},
    )

    hardware = "inf1"
    compilation_job_name = name_from_base("godel")

    compiled_inf1_model = compiled_sm_model.compile(
        target_instance_family=f"ml_{hardware}",
        input_shape={"input_ids": [1, 512], "attention_mask": [1, 512]},
        job_name=compilation_job_name,
        role=role,
        framework="pytorch",
        framework_version="1.5",
        output_path=f"s3://{sagemaker_session_bucket}/{prefix}/compiled_model",
        compiler_options=json.dumps("--dtype int64"),
        compile_max_run=900,
    )
The error is always:
localhost compiler-container-Primary[5145]: 2022-09-27 15:00:27,229 INFO root Successfully download and extract model into /tmp/models
--
localhost compiler-container-Primary[5145]: 2022-09-27 15:00:28,305 INFO neo_inferentia_compiler.pytorch_framework Neuron Compilation Inputs: model_file /tmp/models/./model.pth output_file /opt/ml/model/compiled/model.pth
localhost compiler-container-Primary[5145]: Traceback (most recent call last):
localhost compiler-container-Primary[5145]: File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 107, in compile_model
localhost compiler-container-Primary[5145]: model = torch.jit.load(self.model_file)
localhost compiler-container-Primary[5145]: File "/opt/amazon/lib/python3.6/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
localhost compiler-container-Primary[5145]: script_module = jit_load(*args, **kwargs)
localhost compiler-container-Primary[5145]: File "/opt/amazon/lib/python3.6/site-packages/torch/jit/_serialization.py", line 161, in load
localhost compiler-container-Primary[5145]: cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
localhost compiler-container-Primary[5145]: RuntimeError: [enforce fail at inline_container.cc:222] . file not found: model/version
localhost compiler-container-Primary[5145]: During handling of the above exception, another exception occurred:
localhost compiler-container-Primary[5145]: Traceback (most recent call last):
localhost compiler-container-Primary[5145]: File "/opt/amazon/bin/neo_main.py", line 101, in <module>
localhost compiler-container-Primary[5145]: compile()
localhost compiler-container-Primary[5145]: File "/opt/amazon/bin/neo_main.py", line 74, in compile
localhost compiler-container-Primary[5145]: compiler_options
localhost compiler-container-Primary[5145]: File "/opt/amazon/bin/neo_main.py", line 32, in compile_model
localhost compiler-container-Primary[5145]: return framework_instance.compile_model()
localhost compiler-container-Primary[5145]: File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 109, in compile_model
localhost compiler-container-Primary[5145]: raise RuntimeError("InputConfiguration: Unable to load PyTorch model:", str(e))
localhost compiler-container-Primary[5145]: RuntimeError: ('InputConfiguration: Unable to load PyTorch model:', '[enforce fail at inline_container.cc:222] . file not found: model/version')
I'm just trying to figure out whether this is my mistake or something is wrong with SageMaker. Thanks!
P.S. torch version is 1.12.1, sagemaker 2.107.0.
Describe the bug
When attempting to compile a model for ml_inf1 via the SDK (a model trained / fine-tuned with PyTorch 1.9.1), the framework_version argument is ignored, resulting in a mismatch between the version used at training time and the one chosen automatically by SageMaker.
To reproduce
Expected behavior
The Compilation Job should also contain the "Framework version" when opened in the AWS Console. However, only the PYTORCH Framework value is present, and the compilation fails after 5 minutes with the error message:
ClientError: InputConfiguration: Unable to load PyTorch model:', '\nUnknown type name \'NoneType\':\nSerialized File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7\n _is_full_backward_hook : Optional[bool]\n def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh,\n argument_1: Tensor) -> NoneType:\n ~~~~~~~~ <--- HERE\n return None\n') For further troubleshooting common failures please visit: https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html
If, however, I clone the failed job in the AWS Console and just add the 1.9 "Framework version" manually, the job runs to completion.
Additional context
The problem may lie in the negative lookahead regex group (?!ml_inf) at this line: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L735. Is this condition still applicable?
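To illustrate the suspicion (a minimal sketch of the regex behavior, not the SDK's exact code path): a zero-width negative lookahead like (?!ml_inf) matches only when the string does not start with ml_inf, so any guard built on it would skip its branch precisely for Inferentia targets:

```python
import re

# (?!ml_inf) is a zero-width negative lookahead: re.match succeeds at
# position 0 only if the string does NOT continue with "ml_inf".
pattern = re.compile(r"(?!ml_inf)")

print(pattern.match("ml_c5"))    # a match object: non-Inferentia targets pass the guard
print(pattern.match("ml_inf1"))  # None: the guard fails for Inferentia targets
```

If the framework-version handling sits behind such a guard, ml_inf1 jobs would never get a FrameworkVersion set, which matches the observed Console behavior.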