aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Model.compile ignores framework_version when compiling for ml_inf1 #3209

Open vprecup opened 2 years ago

vprecup commented 2 years ago

Describe the bug When compiling a model for ml_inf1 via the SDK — a model that was trained / fine-tuned with PyTorch 1.9.1 — the framework_version argument is ignored, resulting in a mismatch between the version used at training time and the one SageMaker chooses automatically.

To reproduce

import json
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor

sm_model = PyTorchModel(
    model_data=traced_model_url,
    predictor_cls=Predictor,
    framework_version="1.9",
    role="<role_arn>",
    entry_point="inference.py",
    source_dir="code",
    py_version="py3",
    name="<name>"
) 

compiled_inf_model = sm_model.compile(
    target_instance_family="ml_inf1",
    input_shape=<input_shape>,
    job_name="<job_name>",
    role="<role_arn>",
    framework="pytorch",
    framework_version="1.9",
    output_path="<output_path>",
    compiler_options=json.dumps("--dtype int64"),
    compile_max_run=1000, 
)

Expected behavior The Compilation Job should also contain the "Framework version" when opened in the AWS Console. However, only the PYTORCH Framework value is present, and the compilation fails after 5 minutes with:

ClientError: InputConfiguration: Unable to load PyTorch model:
Unknown type name 'NoneType':
Serialized File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7
  _is_full_backward_hook : Optional[bool]
  def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh,
    argument_1: Tensor) -> NoneType:
                           ~~~~~~~~ <--- HERE
    return None

For further troubleshooting of common failures, see: https://docs.aws.amazon.com/sagemaker/latest/dg/neo-troubleshooting-compilation.html


If, however, I clone the failed job in the AWS Console and just add the 1.9 "Framework version" manually, the job runs to completion.
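Until the SDK passes the version through, the same fix can be scripted by calling CreateCompilationJob directly with FrameworkVersion set, which is what the Console clone-and-edit does. A minimal sketch — the placeholder names and the DataInputConfig shape are assumptions, not values from this report:

```python
# Sketch of a workaround: build a CreateCompilationJob request that sets
# FrameworkVersion explicitly (the field the SDK currently omits for ml_inf1).
# All <...> values and the DataInputConfig shape are hypothetical placeholders.
request = {
    "CompilationJobName": "<job_name>",
    "RoleArn": "<role_arn>",
    "InputConfig": {
        "S3Uri": "<traced_model_url>",
        "DataInputConfig": '{"input0": [1, 3, 224, 224]}',
        "Framework": "PYTORCH",
        "FrameworkVersion": "1.9",  # set explicitly, as in the Console workaround
    },
    "OutputConfig": {
        "S3OutputLocation": "<output_path>",
        "TargetDevice": "ml_inf1",
        "CompilerOptions": '"--dtype int64"',
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 1000},
}

# Submitting it requires AWS credentials, so the call is left commented out:
# import boto3
# boto3.client("sagemaker").create_compilation_job(**request)
```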

Screenshots or logs

localhost compiler-container-Primary[5078]: Traceback (most recent call last):
--
localhost compiler-container-Primary[5078]:   File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 107, in compile_model
localhost compiler-container-Primary[5078]:     model = torch.jit.load(self.model_file)
localhost compiler-container-Primary[5078]:   File "/opt/amazon/lib/python3.6/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
localhost compiler-container-Primary[5078]:     script_module = jit_load(*args, **kwargs)
localhost compiler-container-Primary[5078]:   File "/opt/amazon/lib/python3.6/site-packages/torch/jit/_serialization.py", line 161, in load
localhost compiler-container-Primary[5078]:     cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
localhost compiler-container-Primary[5078]: RuntimeError:
localhost compiler-container-Primary[5078]: Unknown type name 'NoneType':
localhost compiler-container-Primary[5078]: Serialized   File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7
localhost compiler-container-Primary[5078]:   _is_full_backward_hook : Optional[bool]
localhost compiler-container-Primary[5078]:   def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh,
localhost compiler-container-Primary[5078]:     argument_1: Tensor) -> NoneType:
localhost compiler-container-Primary[5078]:                            ~~~~~~~~ <--- HERE
localhost compiler-container-Primary[5078]:     return None
localhost compiler-container-Primary[5078]: During handling of the above exception, another exception occurred:
localhost compiler-container-Primary[5078]: Traceback (most recent call last):
localhost compiler-container-Primary[5078]:   File "/opt/amazon/bin/neo_main.py", line 101, in <module>
localhost compiler-container-Primary[5078]:     compile()
localhost compiler-container-Primary[5078]:   File "/opt/amazon/bin/neo_main.py", line 74, in compile
localhost compiler-container-Primary[5078]:     compiler_options
localhost compiler-container-Primary[5078]:   File "/opt/amazon/bin/neo_main.py", line 32, in compile_model
localhost compiler-container-Primary[5078]:     return framework_instance.compile_model()
localhost compiler-container-Primary[5078]:   File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 109, in compile_model
localhost compiler-container-Primary[5078]:     raise RuntimeError("InputConfiguration: Unable to load PyTorch model:", str(e))
localhost compiler-container-Primary[5078]: RuntimeError: ('InputConfiguration: Unable to load PyTorch model:', ' Unknown type name \'NoneType\': Serialized   File "code/__torch__/torch/nn/modules/activation/___torch_mangle_8258.py", line 7   _is_full_backward_hook : Optional[bool]   def forward(self: __torch__.torch.nn.modules.activation.___torch_mangle_8258.Tanh,     argument_1: Tensor) -> NoneType:                            ~~~~~~~~ <--- HERE     return None ')

System information Not provided.

Additional context The problem may lie in the negative lookahead regex group (?!ml_inf) at this line: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L735. Is this condition still applicable?
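To illustrate the suspicion: a negative lookahead of that form rejects exactly the ml_inf targets, so any framework-version handling guarded by it would be skipped for Inferentia. This is a minimal sketch of the regex behavior, not the SDK's actual pattern:

```python
import re

# Hypothetical pattern modeled on the linked check: match instance families
# except ml_inf* ones. Anything guarded by this match (e.g. attaching
# framework_version) never runs for ml_inf1.
pattern = re.compile(r"(?!ml_inf)ml_.*")

print(bool(pattern.match("ml_c5")))    # matches, so the guarded code runs
print(bool(pattern.match("ml_inf1")))  # no match, guarded code is skipped
```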

kosmobiker commented 1 year ago

I've got something similar. My code:

import json

from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

compiled_sm_model = PyTorchModel(
    model_data=traced_model_url,
    predictor_cls=Predictor,
    framework_version="1.5",
    role=role,
    sagemaker_session=sess,
    entry_point="inference_inf1.py",
    source_dir="aux",
    py_version="py3",
    env={"MMS_DEFAULT_RESPONSE_TIMEOUT": "500"},
)

hardware = "inf1"
compilation_job_name = name_from_base("godel")

compiled_inf1_model = compiled_sm_model.compile(
    target_instance_family=f"ml_{hardware}",
    input_shape={"input_ids": [1, 512], "attention_mask": [1, 512]},
    job_name=compilation_job_name,
    role=role,
    framework="pytorch",
    framework_version="1.5",
    output_path=f"s3://{sagemaker_session_bucket}/{prefix}/compiled_model",
    compiler_options=json.dumps("--dtype int64"),
    compile_max_run=900,
)

The error is consistently:


localhost compiler-container-Primary[5145]: 2022-09-27 15:00:27,229 INFO root Successfully download and extract model into /tmp/models
--
localhost compiler-container-Primary[5145]: 2022-09-27 15:00:28,305 INFO neo_inferentia_compiler.pytorch_framework Neuron Compilation Inputs: model_file /tmp/models/./model.pth output_file /opt/ml/model/compiled/model.pth
localhost compiler-container-Primary[5145]: Traceback (most recent call last):
localhost compiler-container-Primary[5145]:   File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 107, in compile_model
localhost compiler-container-Primary[5145]:     model = torch.jit.load(self.model_file)
localhost compiler-container-Primary[5145]:   File "/opt/amazon/lib/python3.6/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
localhost compiler-container-Primary[5145]:     script_module = jit_load(*args, **kwargs)
localhost compiler-container-Primary[5145]:   File "/opt/amazon/lib/python3.6/site-packages/torch/jit/_serialization.py", line 161, in load
localhost compiler-container-Primary[5145]:     cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
localhost compiler-container-Primary[5145]: RuntimeError: [enforce fail at inline_container.cc:222] . file not found: model/version
localhost compiler-container-Primary[5145]: During handling of the above exception, another exception occurred:
localhost compiler-container-Primary[5145]: Traceback (most recent call last):
localhost compiler-container-Primary[5145]:   File "/opt/amazon/bin/neo_main.py", line 101, in <module>
localhost compiler-container-Primary[5145]:     compile()
localhost compiler-container-Primary[5145]:   File "/opt/amazon/bin/neo_main.py", line 74, in compile
localhost compiler-container-Primary[5145]:     compiler_options
localhost compiler-container-Primary[5145]:   File "/opt/amazon/bin/neo_main.py", line 32, in compile_model
localhost compiler-container-Primary[5145]:     return framework_instance.compile_model()
localhost compiler-container-Primary[5145]:   File "/opt/amazon/lib/python3.6/site-packages/neo_inferentia_compiler/pytorch_framework.py", line 109, in compile_model
localhost compiler-container-Primary[5145]:     raise RuntimeError("InputConfiguration: Unable to load PyTorch model:", str(e))
localhost compiler-container-Primary[5145]: RuntimeError: ('InputConfiguration: Unable to load PyTorch model:', '[enforce fail at inline_container.cc:222] . file not found: model/version')

I'm just trying to figure out whether this is my mistake or something is not OK with SageMaker. Thanks

PS: torch version is 1.12.1, sagemaker 2.107.0
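A note on the second error: the Neo container calls torch.jit.load, which expects a TorchScript archive (a zip containing an internal "version" entry). "file not found: model/version" is the failure you typically get when the artifact was written with torch.save instead of torch.jit.save. A minimal sketch of producing a loadable artifact, using a toy model rather than anything from this thread:

```python
import tarfile

import torch

# Toy stand-in for the real model; the point is the save format, not the model.
model = torch.nn.Linear(4, 2).eval()
example = torch.rand(1, 4)

# Trace and save as TorchScript. torch.jit.save writes the zip archive
# (including the "version" entry) that torch.jit.load in the Neo container
# expects; a plain torch.save pickle would trigger "file not found: model/version".
traced = torch.jit.trace(model, example)
torch.jit.save(traced, "model.pth")

# Package for S3 upload as model_data.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pth")
```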