aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

PyTorch container version 1.5.1 deployment with Elastic Inference error leads to TorchScript model trace being unable to be loaded #2911

Closed cm2435 closed 2 years ago

cm2435 commented 2 years ago

Describe the bug
Hello all! This is my first time raising an issue with SageMaker, so forgive me if this is the wrong place or format. I am trying to deploy a BERT model that returns sentence embeddings in the PyTorch container, with the model served as a TorchScript trace (.pt file). Deployment with framework version 1.5 works as intended until an Elastic Inference accelerator is attached to the deployed model, at which point the model server logs show that the model cannot be loaded. Using framework version 1.5.1 instead yields an import error from torcheia.

To reproduce
I am following the documentation on how to load a TorchScript model in model_fn, linked below:
https://docs.aws.amazon.com/elastic-inference/latest/developerguide/ei-pytorch-using.html
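For reference, a traced_bert.pt artifact of this kind can be produced roughly as follows (a minimal sketch; the dummy input text, shape, and max length are placeholders, not necessarily what I used):

import torch
from transformers import BertModel, BertTokenizer

# Sketch: export bert-base-uncased as a TorchScript trace.
# torchscript=True makes the model return tuples, which torch.jit.trace needs.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True).eval()

# Dummy input used only to record the trace; length 128 is a placeholder.
dummy = tokenizer("example sentence", return_tensors="pt",
                  padding="max_length", max_length=128, truncation=True)

traced = torch.jit.trace(model, (dummy["input_ids"], dummy["attention_mask"]))
torch.jit.save(traced, "traced_bert.pt")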

I load my model using the following:

import logging
import os

import torch
from transformers import BertTokenizer

logger = logging.getLogger(__name__)


def model_fn(model_dir):
    logger.info('model_fn')
    torch._C._jit_set_profiling_executor(False)

    # Load the TorchScript trace and the matching tokenizer.
    jit_path = os.path.join(model_dir, "traced_bert.pt")
    traced_model = torch.jit.load(jit_path, map_location=torch.device('cpu'))
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    if torch.__version__ == '1.5.1':
        import torcheia
        traced_model = traced_model.eval()
        # attach_eia() is introduced in PyTorch Elastic Inference 1.5.1
        traced_model = torcheia.jit.attach_eia(traced_model, 0)

        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        logger.info(model_dir)
        traced_model.to(device)

    model_dict = {'model': traced_model, 'tokenizer': tokenizer}
    return model_dict
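For context, the predict_fn that pairs with this model_fn looks roughly like the sketch below (the mean-pooling step and the plain-list return type are illustrative, not my exact code):

def predict_fn(input_data, model_dict):
    # Illustrative sketch only: tokenize, run the traced model, mean-pool.
    tokenizer = model_dict['tokenizer']
    model = model_dict['model']

    sentences = input_data if isinstance(input_data, list) else [input_data]
    encoded = tokenizer(sentences, padding=True, truncation=True,
                        max_length=128, return_tensors='pt')

    with torch.no_grad():
        # The traced BERT returns a tuple; element 0 is the token embeddings.
        token_embeddings = model(encoded['input_ids'],
                                 encoded['attention_mask'])[0]

    # Mean-pool over tokens, ignoring padding positions.
    mask = encoded['attention_mask'].unsqueeze(-1).float()
    embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return embeddings.tolist()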

This lives inside an inference.py file in a ./code subdirectory, as specified by the documentation. The PyTorchModel is initialized as follows:

from sagemaker.pytorch import PyTorch, PyTorchModel
from sagemaker.predictor import Predictor
from sagemaker import get_execution_role


class StringPredictor(Predictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session)


pytorch_model = PyTorchModel(model_data=pt_model_data,
                             role=role,
                             entry_point='inference.py',
                             source_dir='./code',
                             py_version='py3',
                             framework_version='1.5.1',
                             predictor_cls=StringPredictor)

Expected behavior
This should, as it does without Elastic Inference, load the BERT model trace from the traced_bert.pt file and use it to return embeddings for an input string or list of strings. Instead, deployment fails with the CloudWatch logs below.

Screenshots or logs


2022-02-08 16:01:24,875 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.
2022-02-08 16:01:24,875 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9000-66ac40cc-6e0e4860-8bac23a9-f092939f-4a6b7bf9-d703e7b2-d14f0c4a-a02d37b7-2786bf9f in 34 seconds.
2022-02-08 16:01:24,875 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.module = importlib.import_module(module_name)
2022-02-08 16:01:24,875 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return _bootstrap._gcd_import(name[level:], package, level)
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 994, in _gcd_import
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap_external>", line 678, in exec_module
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/handler_service.py", line 17, in <module>
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     from sagemaker_pytorch_serving_container.default_inference_handler import \
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/default_inference_handler.py", line 18, in <module>
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     import torch, torcheia
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/torcheia/__init__.py", line 1, in <module>
2022-02-08 16:01:24,876 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     from _torch_eia import *
2022-02-08 16:01:24,877 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ImportError: /opt/conda/lib/python3.6/site-packages/_torch_eia.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZTVN5torch8autograd8profiler14RecordFunctionE

System information
SageMaker PyTorch framework version 1.5.1 (1.5.0 also tested), py_version py3, CPU endpoint (ml.c4.xlarge) with an ml.eia2.medium Elastic Inference accelerator attached.

Additional context

Deployment Specs

import time

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

local_mode = False

if local_mode:
    instance_type = "local"
else:
    instance_type = "ml.c4.xlarge"

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=f'similar-language-search-{int(time.time())}',
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
    accelerator_type='ml.eia2.medium'
)
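Once an endpoint does come up (it does on 1.5.0 without the accelerator), I call it roughly like this; the list-of-strings payload is just whatever my input_fn expects, shown here as an assumption:

payload = ["first test sentence", "second test sentence"]

# JSONSerializer/JSONDeserializer handle the request and response bodies.
embeddings = predictor.predict(payload)
print(len(embeddings), len(embeddings[0]))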
solidmetanoia commented 2 years ago

@cm2435 Sorry for necroing the issue, but have you found out what the problem was? I've got the exact same problem, a torcheia import failure with the same undefined symbol, though on Python 3.7.

Edit: trace

Traceback (most recent call last):
  File "stub.py", line 13, in <module>
    import global_vars
  File "/home/ec2-user/global_vars.py", line 3, in <module>
    from detectors.yoloxtorch import YoloxTorchDetector
  File "/home/ec2-user/detectors/yoloxtorch.py", line 25, in <module>
    import torcheia
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torcheia/__init__.py", line 1, in <module>
    from _torch_eia import * 
ImportError: /home/ec2-user/.local/lib/python3.7/site-packages/_torch_eia.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZTVN5torch8autograd8profiler14RecordFunctionE
cm2435 commented 2 years ago

@solidmetanoia No worries :)

Honestly, I never managed to fix this. My guess is that it's a package mismatch: the official PyTorch Deep Learning Container from the SageMaker team ships a different version of CUDA/cuDNN than the EIA package requires.
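If it helps, here is the kind of quick check you can run inside the container (or in a notebook on the same image) to surface the mismatch; purely a diagnostic sketch:

import importlib
import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)

try:
    torcheia = importlib.import_module("torcheia")
    print("torcheia loaded from", torcheia.__file__)
except ImportError as exc:
    # On the failing image this reproduces the undefined-symbol error; the
    # symbol demangles to the vtable of torch::autograd::profiler::RecordFunction,
    # which suggests torcheia was built against a different libtorch build.
    print("torcheia import failed:", exc)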

In the end I just ended up rolling my own serving/training container using a FastAPI/nginx stack with a gunicorn web server for concurrency. Let me know if you want the boilerplate for it and I will share.
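At a high level the boilerplate looks like the sketch below (heavily stripped down; the route names, payload shape, env var, and model path are placeholders):

import os
from typing import List

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import BertTokenizer

app = FastAPI()

# Model path and env var name are placeholders for the sketch.
MODEL_DIR = os.environ.get("MODEL_DIR", "/opt/ml/model")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = torch.jit.load(os.path.join(MODEL_DIR, "traced_bert.pt"),
                       map_location="cpu").eval()


class EmbedRequest(BaseModel):
    sentences: List[str]


@app.post("/invocations")
def embed(req: EmbedRequest):
    encoded = tokenizer(req.sentences, padding=True, truncation=True,
                        return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(encoded["input_ids"],
                                 encoded["attention_mask"])[0]
    # Mean pooling over tokens, masking out padding.
    mask = encoded["attention_mask"].unsqueeze(-1).float()
    embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return {"embeddings": embeddings.tolist()}


@app.get("/ping")
def ping():
    return {"status": "ok"}

It runs under gunicorn with the uvicorn worker class (gunicorn -k uvicorn.workers.UvicornWorker), with nginx in front as a reverse proxy.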

solidmetanoia commented 2 years ago

Yeah, nah, thank you. I'll keep trying to somehow connect things. :+1: