aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Cannot Deploy Endpoint #1835

Closed goldmermaid closed 3 years ago

goldmermaid commented 4 years ago

Describe the bug

Error hosting endpoint xxxxxx: Failed. Reason:  Please make sure all images included in the model for the production variant AllTraffic exist, and that the execution role used to create the model has permissions to access them..

To reproduce

sagemaker_model = MXNetModel(model_data=trained_model_upload, 
                             image='emnlp:opt', # docker images
                             role=sagemaker.get_execution_role(), 
                             py_version='py3',            # python version
                             entry_point='serve.py',
#                              source_dir='.'
                            )

Please see more details here: https://github.com/goldmermaid/KDD2020/blob/master/EMNLP/train_deploy_bert.ipynb.

Expected behavior

Deployed to endpoint.

System information

Sagemaker 2.3.0

Additional context

Thanks for the help!

metrizable commented 4 years ago

Hello @goldmermaid

Thank you for using Amazon SageMaker. I see that you have specified image emnlp:opt. The error message indicates there may be an issue with it:

Please make sure all images included in the model for the production variant AllTraffic exist, and that the execution role used to create the model has permissions to access them.

Can you confirm 1) the existence of the image uri you have specified, and 2) that the execution role you are using has access to it?

Best regards

goldmermaid commented 4 years ago

Hi @metrizable, thank you for the quick reply. I confirm the image URI exists. (I checked by calling docker image ls, and this image is there.) For your second question, I am not sure whether I set up access correctly. Could you give me some hints on how to grant access? Thanks!
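For context, docker image ls only lists images stored on the local machine, while a hosted endpoint pulls its container from a registry such as Amazon ECR. A minimal sketch of a registry-side check follows, assuming a boto3 ECR client is passed in; the function name image_exists_in_ecr is hypothetical, and the repository and tag values are taken from the image string in this thread purely for illustration.

```python
def image_exists_in_ecr(ecr_client, repository: str, tag: str) -> bool:
    """Return True if `repository:tag` is present in the ECR registry.

    `ecr_client` is assumed to be a boto3 ECR client, for example
    boto3.client("ecr", region_name="<your-region>").
    """
    try:
        # describe_images raises a modeled exception when the repository
        # or the requested image tag cannot be found.
        ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
    except (
        ecr_client.exceptions.RepositoryNotFoundException,
        ecr_client.exceptions.ImageNotFoundException,
    ):
        return False
    return True
```

Calling image_exists_in_ecr(boto3.client("ecr"), "emnlp", "opt") with valid credentials would report whether the image has actually been pushed to the account's registry, which a local docker image ls cannot show.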

metrizable commented 4 years ago

@goldmermaid

Just to confirm, are you specifying the full URI (Uniform Resource Identifier) of the image (should be something like <aws_account_id>.dkr.ecr.<your-region>.amazonaws.com/<image-name>:<tag>)? From here, it doesn't look like it. This example notebook uses a custom image specifying the full image URI.
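To make the URI format above concrete, here is a small sketch that assembles a full ECR image URI and checks whether a given string already looks like one. The helper names and the account ID in the usage note are hypothetical, and the pattern assumes the standard amazonaws.com domain (other partitions use different suffixes).

```python
import re

# Shape described above:
# <aws_account_id>.dkr.ecr.<your-region>.amazonaws.com/<image-name>:<tag>
_ECR_URI_PATTERN = re.compile(
    r"^\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com/[^:\s]+:\S+$"
)


def ecr_image_uri(account_id: str, region: str, image_name: str, tag: str) -> str:
    """Assemble a full ECR image URI from its parts."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{image_name}:{tag}"


def looks_like_full_ecr_uri(image: str) -> bool:
    """Return True if `image` matches the full ECR URI shape."""
    return _ECR_URI_PATTERN.match(image) is not None
```

With these helpers, looks_like_full_ecr_uri("emnlp:opt") is False, whereas ecr_image_uri("123456789012", "us-west-2", "emnlp", "opt") produces a string in the expected form (the account ID and region here are placeholders).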

lu-liu-rft commented 3 years ago

Hi, I got the same error when using:

from sagemaker.tensorflow.serving import Model

model = Model(model_data=model_data,
              role=role,
              framework_version='1.15.2',
              sagemaker_session=sagemaker_session,
              name=name)

predictor = model.deploy(initial_instance_count=1,
    instance_type='xxx',
    endpoint_name=name,
    update_endpoint=False)

I did add the following policies to my SageMaker execution role:

{
    "Effect": "Allow",
    "Action": [
        "ecr:SetRepositoryPolicy",
        "ecr:CompleteLayerUpload",
        "ecr:BatchGetImage",
        "ecr:BatchDeleteImage",
        "ecr:UploadLayerPart",
        "ecr:DeleteRepositoryPolicy",
        "ecr:InitiateLayerUpload",
        "ecr:DeleteRepository",
        "ecr:PutImage"
    ],
    "Resource": "arn:aws:ecr:*:*:repository/*sagemaker*"
},

Why do I still get this error, and why can't I deploy the model?
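One thing that may be worth checking: the policy statement above mostly grants push and repository-management actions, while pulling an image from ECR typically also needs ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability, and ecr:GetDownloadUrlForLayer in addition to ecr:BatchGetImage. The Resource pattern also covers only repositories whose names contain "sagemaker". The sketch below, with a hypothetical helper name, reports which pull actions a single Allow statement does not grant. It is an illustration under those assumptions, not a confirmed diagnosis of this error.

```python
from fnmatch import fnmatchcase

# Actions commonly required to pull an image from Amazon ECR.
PULL_ACTIONS = {
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
}


def missing_pull_actions(statement: dict) -> set:
    """Return the pull actions that one IAM statement does not allow.

    Only the "Effect" and "Action" keys are inspected; "Resource" and
    "Condition" are ignored in this simplified check.
    """
    if statement.get("Effect") != "Allow":
        return set(PULL_ACTIONS)
    actions = statement.get("Action", [])
    if isinstance(actions, str):
        actions = [actions]
    # IAM action names are case-insensitive and may contain wildcards.
    patterns = [action.lower() for action in actions]
    return {
        required
        for required in PULL_ACTIONS
        if not any(fnmatchcase(required.lower(), p) for p in patterns)
    }
```

Applied to the statement above, the helper returns the three actions named in the lead-in, since ecr:BatchGetImage is the only pull action present.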