aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Can't deploy pretrained model even after following the documentation #3640

Open bhattbhuwan13 opened 1 year ago

bhattbhuwan13 commented 1 year ago

Discussed in https://github.com/aws/sagemaker-python-sdk/discussions/3638

Originally posted by **monika-prajapati** February 6, 2023

I have a model that I want to deploy as a SageMaker endpoint. I followed [this documentation](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#bring-your-own-model) and did the following:

- Created an inference.py script with model_fn, input_fn, predict_fn, and output_fn, using [this as reference](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_batch_inference/code/inference.py)
- Made the file/folder structure according to the documentation and built a model.tar.gz file:

```
.
├── code
│   ├── inference.py
│   └── requirements.txt
└── model.pth
```

I created model.tar.gz with `.` as the archive root, while inside the directory containing the `code` folder. My code in the SageMaker notebook looks like this:

```python
import boto3
import sagemaker
from sagemaker.pytorch import PyTorchModel

session = boto3.Session()
sagemaker_client = session.client('sagemaker')
role = sagemaker.get_execution_role()

# Define the model data location in S3
model_data = 's3://speech2textmodel/model.tar.gz'

# Define the model
model1 = PyTorchModel(model_data=model_data,
                      role=role,
                      entry_point='inference.py',
                      framework_version='1.6.0',
                      py_version='py3')

predictor = model1.deploy(instance_type='ml.m5.xlarge', initial_instance_count=1)
```

I got this error:

```bash
UnexpectedStatusException: Error hosting endpoint pytorch-inference-2023-02-06-09-28-21-891: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.
```

This is the error in CloudWatch:

```bash
ERROR - /.sagemaker/ts/models/model.mar already exists.
```
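For reference, one way to produce the archive layout described above is to tar the *contents* of the model directory rather than the directory itself, so that `code/` and `model.pth` sit at the top level of the tarball. A minimal sketch using Python's `tarfile` (all paths here are hypothetical placeholders, not from the issue):

```python
import os
import tarfile
import tempfile

def build_model_archive(model_dir: str, out_path: str) -> None:
    """Package model.tar.gz with the archive root at '.', so that
    'code/inference.py' and 'model.pth' are top-level entries."""
    with tarfile.open(out_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            # arcname strips the parent directory so entries are relative
            tar.add(os.path.join(model_dir, name), arcname=name)

# Demonstrate with a throwaway directory mimicking the documented layout
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "model")
    os.makedirs(os.path.join(src, "code"))
    for rel in ("code/inference.py", "code/requirements.txt", "model.pth"):
        open(os.path.join(src, rel), "w").close()

    archive = os.path.join(tmp, "model.tar.gz")
    build_model_archive(src, archive)

    with tarfile.open(archive) as tar:
        names = sorted(tar.getnames())
    print(names)
```

Listing the tar members before uploading is a quick sanity check: if entries come out prefixed with an extra directory (e.g. `model/code/inference.py`), the container will not find `inference.py` where it expects it.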
KennyTC commented 1 year ago

do you have solution for this? I am facing the same problem

bhattbhuwan13 commented 1 year ago

@KennyTC Nope.

Ruotian-Zhang commented 11 months ago

I am facing the same issue... @KennyTC @bhattbhuwan13 Have you fixed this?

mdmonaco89 commented 11 months ago

I faced the same issue with the same error; the error message itself is not meaningful. In my case, requirements.txt pinned library versions that weren't compatible with the Python version I chose for the container image. I realized this by reading the beginning of the CloudWatch log for that particular deploy. After fixing the requirements, I was able to deploy my PyTorchModel and get the endpoint created and running.
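One way to surface that kind of mismatch early is to log the interpreter version and the installed version of each key package when the inference script is imported, so the first lines of the CloudWatch log show exactly what the container is running. A hedged sketch (the default package list is an assumption; adjust it to whatever requirements.txt pins):

```python
import sys
from importlib import metadata

def environment_report(packages=("torch", "numpy")) -> dict:
    """Collect the Python version and the installed version of each
    package, marking packages that failed to install as 'MISSING'."""
    report = {"python": sys.version.split()[0]}
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = "MISSING"
    return report

# Placed at module level in inference.py, this print lands in the
# endpoint's CloudWatch log before any health-check traffic arrives.
print(environment_report())
```

A `MISSING` entry in the log points directly at a requirements.txt pin that failed to install for the container's Python version, instead of the generic ping-health-check failure.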

evankozliner commented 6 months ago

I was able to resolve this by ensuring the PyTorch image version specified matched my custom requirements.txt and Python version, e.g.

```python
pytorch_model = PyTorchModel(model_data=fname,
                             role=role,
                             entry_point='inference.py',
                             framework_version='2.1.0',
                             py_version='py310')
```

requirements.txt:

```
boto3==1.33.3
botocore==1.33.3
torch==2.0.0
```