Background
I am trying to do a single-model batch transform in SageMaker to get predictions from a pre-trained model (I did not train the model on SageMaker). My end goal is to be able to run just a bit of Python code to start a batch transform job and grab the results from S3 when it's done.
import boto3
sagemaker_client = boto3.client("sagemaker")
sagemaker_client.create_transform_job(...)
# occasionally poll the job status
sagemaker_client.describe_transform_job(...)
# fetch results from S3 once the job is finished
s3_client = boto3.client("s3")
...
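To make the pseudocode a bit more concrete, here is a sketch of assembling the create_transform_job arguments and polling the job until it reaches a terminal status. The job name, model name, S3 URIs, and instance type below are placeholders, not values from this question, and the model is assumed to already be registered in SageMaker (e.g. via create_model):

```python
import time


def build_transform_job_request(job_name, model_name, input_s3_uri, output_s3_uri):
    """Assemble keyword arguments for sagemaker_client.create_transform_job(**request).

    All names here are hypothetical placeholders.
    """
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,  # must match a model already created in SageMaker
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "application/json",
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {"InstanceType": "ml.p2.xlarge", "InstanceCount": 1},
    }


def wait_for_transform_job(sagemaker_client, job_name, poll_seconds=30):
    """Poll describe_transform_job until the job reaches a terminal status."""
    while True:
        status = sagemaker_client.describe_transform_job(TransformJobName=job_name)[
            "TransformJobStatus"
        ]
        if status in ("Completed", "Failed", "Stopped"):
            return status
        time.sleep(poll_seconds)
```

With boto3, the flow would then be sagemaker_client.create_transform_job(**build_transform_job_request(...)) followed by wait_for_transform_job(sagemaker_client, job_name).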
I can successfully get the results I need using Transformer.transform() in a SageMaker notebook instance (see the appendix below for code snippets), but in my project I do not want to depend on the SageMaker Python SDK. Instead, I'd rather use boto3 like in the pseudocode above.
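On the fetch-results step: batch transform writes one object per input file under the output prefix, with ".out" appended to the input object's name. A sketch of collecting those outputs with a boto3 S3 client (bucket and prefix are placeholders):

```python
import os


def download_transform_outputs(s3_client, bucket, prefix, dest_dir):
    """Download every .out object under prefix into dest_dir; return local paths."""
    paginator = s3_client.get_paginator("list_objects_v2")
    local_paths = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".out"):
                continue  # skip anything that is not a transform output
            local_path = os.path.join(dest_dir, os.path.basename(key))
            s3_client.download_file(bucket, key, local_path)
            local_paths.append(local_path)
    return local_paths
```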
The issue
I referenced this example notebook to try to extend a PyTorch inference container (see the appendix below for the Dockerfile I am using), but I can't get the same results that I get when I use the SageMaker Python SDK in a notebook instance. Instead I get this error:
Backend worker process died.
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 182, in <module>
worker.run_server()
File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 154, in run_server
self.handle_connection(cl_socket)
File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 116, in handle_connection
service, result, code = self.load_model(msg)
File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 89, in load_model
service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)
File "/opt/conda/lib/python3.6/site-packages/ts/model_loader.py", line 110, in load
initialize_fn(service.context)
File "/home/model-server/tmp/models/d00cc5c716dc4e4582250bd89915b99b/handler_service.py", line 51, in initialize
super().initialize(context)
File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py", line 66, in initialize
self._service.validate_and_initialize(model_dir=model_dir)
File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 158, in validate_and_initialize
self._model = self._model_fn(model_dir)
File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py", line 55, in default_model_fn
NotImplementedError:
Please provide a model_fn implementation.
See documentation for model_fn at https://github.com/aws/sagemaker-python-sdk
The problem seems to be that when the inference toolkit tries to import a customized inference.py script, it can't find it, presumably because /opt/ml/model/code is not found in sys.path:
https://github.com/aws/sagemaker-inference-toolkit/blob/cb9e793a79ef4dbc165b0cc48d3c6202916cea33/src/sagemaker_inference/transformer.py#L169
If I understand the code correctly, then in the snippet linked below (which runs before the one linked above), we are attempting to add code_dir to sys.path, but this won't affect the current runtime:
https://github.com/aws/sagemaker-inference-toolkit/blob/cb9e793a79ef4dbc165b0cc48d3c6202916cea33/src/sagemaker_inference/default_handler_service.py#L59-L64
I wonder if it should be like this instead:
import sys
from sagemaker_inference.environment import code_dir
...
# add model_dir/code to the Python path
if code_dir not in sys.path:
    sys.path.append(code_dir)
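For context, the NotImplementedError above is what the toolkit raises when no model_fn is found in the user script. A minimal sketch of the handler functions the PyTorch serving container looks for in inference.py; the model file name (model.pt), TorchScript format, and JSON content type are assumptions, not details from this question:

```python
import json
import os

import torch


def model_fn(model_dir):
    """Called once at worker startup; model_dir is where model.tar.gz was extracted."""
    # Assumes the archive contains a TorchScript file named model.pt.
    model = torch.jit.load(os.path.join(model_dir, "model.pt"), map_location="cpu")
    model.eval()
    return model


def input_fn(request_body, content_type):
    """Deserialize one transform record into a tensor."""
    if content_type == "application/json":
        return torch.tensor(json.loads(request_body))
    raise ValueError("Unsupported content type: " + str(content_type))


def predict_fn(data, model):
    """Run the loaded model on the deserialized input."""
    with torch.no_grad():
        return model(data)


def output_fn(prediction, accept):
    """Serialize the prediction for the transform output file."""
    return json.dumps(prediction.tolist())
```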
Appendix
Notebook cells containing code I was able to run successfully
Here's what I can get running in a SageMaker notebook instance (ml.p2.xlarge). The last cell takes about 5 minutes to run.
from sagemaker import get_execution_role
from sagemaker.pytorch.model import PyTorchModel
# fill out proper values here
path_to_model = "s3://bucket/path/to/model/model.tar.gz"
repo = "GITHUB_REPO_URL_HERE"
branch = "BRANCH_NAME_HERE"
token = "GITHUB_PAT_HERE"
path_to_code_location = "s3://bucket/path/to/code/location"
github_repo_source_dir = "relative/path/to/entry/point"
path_to_output = "s3://bucket/path/to/output"
path_to_input = "s3://bucket/path/to/input"
pytorch_model = PyTorchModel(
image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.4-gpu-py36", # the latest supported version I could get working
model_data=path_to_model,
git_config={
"repo": repo,
"branch": branch,
"token": token,
},
code_location=path_to_code_location, # must provide this so that a default bucket isn't created
source_dir=github_repo_source_dir,
entry_point="inference.py",
role=get_execution_role(),
py_version="py3",
framework_version="1.4", # must provide this even though we are supplying `image_uri`
)
Dockerfile for extended container
# Tutorial for extending AWS SageMaker PyTorch containers:
# https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb
ARG REGION=us-west-2
# SageMaker PyTorch Image
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-inference:1.8.1-gpu-py36-cu111-ubuntu18.04
ARG CODE_DIR=/opt/ml/model/code
ENV PATH="${CODE_DIR}:${PATH}"
# /opt/ml and all subdirectories are utilized by SageMaker, we use the /code subdirectory to store our user code.
COPY /inference ${CODE_DIR}
# Used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY ${CODE_DIR}
# Used by the SageMaker PyTorch container to determine our program entry point.
# For more information: https://github.com/aws/sagemaker-pytorch-container
ENV SAGEMAKER_PROGRAM inference.py