aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.

Custom `model_fn` function not found when extending the PyTorch inference container #86

Open e13h opened 3 years ago

e13h commented 3 years ago

Background

I am trying to run a single-model batch transform job in SageMaker to get predictions from a pre-trained model (I did not train the model on SageMaker). My end goal is to be able to run just a bit of Python code to start a batch transform job and grab the results from S3 when the job is done.

import boto3
client = boto3.client("sagemaker")
client.create_transform_job(...)

# occasionally monitor the job
client.describe_transform_job(...)

# fetch results once job is finished
client = boto3.client("s3")
...

I can successfully get the results I need using `Transformer.transform()` in a SageMaker notebook instance (see the appendix below for code snippets), but in my project I do not want to depend on the SageMaker Python SDK. Instead, I'd rather use boto3 directly, as in the pseudocode above.
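
Concretely, here is a sketch of the boto3-only flow I have in mind (all job, model, and S3 names below are placeholders, and the model is assumed to already be registered via `create_model`):

import boto3

sagemaker = boto3.client("sagemaker")

# Start the batch transform job (all names and paths below are placeholders)
sagemaker.create_transform_job(
    TransformJobName="example-transform-job",
    ModelName="example-model",  # assumed to already exist via create_model
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://bucket/path/to/input",
            }
        },
        "ContentType": "image/png",
    },
    TransformOutput={
        "S3OutputPath": "s3://bucket/path/to/output",
        "Accept": "image/png",
    },
    TransformResources={"InstanceType": "ml.p2.xlarge", "InstanceCount": 1},
)

# Block until the job completes or stops, instead of polling describe_transform_job
waiter = sagemaker.get_waiter("transform_job_completed_or_stopped")
waiter.wait(TransformJobName="example-transform-job")

# Fetch results once the job is finished (batch transform appends ".out" to input keys)
s3 = boto3.client("s3")
s3.download_file("bucket", "path/to/output/image.png.out", "image.png.out")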

The issue

I referenced this example notebook to try to extend a PyTorch inference container (see the appendix below for the Dockerfile I am using), but I can't reproduce the results I get when using the SageMaker Python SDK in a notebook instance. Instead, I get this error:

Backend worker process died.
Traceback (most recent call last):
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 182, in <module>
        worker.run_server()
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 154, in run_server
        self.handle_connection(cl_socket)
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 116, in handle_connection
        service, result, code = self.load_model(msg)
    File "/opt/conda/lib/python3.6/site-packages/ts/model_service_worker.py", line 89, in load_model
        service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)
    File "/opt/conda/lib/python3.6/site-packages/ts/model_loader.py", line 110, in load
        initialize_fn(service.context)
    File "/home/model-server/tmp/models/d00cc5c716dc4e4582250bd89915b99b/handler_service.py", line 51, in initialize
        super().initialize(context)
    File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py", line 66, in initialize
        self._service.validate_and_initialize(model_dir=model_dir)
    File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 158, in validate_and_initialize
        self._model = self._model_fn(model_dir)
    File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py", line 55, in default_model_fn
NotImplementedError:
Please provide a model_fn implementation.
See documentation for model_fn at https://github.com/aws/sagemaker-python-sdk

The problem seems to be that when the inference toolkit tries to import a customized `inference.py` script, it can't find it, presumably because /opt/ml/model/code is not in sys.path. https://github.com/aws/sagemaker-inference-toolkit/blob/cb9e793a79ef4dbc165b0cc48d3c6202916cea33/src/sagemaker_inference/transformer.py#L169
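
A minimal illustration of the failure mode (assuming, as in transformer.py, that the user module is imported by name):

import importlib
import sys

# /opt/ml/model/code/inference.py exists on disk, but its directory is not on
# sys.path, so importing the module by name fails.
assert "/opt/ml/model/code" not in sys.path
importlib.import_module("inference")  # raises ModuleNotFoundError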

If I understand the code correctly, then in the snippet below (which runs before the snippet above), we are attempting to add code_dir to the Python path, but this won't affect the current runtime. https://github.com/aws/sagemaker-inference-toolkit/blob/cb9e793a79ef4dbc165b0cc48d3c6202916cea33/src/sagemaker_inference/default_handler_service.py#L59-L64 I wonder if it should be like this instead:

import sys
from sagemaker_inference.environment import code_dir
...
# add model_dir/code to the running interpreter's module search path
if code_dir not in sys.path:
    sys.path.append(code_dir)
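
Appending to sys.path takes effect immediately in the running interpreter, whereas changes to the PYTHONPATH environment variable are only picked up by interpreters started afterwards, which would explain why the customized `inference.py` is never found with the current code.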

Appendix

Notebook cells containing code I was able to run successfully

Here's what I can get running in a SageMaker notebook instance (ml.p2.xlarge). The last cell takes about 5 minutes to run.

from sagemaker import get_execution_role
from sagemaker.pytorch.model import PyTorchModel

# fill out proper values here
path_to_model = "s3://bucket/path/to/model/model.tar.gz"

repo = "GITHUB_REPO_URL_HERE"
branch = "BRANCH_NAME_HERE"
token = "GITHUB_PAT_HERE"

path_to_code_location = "s3://bucket/path/to/code/location"
github_repo_source_dir = "relative/path/to/entry/point"

path_to_output = "s3://bucket/path/to/output"
path_to_input = "s3://bucket/path/to/input"
pytorch_model = PyTorchModel(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.4-gpu-py36",  # the latest supported version I could get working
    model_data=path_to_model,
    git_config={
        "repo": repo,
        "branch": branch,
        "token": token,
    },
    code_location=path_to_code_location,  # must provide this so that a default bucket isn't created
    source_dir=github_repo_source_dir,
    entry_point="inference.py",
    role=get_execution_role(),
    py_version="py3",
    framework_version="1.4",  # must provide this even though we are supplying `image_uri`
)
transformer = pytorch_model.transformer(
    instance_count=1,
    instance_type="local_gpu",
    strategy="SingleRecord",
    output_path=path_to_output,
    accept="image/png",
)
transformer.transform(
    data=path_to_input,
    data_type="S3Prefix",
    content_type="image/png",
    compression_type=None,
    wait=True,
    logs=True,
)

Dockerfile for extended container

# Tutorial for extending AWS SageMaker PyTorch containers:
# https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb
ARG REGION=us-west-2

# SageMaker PyTorch Image
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/pytorch-inference:1.8.1-gpu-py36-cu111-ubuntu18.04

ARG CODE_DIR=/opt/ml/model/code
ENV PATH="${CODE_DIR}:${PATH}"

# /opt/ml and all of its subdirectories are utilized by SageMaker; we use the code/ subdirectory to store our user code.
COPY /inference ${CODE_DIR}

# Used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY=${CODE_DIR}

# Used by the SageMaker PyTorch container to determine our program entry point.
# For more information: https://github.com/aws/sagemaker-pytorch-container
ENV SAGEMAKER_PROGRAM=inference.py
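
As a possible (untested) workaround, since PYTHONPATH is read when each worker's interpreter starts, exporting it in the image should make `inference.py` importable even without the toolkit fixing sys.path at runtime:

# Untested workaround sketch, not part of the original tutorial: put the user
# code directory on PYTHONPATH so newly spawned worker processes can import
# inference.py directly.
ENV PYTHONPATH="${CODE_DIR}:${PYTHONPATH}"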