aws / sagemaker-pytorch-inference-toolkit

Toolkit for allowing inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker PyTorch containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Improve error logging when invoking custom handler methods #164

Closed namannandan closed 5 months ago

namannandan commented 5 months ago

Issue #163

Description of changes: Improve debuggability of model-load and inference failures caused by custom handler method implementations. This is done by logging the exception traceback in addition to sending it in the response back to the client. Although the traceback is included in the response body, the client may sometimes fail to load the entire response body, for example:

```
botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from primary and could not load the entire response body. See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-pytorch-serving-**********-**** in account ************ for more information.
```
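The logging pattern described above can be sketched roughly as follows; the function and logger names here are illustrative, not the toolkit's actual identifiers:

```python
import logging
import traceback

logger = logging.getLogger(__name__)

def run_handler_function(func, *args):
    """Illustrative wrapper: log the traceback of a failing custom
    handler method before propagating the error to the caller."""
    try:
        return func(*args)
    except Exception:
        # format_exc() returns the full traceback as one string;
        # splitlines() turns it into a list so the log stays one record
        tb_lines = traceback.format_exc().splitlines()
        logger.error(
            "Handler %s failed. Error traceback: %s", func.__name__, tb_lines
        )
        raise
```

The exception is re-raised after logging, so the existing error-response path back to the client is unchanged.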

Testing: Using a custom handler with an expected error:

```python
.....
.....
def predict_fn(input_data, model_pack):

    print("predict_fn got input data: {}".format(input_data))
    model = model_pack[0]
    tokenizer = model_pack[1]
    mapping_file_path = model_pack[2]

    with open(mapping_file_path) as f:
        mapping = json.load(f)

    # Intentional failure to exercise the error-logging path
    assert False

    inputs = tokenizer.encode_plus(
        input_data,
        max_length=128,
        pad_to_max_length=True,
        add_special_tokens=True,
        return_tensors="pt",
    )
.....
.....
```

On deploying the model and making an inference request, the CloudWatch logs contain the following line:

```
2024-03-15T00:54:26,721 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Transform failed for model: model. Error traceback: ['Traceback (most recent call last):', ' File "/sagemaker-pytorch-inference-toolkit/src/sagemaker_inference/transformer.py", line 150, in transform', ' result = self._run_handler_function(', ' File "/sagemaker-pytorch-inference-toolkit/src/sagemaker_inference/transformer.py", line 284, in _run_handler_function', ' result = func(*argv_context)', ' File "/sagemaker-pytorch-inference-toolkit/src/sagemaker_inference/transformer.py", line 268, in _default_transform_fn', ' prediction = self._run_handler_function(self._predict_fn, *(data, model))', ' File "/sagemaker-pytorch-inference-toolkit/src/sagemaker_inference/transformer.py", line 280, in _run_handler_function', ' result = func(*argv)', ' File "/opt/ml/model/code/custom_inference.py", line 52, in predict_fn', ' assert False', 'AssertionError']
```

Note that the traceback is logged as a list of strings rather than as a multi-line string, because a multi-line string spans several log lines and can cause other log statements to get interleaved with the exception traceback.
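The difference can be demonstrated with the standard `traceback` module; this is a minimal sketch, not code from the toolkit:

```python
import traceback

def failing():
    raise ValueError("boom")

try:
    failing()
except ValueError:
    as_string = traceback.format_exc()   # one multi-line string
    as_list = as_string.splitlines()     # list of single-line strings

# The multi-line form produces several log lines, so messages from
# other workers can interleave between them; the list form is emitted
# as a single log record.
print("multi-line:", "\n" in as_string)
print("list entries contain newlines:", any("\n" in s for s in as_list))
```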