aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

LambdaStep in Sagemaker Pipeline times out even though the Lambda function has finished Processing #3236

Closed mohitravi123 closed 2 years ago

mohitravi123 commented 2 years ago

Describe the bug I'm using a LambdaStep in a SageMaker pipeline to run an Athena query and store the results in S3, where they serve as input for later steps. The Lambda function takes around 7 minutes to run. Even after the Lambda function succeeds, the LambdaStep does not succeed; it times out at exactly 10 minutes, which is the maximum run time for a LambdaStep. I can verify that the Lambda function succeeded by checking the CloudWatch logs and the Athena query output in the S3 bucket.
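For context, the Athena wait inside such a Lambda function is often written as a bounded polling loop. The following is only a minimal sketch of that pattern, not the code from this issue: the helper name `wait_for_query`, the 540-second budget, and the "TIMED_OUT" label are assumptions made for illustration. The only AWS call it relies on is Athena's `get_query_execution`, and the client is passed in so the loop can be exercised without AWS access.

```python
import time

# Terminal states reported by Athena's GetQueryExecution API.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(athena, query_execution_id, timeout_s=540, poll_s=5,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll Athena until the query reaches a terminal state or the budget runs out.

    `athena` is any object exposing get_query_execution(QueryExecutionId=...),
    for example boto3.client("athena"). Returns the final state string, or
    "TIMED_OUT" (a label made up for this sketch) if timeout_s elapses first.
    """
    deadline = clock() + timeout_s
    while True:
        response = athena.get_query_execution(QueryExecutionId=query_execution_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            return "TIMED_OUT"
        sleep(poll_s)
```

Keeping the budget below the 10-minute LambdaStep limit lets the function return a clear result of its own instead of being cut off by the step.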

To reproduce Create a Lambda function with the following code. To mimic the Athena query run time, I added a 7-minute sleep.

import time

def lambda_handler(event, context):
    print("starting lambda function")

    # Sleep for 7 minutes (420 s) to mimic the Athena query run time
    time.sleep(420)

    print("lambda function succeeded")

    return {
        "statusCode": 200,
        "state": "SUCCEEDED"
    }
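To try several durations without redeploying, the repro handler can read the delay from the incoming event. This is a sketch under assumptions: the event key `sleep_seconds` is hypothetical and introduced only here, and the 420-second default matches the repro above. With a LambdaStep, the event could be supplied through the step's `inputs` argument.

```python
import time

DEFAULT_SLEEP_SECONDS = 420  # the 7-minute delay used in the original repro


def lambda_handler(event, context):
    # "sleep_seconds" is a hypothetical event key introduced for this sketch.
    sleep_seconds = float((event or {}).get("sleep_seconds", DEFAULT_SLEEP_SECONDS))

    print(f"starting lambda function, sleeping {sleep_seconds} s")
    time.sleep(sleep_seconds)
    print("lambda function succeeded")

    return {"statusCode": 200, "state": "SUCCEEDED"}
```

Calling `lambda_handler({"sleep_seconds": 0}, None)` locally returns the same payload immediately, which is a quick way to check the return shape before testing longer delays in the pipeline.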

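My SageMaker pipeline is defined as follows:

from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import LambdaStep
from sagemaker.workflow.pipeline import Pipeline

# sagemaker_session and role are assumed to be defined earlier in the notebook

# Lambda helper class can be used to create the Lambda function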
func = Lambda(
    function_arn="arn:aws:lambda:{region}:{account_id}:function:lambda-test",
)

# Define LambdaStep
step_deploy_lambda = LambdaStep(
    name="LambdaStep",
    lambda_func=func
)

# Run sagemaker pipeline
pipeline_name = "lambda-step-pipeline-test" 

pipeline = Pipeline(
    name=pipeline_name,
    steps=[step_deploy_lambda],
    sagemaker_session=sagemaker_session
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()

Expected behavior After the Lambda function succeeds, I would expect the LambdaStep to succeed and the pipeline to move on to the next step.

Screenshots or logs In the Output tab of the LambdaStep in SageMaker Studio, I see the following message about a Lambda timeout.

This step failed. For more information, view the logs
ClientError: Lambda Function runtime exceeded. Maximum runtime for the Lambda function is 10 minutes.
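The same failure reason can also be read outside Studio from the pipeline execution's step list. The sketch below assumes the list comes from `execution.list_steps()` in the SDK (or the `PipelineExecutionSteps` field of the ListPipelineExecutionSteps API) and uses only the `StepName`, `StepStatus`, and `FailureReason` fields. The helper name `failed_steps` is made up for illustration.

```python
def failed_steps(steps):
    """Return (step name, failure reason) pairs for steps whose status is Failed."""
    return [
        (step["StepName"], step.get("FailureReason", ""))
        for step in steps
        if step.get("StepStatus") == "Failed"
    ]


# Example with data shaped like the step list for the execution in this issue.
sample_steps = [
    {
        "StepName": "LambdaStep",
        "StepStatus": "Failed",
        "FailureReason": (
            "ClientError: Lambda Function runtime exceeded. "
            "Maximum runtime for the Lambda function is 10 minutes."
        ),
    }
]

for name, reason in failed_steps(sample_steps):
    print(f"{name}: {reason}")
```

In a notebook this would typically be called as `failed_steps(execution.list_steps())` once the execution has finished.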

Additional context The lambda function configuration is:

qidewenwhen commented 2 years ago

Hi @mohitravi123, thanks for reaching out, and thanks for the helpful code snippets and information! We've pushed a fix for this issue to our service backend; it will take roughly 2 weeks to deploy. We will let you know once the fix takes effect.

mohitravi123 commented 2 years ago

Thanks @qidewenwhen. Once the fix is deployed, do I have to upgrade my Sagemaker version to the latest one, or will it be backward compatible with the version I'm using (2.86.2)?

qidewenwhen commented 2 years ago

No, you don't need to upgrade the Sagemaker version. The fix should work for v2.86.2.

rohangujarathi commented 2 years ago

The fix has been deployed successfully. Please feel free to reach out with any further questions. Thanks.