aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0
10.14k stars 6.78k forks

UnexpectedStatusException: Error hosting endpoint xxxxx: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.. #3062

Open kevalshah90 opened 2 years ago

kevalshah90 commented 2 years ago

Link to the notebook https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/xgboost_bring_your_own_model/xgboost_bring_your_own_model.ipynb

https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/xgboost_script_mode_local_training_and_serving/code/inference.py

I am using the examples from the notebooks above to create and deploy an endpoint to Amazon SageMaker in the cloud. All checks pass locally, but when I attempt to deploy the endpoint to the cloud, I run into the error below.

Describe the bug and Logs

UnexpectedStatusException: Error hosting endpoint sagemaker-xgboost: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

Full traceback from the CloudWatch logs:

    Traceback (most recent call last):
      File "/miniconda3/bin/serve", line 8, in <module>
        sys.exit(serving_entrypoint())
      File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/serving.py", line 128, in serving_entrypoint
        server.start(env.ServingEnv().framework_module)
      File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_server.py", line 86, in start
        _modules.import_module(env.module_dir, env.module_name)
      File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_modules.py", line 253, in import_module
        _files.download_and_extract(uri, _env.code_dir)
      File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_files.py", line 129, in download_and_extract
        s3_download(uri, dst)
      File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_files.py", line 165, in s3_download
        s3.Bucket(bucket).download_file(key, dst)
      File "/miniconda3/lib/python3.6/site-packages/boto3/s3/inject.py", line 246, in bucket_download_file
        ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
      File "/miniconda3/lib/python3.6/site-packages/boto3/s3/inject.py", line 172, in download_file
        extra_args=ExtraArgs, callback=Callback)
      File "/miniconda3/lib/python3.6/site-packages/boto3/s3/transfer.py", line 307, in download_file
        future.result()
      File "/miniconda3/lib/python3.6/site-packages/s3transfer/futures.py", line 106, in result
        return self._coordinator.result()
      File "/miniconda3/lib/python3.6/site-packages/s3transfer/futures.py", line 265, in result
        raise self._exception
      File "/miniconda3/lib/python3.6/site-packages/s3transfer/tasks.py", line 255, in _main
        self._submit(transfer_future=transfer_future, **kwargs)
      File "/miniconda3/lib/python3.6/site-packages/s3transfer/download.py", line 343, in _submit
        **transfer_future.meta.call_args.extra_args
      File "/miniconda3/lib/python3.6/site-packages/botocore/client.py", line 357, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "/miniconda3/lib/python3.6/site-packages/botocore/client.py", line 661, in _make_api_call
        raise error_class(parsed_response, operation_name)
    botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
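The traceback fails inside `download_and_extract`, while the container is downloading the user module (`env.module_dir`) from S3, so the 403 points at S3 access for the endpoint's role rather than at the model code itself. S3 can also answer `HeadObject` with 403 instead of 404 for a key that does not exist when the caller lacks `s3:ListBucket`. Below is a minimal sketch of a read-only policy that could be attached to the execution role. `s3_read_policy` is a hypothetical helper, and the second URI is an assumption: it uses the SDK's default `sagemaker-{region}-{account}` bucket naming with a made-up account ID as a guess at where `inference.py` was uploaded.

```python
import json
from urllib.parse import urlparse


def s3_read_policy(s3_uris):
    """Build an IAM policy document that allows reading the given S3 locations."""
    object_arns = []
    bucket_arns = []
    for uri in s3_uris:
        parsed = urlparse(uri)
        if parsed.scheme != "s3" or not parsed.netloc:
            raise ValueError(f"not an s3:// URI: {uri}")
        bucket = parsed.netloc
        key = parsed.path.lstrip("/")
        # An empty key means "every object in the bucket".
        object_arns.append(f"arn:aws:s3:::{bucket}/{key or '*'}")
        bucket_arn = f"arn:aws:s3:::{bucket}"
        if bucket_arn not in bucket_arns:
            bucket_arns.append(bucket_arn)
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Both GetObject and HeadObject calls are authorized by s3:GetObject.
            {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": object_arns},
            # With ListBucket, a missing key is reported as 404 instead of 403.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": bucket_arns},
        ],
    }


policy = s3_read_policy(
    [
        "s3://ml-model/xgboost_model.tar.gz",  # model archive from this issue
        "s3://sagemaker-us-west-1-111111111111/",  # hypothetical code bucket
    ]
)
print(json.dumps(policy, indent=2))
```

Comparing the printed policy with what the role in the `role` ARN actually allows is one way to narrow down which bucket triggers the `Forbidden` response.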

To reproduce

In my local notebook (on my personal machine, not a SageMaker notebook):

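    import pandas as pd
    import xgboost
    from xgboost import XGBRegressor
    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import train_test_split, RandomizedSearchCV
    from sklearn.preprocessing import OneHotEncoder

    print(xgboost.__version__)
    # Output: 1.0.1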

    # read data
    df = pd.read_csv('') 

    # split df into train and test
    X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:21], df.iloc[:,-1], test_size=0.1)

    # Encode categorical variables

    cat_vars = [List of categorical variables]
    cat_transform = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_vars)], remainder='passthrough')

    encoder = cat_transform.fit(X_train)
    X_train = encoder.transform(X_train)
    X_test = encoder.transform(X_test)

    X_train.shape  # (2000, 100)

    # xgboost regression model
    model = XGBRegressor(objective = 'reg:squarederror')

    # Parameter distributions

    params = { 
              xxxxx: xxx
              ... 
              ...
    }

    # Hyperparameter tuning
    r = RandomizedSearchCV(model, param_distributions=params, n_iter=10, scoring="neg_mean_absolute_error", cv=3, verbose=1, n_jobs=1, return_train_score=True, error_score='raise')

    # Fit model
    r.fit(X_train.toarray(), y_train.values)

    xgbest = r.best_estimator_

# AWS SageMaker Endpoint code
import datetime
import pickle
from time import gmtime, strftime

import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

region = boto3.Session().region_name

role = 'arn:aws:iam::111:role/xxx-sagemaker-role'

bucket = 'ml-model'
prefix = "sagemaker/xxx-xgboost-byo"
bucket_path = "https://s3-{}.amazonaws.com/{}".format('us-west-1', 'ml-model')

client = boto3.client(
    's3',
    aws_access_key_id=xxx,
    aws_secret_access_key=xxx,
)
client.list_objects(Bucket=bucket)

Save the model

# Save the model: either pickle xgbest or use save_model
model_file_name = "xgboost-model"

# using save_model
# xgb_model.save_model(model_file_name)

# using pickle on the fitted estimator
with open(model_file_name, 'wb') as f:
    pickle.dump(xgbest, f)

!tar czvf xgboost_model.tar.gz $model_file_name
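The `!tar` line only works as a notebook shell magic. As a sketch of the same step in plain Python, the standard-library `tarfile` module can build the archive and list its members, which makes it easy to confirm that the model file sits at the archive root. `make_model_archive` is a hypothetical helper, and the demo writes placeholder bytes rather than a real model.

```python
import os
import tarfile
import tempfile


def make_model_archive(model_path, archive_path):
    """Write model_path into a gzip-compressed tar, stored at the archive root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops any directory prefix so the file lands at the root.
        tar.add(model_path, arcname=os.path.basename(model_path))
    # Re-open the archive to report what it actually contains.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demo in a temporary directory with a stand-in for the pickled model.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "xgboost-model")
    with open(model_path, "wb") as f:
        f.write(b"placeholder bytes, not a real model")
    names = make_model_archive(model_path, os.path.join(tmp, "xgboost_model.tar.gz"))
    print(names)  # ['xgboost-model']
```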

Upload to S3

key = 'xgboost_model.tar.gz'

with open('xgboost_model.tar.gz', 'rb') as f:
    client.upload_fileobj(f, bucket, key)

Import model

# Import model into hosting
container = get_image_uri(boto3.Session().region_name, "xgboost", "0.90-2")
print(container)

# Output: xxxxxx.dkr.ecr.us-west-1.amazonaws.com/sagemaker-xgboost:0.90-2-cpu-py3

%%time

model_name = model_file_name + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
model_url = "https://s3-{}.amazonaws.com/{}/{}".format(region, bucket, key)

from sagemaker.xgboost import XGBoost, XGBoostModel
from sagemaker.session import Session
from sagemaker.local import LocalSession

sm_client = boto3.client(
                         "sagemaker",
                         region_name="us-west-1",
                         aws_access_key_id='xxxx',
                         aws_secret_access_key='xxxx'
                        )

# Define session
sagemaker_session = Session(sagemaker_client = sm_client)

models3_uri = "s3://ml-model/xgboost_model.tar.gz"

xgb_inference_model = XGBoostModel(
                                   model_data=models3_uri,
                                   role=role,
                                   entry_point="inference.py",
                                   framework_version="0.90-2",
                                   # Cloud
                                   sagemaker_session = sagemaker_session
                                   # Local
                                   # sagemaker_session = None

)

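#serializer = StringSerializer(content_type="text/csv")
from sagemaker.serializers import CSVSerializer

predictor = xgb_inference_model.deploy(
                                       initial_instance_count = 1,
                                       # Cloud
                                       instance_type="ml.t2.large",
                                       # Local
                                       # instance_type = "local",
                                       # deploy() expects a serializer object, not a content-type string
                                       serializer = CSVSerializer()
)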

if xgb_inference_model.sagemaker_session.local_mode:
    print('Deployed endpoint in local mode')
else:
    print('Deployed endpoint to SageMaker AWS Cloud')

/Applications/Anaconda/anaconda3/lib/python3.9/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
   3354         if status != "InService":
   3355             reason = desc.get("FailureReason", None)
-> 3356             raise exceptions.UnexpectedStatusException(
   3357                 message="Error hosting endpoint {endpoint}: {status}. Reason: {reason}.".format(
   3358                     endpoint=endpoint, status=status, reason=reason

UnexpectedStatusException: Error hosting endpoint sagemaker-xgboost-xxxx: Failed. Reason:  The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
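The exception only says to check CloudWatch, so here is a minimal sketch of pulling the newest container log lines programmatically. SageMaker writes endpoint logs to the `/aws/sagemaker/Endpoints/<endpoint-name>` log group. `recent_endpoint_log_lines` is a hypothetical helper that takes an already-constructed CloudWatch Logs client, so the region and credentials stay outside the function.

```python
def endpoint_log_group(endpoint_name):
    """Return the CloudWatch log group that SageMaker uses for an endpoint."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def recent_endpoint_log_lines(logs_client, endpoint_name, max_lines=50):
    """Collect the newest messages from the endpoint's most recent log stream."""
    group = endpoint_log_group(endpoint_name)
    # Ask for the single stream that received an event most recently.
    streams = logs_client.describe_log_streams(
        logGroupName=group, orderBy="LastEventTime", descending=True, limit=1
    )["logStreams"]
    if not streams:
        return []
    # startFromHead=False returns the latest events in the stream.
    events = logs_client.get_log_events(
        logGroupName=group,
        logStreamName=streams[0]["logStreamName"],
        limit=max_lines,
        startFromHead=False,
    )["events"]
    return [event["message"] for event in events]


# Usage (needs AWS credentials; the endpoint name is a placeholder):
#   import boto3
#   logs = boto3.client("logs", region_name="us-west-1")
#   for line in recent_endpoint_log_lines(logs, "sagemaker-xgboost-xxxx"):
#       print(line)
```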


MB-MuratBayraktar commented 1 year ago

Hi, did you solve this issue? I'm facing the same error.