aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Please Help!] Issue with deploying a custom model pipeline #1072

Open Superhzf opened 4 years ago

Superhzf commented 4 years ago

Hi, I'm trying to deploy a custom model pipeline using sagemaker.pipeline.PipelineModel. The pipeline model has two parts: raw data preprocessing and inference. I use the built-in sklearn container for the preprocessing and a custom LightGBM container for the model. Below is the sample code:

raw_data_preprocess_inferencee_model = sklearn_preprocessor.create_model()
lightgbm_model = clf.create_model()
model_name = 'inference-pipeline-' + timestamp_prefix
endpoint_name = 'inference-pipeline-ep-' + timestamp_prefix

model = PipelineModel(
    name=model_name,
    role=role,
    sagemaker_session=sagemaker_session,
    models=[
        raw_data_preprocess_inferencee_model,
        lightgbm_model])

predictor = model.deploy(initial_instance_count=1,
                         instance_type='ml.m4.4xlarge',
                         endpoint_name=endpoint_name)

The LightGBM container was created following this notebook: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/scikit_bring_your_own

The preprocessing container was created following this blog post: https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/
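For reference: SageMaker marks a container healthy only if it answers GET /ping during deployment, which is exactly the check failing in the error below. A minimal sketch of such a handler, assuming the Flask-based predictor.py layout from the scikit_bring_your_own notebook; the readiness check is a placeholder:

import flask

app = flask.Flask(__name__)

@app.route('/ping', methods=['GET'])
def ping():
    # SageMaker polls this route while the endpoint starts; anything other
    # than a 200 response on the expected port fails the health check.
    healthy = True  # placeholder: e.g. verify the model artifact loaded
    return flask.Response(response='\n',
                          status=200 if healthy else 404,
                          mimetype='application/json')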

Error message:

Error hosting endpoint inference-pipeline-ep-2020-03-02-15-52-08: Failed. Reason: The container-2 for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

What I did to figure out the problem:

1. The CloudWatch log of container-2 looks fine:

17:00:21 Starting the inference server with 4 workers.
17:00:21 [2020-03-02 17:00:18 +0000] [17] [INFO] Starting gunicorn 20.0.0
17:00:21 [2020-03-02 17:00:18 +0000] [17] [INFO] Listening at: unix:/tmp/gunicorn.sock (17)
17:00:21 [2020-03-02 17:00:18 +0000] [17] [INFO] Using worker: gevent
17:00:21 [2020-03-02 17:00:18 +0000] [22] [INFO] Booting worker with pid: 22
17:00:21 [2020-03-02 17:00:18 +0000] [23] [INFO] Booting worker with pid: 23
17:00:21 [2020-03-02 17:00:18 +0000] [24] [INFO] Booting worker with pid: 24
17:00:21 [2020-03-02 17:00:18 +0000] [25] [INFO] Booting worker with pid: 25

2. I tried an instance with more memory (64 GB), which gave me the same error.

Please let me know what else you need from me to figure out the problem.


Update: I can deploy raw_data_preprocess_inferencee_model and lightgbm_model to two separate endpoints without problems.

leninkumar-sv-tiger commented 4 years ago

Has this issue been resolved? I am also facing the exact same issue.

leninkumar-sv-tiger commented 4 years ago

@Superhzf were you able to resolve this issue?

Superhzf commented 4 years ago

@leninkumar-sv-tiger unfortunately, no.

alepmaros commented 3 years ago

I've run into a similar health-check problem with a custom container, and I solved it with the following changes.

When running an inference pipeline, SageMaker requires your server to listen on a port other than the default 8080; the port to bind is passed to the container in the SAGEMAKER_BIND_TO_PORT environment variable.

So you will need to do something like:

import os

sm_bind_to_port = os.environ.get('SAGEMAKER_BIND_TO_PORT', '8080')

and then, when you start your server app, you need to bind to that port, for instance:

gunicorn -b '0.0.0.0:{sm_bind_to_port}' ...

I'm not super familiar with how your container handles the ports, but at least in my case, that's what fixed it.
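For completeness, here is a minimal sketch of a serve entry point with this fix applied, assuming the gunicorn-based layout of scikit_bring_your_own; 'wsgi:app' and the worker count are placeholders, not taken from the notebook:

import os
import subprocess

# SageMaker injects SAGEMAKER_BIND_TO_PORT for containers running inside an
# inference pipeline; a standalone endpoint falls back to the default 8080.
sm_bind_to_port = os.environ.get('SAGEMAKER_BIND_TO_PORT', '8080')

# Bind gunicorn to that port instead of hard-coding 8080.
subprocess.check_call([
    'gunicorn',
    '--bind', f'0.0.0.0:{sm_bind_to_port}',
    '--workers', '4',
    'wsgi:app',  # placeholder WSGI entry point
])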