aws / sagemaker-tensorflow-training-toolkit

Toolkit for running TensorFlow training scripts on SageMaker. Dockerfiles used for building SageMaker TensorFlow Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Model deployment is failing with the error "The primary container for production variant AllTraffic did not pass the ping health check." #401

Open vishwath96 opened 4 years ago

vishwath96 commented 4 years ago

I'm trying to deploy a custom Word2Vec model that I trained offline as a SageMaker endpoint. I followed the documentation at https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/scikit_bring_your_own to create the Dockerfile and everything else.

I've added the following to the Dockerfile: # ENTRYPOINT ["python3", "/usr/local/bin/predictor.py"]

Looking at the logs, I can see that this code runs and the model loads, but the deployment fails with the error: "The primary container for production variant AllTraffic did not pass the ping health check."
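For context on the ping check: with the scikit_bring_your_own pattern, SageMaker probes GET /ping on port 8080 and the endpoint only goes live once that returns 200 within the health-check window. A minimal Flask predictor.py along those lines (the model path and artifact name here are hypothetical, not from the original post) might look like:

```python
# Minimal predictor.py sketch for a bring-your-own-container endpoint.
# SageMaker probes GET /ping; the endpoint fails with "did not pass the
# ping health check" if this never returns 200.
import os
import pickle

import flask

MODEL_PATH = "/opt/ml/model/word2vec.pkl"  # hypothetical artifact name

app = flask.Flask(__name__)
model = None


def get_model():
    """Load the model artifact once, on first use."""
    global model
    if model is None and os.path.exists(MODEL_PATH):
        with open(MODEL_PATH, "rb") as f:
            model = pickle.load(f)
    return model


@app.route("/ping", methods=["GET"])
def ping():
    # Report healthy only if the model artifact could be loaded.
    status = 200 if get_model() is not None else 404
    return flask.Response(response="\n", status=status, mimetype="application/json")


@app.route("/invocations", methods=["POST"])
def invocations():
    data = flask.request.data.decode("utf-8")
    # ... run inference with get_model() here; this sketch just echoes ...
    return flask.Response(response=data, status=200, mimetype="text/csv")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

If /ping returns non-200 (as it would here when the artifact is missing), the variant is reported as failing the health check even though the container logs look otherwise normal.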

Any help?

ajaykarpur commented 4 years ago

Hi @vishwath96, are you able to share your logs and the full stack trace?

jocelynbaduria commented 2 years ago

Hi, I am having the same error. I am deploying my own dlib model. The CloudWatch log shows this. What does it mean?

2022/06/15 21:08:37 [error] 19#19: *1 js: failed ping{ "error": "Servable not found for request: Latest(persona-id)" }

Kindly help. Thank you.

priyakhokher commented 2 years ago

@ajaykarpur I followed your notebook, which was helpful, but it fails at deployment too. Here's my stack trace; all help will be appreciated, I've been blocked on this for a while now. As for this error, I don't understand how the model directory can be read-only when the .pkl file is dumped to S3 perfectly fine during training. But when I try to deploy it with

from sagemaker.predictor import csv_serializer
predictor = tree.deploy(1, "ml.m4.xlarge", serializer=csv_serializer)

I run into this error.

Starting the training.
Traceback (most recent call last):
  File "/opt/ml/train", line 55, in train
    with open(os.path.join(model_path, 'decision-trees.pkl'), 'wb') as out:
OSError: [Errno 30] Read-only file system: '/opt/ml/model/decision-trees.pkl'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/ml/train", line 72, in <module>
    train()
  File "/opt/ml/train", line 64, in train
    with open(os.path.join(output_path, 'failure'), 'w') as s:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/failure'
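One reading of this traceback: the training script /opt/ml/train is being executed at deployment time, when /opt/ml/model is mounted read-only. In the bring-your-own-container layout, SageMaker starts the image with the argument `train` for training jobs and `serve` for endpoints, and the entrypoint is expected to dispatch on that argument. A hedged sketch of such a dispatcher (the script paths are hypothetical):

```python
import subprocess


def resolve_command(argv):
    """Map the container's argv to the script to run. SageMaker passes
    'train' when running a training job and 'serve' when hosting an
    endpoint; a hard-coded ENTRYPOINT that ignores this argument can end
    up running training code at serve time."""
    mode = argv[1] if len(argv) > 1 else "serve"
    commands = {
        "train": ["python3", "/opt/ml/code/train.py"],   # hypothetical path
        "serve": ["python3", "/opt/ml/code/serve.py"],   # hypothetical path
    }
    if mode not in commands:
        raise SystemExit(f"unknown mode: {mode}")
    return commands[mode]


if __name__ == "__main__":
    import sys
    raise SystemExit(subprocess.call(resolve_command(sys.argv)))
```

If training code opens /opt/ml/model for writing while the container is in serving mode, Errno 30 is exactly the failure you would expect.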

ankitvirla commented 2 years ago

> @ajaykarpur I followed your notebook, which was helpful, but it fails at deployment too. Here's my stack trace; all help will be appreciated, I've been blocked on this for a while now. As for this error, I don't understand how the model directory can be read-only when the .pkl file is dumped to S3 perfectly fine during training. But when I try to deploy it with
>
> from sagemaker.predictor import csv_serializer
> predictor = tree.deploy(1, "ml.m4.xlarge", serializer=csv_serializer)
>
> I run into this error.
>
> Starting the training.
> Traceback (most recent call last):
>   File "/opt/ml/train", line 55, in train
>     with open(os.path.join(model_path, 'decision-trees.pkl'), 'wb') as out:
> OSError: [Errno 30] Read-only file system: '/opt/ml/model/decision-trees.pkl'
>
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/opt/ml/train", line 72, in <module>
>     train()
>   File "/opt/ml/train", line 64, in train
>     with open(os.path.join(output_path, 'failure'), 'w') as s:
>
> FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/failure'

Hi. To resolve this: inside the Docker container, the /opt/ml/output/ directory must exist so that a file named failure can be written there. The second exception occurs because the training failed for some reason, and the script then could not write the failure report.

priyakhokher commented 2 years ago

@birla8319 the error is this statement: OSError: [Errno 30] Read-only file system: '/opt/ml/model/decision-trees.pkl', and this is the puzzle I see in my CloudWatch logs. The model pickle files are dumped to S3 during training, yet I don't see any /opt/ml/output/failure results dumped to S3 either.