Sagemaker Endpoint vanishing without traces

danielcavalli commented 2 years ago

Describe the bug I'm using SageMaker to host a custom ML model deployed to two accounts, homologation (testing) and production. Both endpoints use the same entry point code and were deployed on the same day. The homologation endpoint suddenly disappeared on June 28th, leaving no trace other than its last HealthCheck ping in CloudWatch. I searched the CloudTrail logs and found nothing out of the ordinary: only the CreateEndpoint call for the deployment, and no delete call from anywhere. I assumed it was a one-off bug and redeployed the model on June 29th. On July 3rd the endpoint vanished again in the same way: no delete, update, or rename events in CloudTrail. The only evidence that it was ever running is the CloudWatch logs and the CreateEndpoint entry in CloudTrail.
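For anyone who wants to repeat the CloudTrail search described above, here is a minimal sketch of one way to do it with boto3. It is not the exact query I ran. The helper name `find_endpoint_events` and the endpoint name in the usage note are hypothetical, and the code assumes that SageMaker records the endpoint name under `requestParameters.endpointName` in the CloudTrail event. Note that CloudTrail event history lookups are per region and only cover the last 90 days.

```python
import json

# Event names that create, change, or remove a SageMaker endpoint.
ENDPOINT_EVENTS = {"CreateEndpoint", "UpdateEndpoint", "DeleteEndpoint"}


def find_endpoint_events(cloudtrail, endpoint_name, start, end):
    """Return (time, event name, username) tuples for endpoint_name in [start, end].

    `cloudtrail` is a boto3 CloudTrail client, e.g. boto3.client("cloudtrail").
    """
    paginator = cloudtrail.get_paginator("lookup_events")
    pages = paginator.paginate(
        # lookup_events accepts a single lookup attribute, so filter by
        # event source here and by event name on the client side.
        LookupAttributes=[
            {"AttributeKey": "EventSource", "AttributeValue": "sagemaker.amazonaws.com"}
        ],
        StartTime=start,
        EndTime=end,
    )
    matches = []
    for page in pages:
        for event in page.get("Events", []):
            if event.get("EventName") not in ENDPOINT_EVENTS:
                continue
            # The full record is a JSON string in the CloudTrailEvent field.
            detail = json.loads(event.get("CloudTrailEvent") or "{}")
            params = detail.get("requestParameters") or {}
            # Assumption: the endpoint name is stored as "endpointName".
            if params.get("endpointName") == endpoint_name:
                matches.append(
                    (event["EventTime"], event["EventName"], event.get("Username"))
                )
    return sorted(matches, key=lambda match: match[0])
```

A call such as `find_endpoint_events(boto3.client("cloudtrail"), "my-homolog-endpoint", start, end)`, with `start` and `end` as `datetime` objects around the disappearance, should list every create, update, or delete call for that endpoint. In my case only CreateEndpoint showed up.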

To reproduce I couldn't reproduce the bug deliberately, and I couldn't gather any evidence pointing to its cause.

Expected behavior The endpoint should not vanish.

Screenshots or logs

System information

SageMaker Python SDK version: 2.72.1
Framework name (eg. PyTorch) or algorithm (eg. KMeans): SKLearn
Framework version: 0.23
Python version: 3.6
CPU or GPU: CPU (ml.t2.medium)
Custom Docker image (Y/N): N

Additional context Some more information:

The homologation (testing) endpoint was called infrequently, with long gaps between calls; requests only happened when we were testing something.
It was deployed from an AWS SageMaker notebook using the SageMaker Python SDK.
The first time, the gap between a request and the endpoint going offline was 8 hours; the second time it was 48 hours.
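Because the only trace so far is the last HealthCheck ping, a periodic probe could help pin down when the endpoint goes away and in what state it was beforehand. The following is a rough sketch under stated assumptions, not something that was running at the time. The helper name `endpoint_snapshot` is hypothetical, and the code assumes that a missing endpoint surfaces from `describe_endpoint` as a `ClientError` with the code `ValidationException`.

```python
def endpoint_snapshot(sm_client, endpoint_name):
    """Return a small dict describing the endpoint, or mark it as missing.

    `sm_client` is a boto3 SageMaker client, e.g. boto3.client("sagemaker").
    """
    try:
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
    except sm_client.exceptions.ClientError as err:
        error = err.response.get("Error", {})
        # Assumption: a missing endpoint is reported as a ValidationException.
        if error.get("Code") == "ValidationException":
            return {
                "name": endpoint_name,
                "exists": False,
                "detail": error.get("Message"),
            }
        # Anything else (throttling, permissions) is a different problem.
        raise
    return {
        "name": endpoint_name,
        "exists": True,
        "status": desc.get("EndpointStatus"),
        "last_modified": desc.get("LastModifiedTime"),
        # Only present when the endpoint is in a failed state.
        "failure_reason": desc.get("FailureReason"),
    }
```

Logging this snapshot on a schedule (for example from a cron job or a scheduled Lambda function) would narrow the window in which the endpoint disappears and show whether it passed through a `Failed` status first.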

danielcavalli commented 2 years ago

I'd love some help with this, and I'm happy to help fix it if that turns out to be needed!

danielcavalli commented 2 years ago

up

aws / sagemaker-python-sdk

Sagemaker Endpoint vanishing without traces #3252