Describe the bug
I'm currently using Sagemaker to host a custom ML model deployed to two accounts, homolog, and production. Both endpoints have the same entry point code and were deployed the same day. The homologation version suddenly disappeared on June 28th, leaving no traces besides the last HealthCheck ping on CloudWatch. After searching CloudTrail logs to see what could have happened, there was nothing out of the ordinary: deployed the endpoint and that was it. No delete command coming from anywhere.
I thought of it as a bug and promptly redeployed the model, on June 29th, assuming it wouldn't happen again. The issue is that on July 3rd the endpoint vanished without traces again. Same thing, no delete, no update, no renaming of anything on CloudTrail, and the only proof that it was ever on running were the CloudWatch logs and the CreateEndpoint entry on CloudTrail.
To reproduce
Couldn't reproduce the bug willingly. I couldn't gather any evidence that could lead me to the cause of the problem.
Expected behavior
For it not to vanish
Screenshots or logs
System information
A description of your system. Please provide:
SageMaker Python SDK version: 2.72.1
Framework name (eg. PyTorch) or algorithm (eg. KMeans): SKLearn
Framework version: 0.23
Python version: 3.6
CPU or GPU: ml.t2.medium
Custom Docker image (Y/N): N
Additional context
Some more information:
The homologation(testing) endpoint wasn't called all that often and had big gaps between calls, they would only happen when we were testing something.
It was deployed through an AWS Sagemaker Notebook using the Sagemaker SDK for Python
First time the delta between a request and going offline was 8 hours, and the second time the delta was 48h.
Describe the bug I'm currently using Sagemaker to host a custom ML model deployed to two accounts, homolog, and production. Both endpoints have the same entry point code and were deployed the same day. The homologation version suddenly disappeared on June 28th, leaving no traces besides the last HealthCheck ping on CloudWatch. After searching CloudTrail logs to see what could have happened, there was nothing out of the ordinary: deployed the endpoint and that was it. No delete command coming from anywhere. I thought of it as a bug and promptly redeployed the model, on June 29th, assuming it wouldn't happen again. The issue is that on July 3rd the endpoint vanished without traces again. Same thing, no delete, no update, no renaming of anything on CloudTrail, and the only proof that it was ever on running were the CloudWatch logs and the CreateEndpoint entry on CloudTrail.
To reproduce Couldn't reproduce the bug willingly. I couldn't gather any evidence that could lead me to the cause of the problem.
Expected behavior For it not to vanish
Screenshots or logs
System information A description of your system. Please provide:
Additional context Some more information: