aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0
2.09k stars 1.14k forks source link

Sagemaker Endpoint vanishing without traces #3252

Open danielcavalli opened 2 years ago

danielcavalli commented 2 years ago

Describe the bug I'm currently using Sagemaker to host a custom ML model deployed to two accounts, homolog, and production. Both endpoints have the same entry point code and were deployed the same day. The homologation version suddenly disappeared on June 28th, leaving no traces besides the last HealthCheck ping on CloudWatch. After searching CloudTrail logs to see what could have happened, there was nothing out of the ordinary: deployed the endpoint and that was it. No delete command coming from anywhere. I thought of it as a bug and promptly redeployed the model, on June 29th, assuming it wouldn't happen again. The issue is that on July 3rd the endpoint vanished without traces again. Same thing, no delete, no update, no renaming of anything on CloudTrail, and the only proof that it was ever on running were the CloudWatch logs and the CreateEndpoint entry on CloudTrail.

To reproduce Couldn't reproduce the bug willingly. I couldn't gather any evidence that could lead me to the cause of the problem.

Expected behavior For it not to vanish

Screenshots or logs

System information A description of your system. Please provide:

Additional context Some more information:

danielcavalli commented 2 years ago

Would love the help(and to help fix it if it may be the case) on this!

danielcavalli commented 2 years ago

up