aws-samples / sagemaker-studio-auto-shutdown-extension

Apache License 2.0

Constantly shutting down T3 instances #11

Open keepler-cesarsilgo opened 3 years ago

keepler-cesarsilgo commented 3 years ago

We are running the extension successfully except on T3 instances. When we try to spawn a new t3.medium instance from SageMaker Studio, it is always stopped right after it becomes available.

We have been able to replicate this behaviour in all our User Profiles and environments with and without the "Keep Terminals" option enabled.

Could this issue be related to the burst-capacity architecture of T3 instances?

Thanks & Regards.

garzoand commented 3 years ago

Thank you for reporting this issue. We will look into it ASAP.

garzoand commented 3 years ago

@keepler-cesarsilgo I tried to reproduce the issue by creating a new notebook on a t3.medium instance, but was not able to. What am I missing? Can you please share a step-by-step guide on how to replicate it? Thank you.

keepler-cesarsilgo commented 3 years ago

@garzoand It works for a few days, but after that it deletes every application running on a T3 when you try to spawn the instance.

I can't tell you exactly how long you have to wait, but we set up the latest version of the plugin on Friday, July 2nd in ten different User Profiles, and it began to fail on Friday, July 9th in all of them. We had the same issue with the previous version, also after a few days. In case it helps, our time zone is CET.

To replicate the bug just create a new Notebook (we are using the Python 3 Data Science Kernel) using a ml.t3.medium instance.

garzoand commented 3 years ago

Can you please send me the Jupyter server logs? If you open a System Terminal, you will find them under the /var/log/app folder. Thank you.
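Before attaching the logs, it may help to pull out the lines that look related to the shutdown. The following is only a rough sketch: the `find_shutdown_lines` helper and its keyword list are hypothetical, and the extension's actual log messages may be worded differently, so adjust the keywords to whatever the files contain.

```python
from pathlib import Path

# Assumed keywords; the extension's real log messages may differ.
KEYWORDS = ("shutdown", "shutting down", "delete", "idle")


def find_shutdown_lines(log_dir="/var/log/app", keywords=KEYWORDS):
    """Return (file name, line number, text) for every log line
    that contains one of the keywords (case-insensitive)."""
    hits = []
    for path in sorted(Path(log_dir).glob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going on odd bytes in a log file.
        with path.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                lowered = line.lower()
                if any(keyword in lowered for keyword in keywords):
                    hits.append((path.name, lineno, line.rstrip()))
    return hits


# Example use from a System Terminal:
#   for name, lineno, text in find_shutdown_lines():
#       print(f"{name}:{lineno}: {text}")
```

An empty result does not prove the extension is innocent; it may simply mean the keywords do not match what is logged.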

keepler-cesarsilgo commented 3 years ago

Sure, let us wait a few days until the extension fails again and I'll send you the logs. Thanks!

keepler-cesarsilgo commented 3 years ago

app_container.log-20210729-1627534861.txt

Hi @garzoand! This is the log file. It apparently shows no errors, but the extension is failing again. We have replicated the issue in another AWS account with a new SageMaker Studio.

rthamman commented 3 years ago

Hi Keepler, sorry for the delay in getting back to you. We have a scaled-down version now. If you are still running into the same issue, please give this a try:

This is a slimmed-down version of the extension that does not need Internet access during installation. Please use this script: https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples/blob/main/scripts/install-autoshutdown-server-extension/on-jupyter-server-start.sh.

We have also launched Lifecycle Configuration (LCC) support for SageMaker Studio. You can add the above script to a JupyterServer LCC, which automates the installation and eliminates the manual steps. See the documentation for more information: https://docs.aws.amazon.com/sagemaker/latest/dg/studio-lcc.html.
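For anyone who prefers to register the LCC programmatically, here is a minimal sketch, assuming boto3 is available and the script has already been downloaded locally. The helper names and the default configuration name `install-autoshutdown-extension` are made up for illustration. `create_studio_lifecycle_config` is the SageMaker API call, and it expects the script content base64-encoded.

```python
import base64


def build_lcc_request(script_text, name="install-autoshutdown-extension"):
    """Build the keyword arguments for create_studio_lifecycle_config.

    The config name is a hypothetical default; pick your own.
    """
    # The API expects the script as a base64-encoded string.
    content = base64.b64encode(script_text.encode("utf-8")).decode("ascii")
    return {
        "StudioLifecycleConfigName": name,
        "StudioLifecycleConfigContent": content,
        "StudioLifecycleConfigAppType": "JupyterServer",
    }


def register_lcc(sagemaker_client, script_text,
                 name="install-autoshutdown-extension"):
    """Create the lifecycle configuration and return its ARN."""
    response = sagemaker_client.create_studio_lifecycle_config(
        **build_lcc_request(script_text, name)
    )
    return response["StudioLifecycleConfigArn"]


# Example use (needs boto3 and AWS credentials):
#   import boto3
#   with open("on-jupyter-server-start.sh") as fh:
#       arn = register_lcc(boto3.client("sagemaker"), fh.read())
```

The returned ARN still has to be attached to the Studio domain or to individual user profiles before it takes effect; the documentation linked above describes that step.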

Let us know how it goes.