MSD-LIVE / issues

In Jupyter ECS Spawner, make task termination more robust #37

Closed clansing closed 1 year ago

clansing commented 2 years ago

The Jupyter notebooks (running as Fargate tasks) spawned by our custom ECS spawner are supposed to be terminated automatically after 10 minutes of inactivity or 3 hours of use. Most of the time, this works as expected. However, a couple of container tasks were not fully removed and were left in an inoperable state. I have tested the inactivity use case, and it has terminated successfully every time for me. I'm wondering if the problem occurs with the 3-hour maximum-time use case. If something is running inside the notebook, could this prevent the task from being stopped? Work with Matt to figure out how this happened, so I can make my ECS spawner's 'stop' method more robust.
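One possible direction for a more robust 'stop', as a rough sketch only: request the stop, then poll until ECS actually reports the task as STOPPED, and report a failure otherwise so the caller can alert or retry. This assumes `ecs` is a boto3 ECS client (`boto3.client("ecs")`); the function name, timeout values, and stop reason are hypothetical and not taken from our spawner code.

```python
import time


def stop_task_and_confirm(ecs, cluster, task_arn, timeout=120.0, poll=5.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Ask ECS to stop a task, then poll until it is really STOPPED.

    `ecs` is assumed to be a boto3 ECS client. Returns True once the task
    is STOPPED or is no longer described at all, and False if it is still
    alive when the timeout expires.
    """
    ecs.stop_task(cluster=cluster, task=task_arn, reason="Stopped by spawner")
    deadline = clock() + timeout
    while True:
        resp = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
        tasks = resp.get("tasks", [])
        # No task in the response means ECS no longer knows about it.
        if not tasks or tasks[0].get("lastStatus") == "STOPPED":
            return True
        if clock() >= deadline:
            # Still not stopped: let the caller log, alert, or retry.
            return False
        sleep(poll)
```

A False return would be the point to raise an alert, since that is the case that could leave an orphaned container running.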

clansing commented 2 years ago

We need to do this ASAP because 140 of Pat's students will start using the UC Ebook this fall, and it will be very expensive if we have several orphaned notebook containers running.

clansing commented 2 years ago

Also, when we fix this bug, make sure to add the CloudWatch metric and alarm we created for the UC Ebook JupyterHub to the CDK template.

clansing commented 2 years ago

See this bug: https://github.com/boto/boto3/issues/842

clansing commented 2 years ago

For now, we decided to hook up Sentry to alert us whenever an exception happens when starting a container. If the container is orphaned, we can just delete it manually. If we find that these errors happen frequently, we can revisit this later and try to better determine whether a MISSING status from the waiter really means the task did not get created. I added a comment in the code.
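If we do revisit this, one option might look like the sketch below: poll `describe_tasks` ourselves and treat a MISSING failure as transient for a short grace period before concluding the task was never created. This assumes `ecs` is a boto3 ECS client and that MISSING can show up briefly for a task that was just started; the function name, return values, and grace period are hypothetical.

```python
import time


def wait_until_running(ecs, cluster, task_arn, missing_grace=30.0,
                       timeout=300.0, poll=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll describe_tasks, tolerating an initial MISSING failure.

    Returns "RUNNING", "STOPPED", "MISSING", or "TIMEOUT".
    """
    start = clock()
    while True:
        resp = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
        tasks = resp.get("tasks", [])
        elapsed = clock() - start
        if tasks:
            status = tasks[0].get("lastStatus")
            if status in ("RUNNING", "STOPPED"):
                return status
        else:
            reasons = {f.get("reason") for f in resp.get("failures", [])}
            # Only trust MISSING once the grace period has passed.
            if "MISSING" in reasons and elapsed >= missing_grace:
                return "MISSING"
        if elapsed >= timeout:
            return "TIMEOUT"
        sleep(poll)
```

Anything other than "RUNNING" could then be sent to Sentry along with the task ARN, so an orphaned container is easier to find and delete.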