This feature would periodically check for long-running or idle clusters, which may indicate that a job is stuck or has failed and that the cluster was not shut down properly. If such a condition is detected, an email would be sent to the account admin and the user who ran the job. The email would include instructions for terminating the cluster through the web application and, if that fails, from the AWS console.
[ ] Define criteria to identify a stuck cluster condition.
[ ] Write code to find CloudFormation stacks that match the Job ID format and meet the stuck-cluster criteria.
[ ] Create an email template with shutdown instructions.
[ ] Write code to send an email to the admin and user who started the job.
[ ] Define a new Lambda function in the CF template, with appropriate permissions in its IAM role.
[ ] Create a CloudWatch Events (EventBridge) rule in the CF template to invoke the Lambda function on a schedule (hourly?).
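As a starting point for the detection and email tasks above, here is a minimal Python sketch. Everything specific in it is an assumption for illustration, not a decision: the `job-<8 hex digits>` stack name pattern (the real Job ID format is not stated in this issue), the 12-hour age threshold, the set of statuses treated as "still running", and the function names `find_stuck_stacks` and `build_notification`. The input dicts use the `StackName`, `StackStatus`, and `CreationTime` fields returned by the CloudFormation DescribeStacks call, so no AWS access is needed to run it.

```python
import re
from datetime import datetime, timedelta, timezone

# ASSUMPTION: hypothetical Job ID pattern; the real format is not defined here.
JOB_STACK_PATTERN = re.compile(r"^job-[0-9a-f]{8}$")

# ASSUMPTION: hypothetical threshold; the real criteria are still to be defined.
MAX_AGE = timedelta(hours=12)

# ASSUMPTION: statuses in which a stack still exists and may hold resources.
ACTIVE_STATUSES = {"CREATE_IN_PROGRESS", "CREATE_COMPLETE", "DELETE_FAILED"}


def find_stuck_stacks(stacks, now, max_age=MAX_AGE):
    """Return stacks that look like job clusters and are older than max_age.

    `stacks` is a list of dicts with StackName, StackStatus and CreationTime
    keys, shaped like entries in a CloudFormation DescribeStacks response.
    """
    stuck = []
    for stack in stacks:
        if not JOB_STACK_PATTERN.match(stack["StackName"]):
            continue  # not a job cluster stack
        if stack["StackStatus"] not in ACTIVE_STATUSES:
            continue  # already deleted or being deleted
        if now - stack["CreationTime"] > max_age:
            stuck.append(stack)
    return stuck


def build_notification(stack, now):
    """Return (subject, body) for the email about one suspect stack."""
    age_hours = (now - stack["CreationTime"]).total_seconds() / 3600
    subject = f"Cluster {stack['StackName']} may be stuck"
    body = (
        f"Stack {stack['StackName']} has been in state "
        f"{stack['StackStatus']} for {age_hours:.1f} hours.\n"
        "1. Try to terminate the cluster from the web application.\n"
        "2. If that fails, delete the stack from the AWS console."
    )
    return subject, body


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [{
        "StackName": "job-0123abcd",
        "StackStatus": "CREATE_COMPLETE",
        "CreationTime": now - timedelta(hours=20),
    }]
    for stack in find_stuck_stacks(sample, now):
        subject, body = build_notification(stack, now)
        print(subject)
        print(body)
```

Keeping the detection logic as pure functions over plain dicts makes it easy to unit test without mocking AWS. The Lambda handler would then only need to fetch the stacks, call these functions, and pass the result to whatever email mechanism is chosen.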
Acceptance Testing
Pass unit tests with good code coverage.
Check that the system detects stuck clusters and sends email notifications.
Time Estimate
16h
Sub-Issues
Consider breaking the new feature down into sub-issues.
[ ] Add a checkbox for each sub-issue here.
Relevant Deadlines
List relevant project deadlines here or state NONE.
Define the Metadata
Assignee
[ ] Select engineer(s) or no engineer required
[ ] Select scientist(s) or no scientist required
Labels
[ ] Select component(s)
[ ] Select priority
Projects and Milestone
[ ] Select Project
[ ] Select Milestone as the next official version or Backlog of Development Ideas
New Feature Checklist
[ ] Complete the issue definition above, including the Time Estimate and Funding source.
[ ] Fork this repository or create a branch of develop.
Branch name: feature_<Issue Number>/<Description>
[ ] Complete the development and test your changes.
[ ] Add/update log messages for easier debugging.
[ ] Add/update tests.
[ ] Add/update documentation.
[ ] Push local changes to GitHub.
[ ] Submit a pull request to merge into develop.
Pull request: feature <Issue Number> <Description>
[ ] Define the pull request metadata, as permissions allow.
Select: Reviewer(s), Project, and Development issue
Select: Milestone as the next official version
[ ] Iterate until the reviewer(s) accept and merge your changes.