Open riverma opened 3 years ago
@jjmcnelis please fill out the "steps to reproduce" to the best of your ability.
Questions for @marjo-luc @hookhua @pymonger:
Suggestions of logs to look further into (from @marjo-luc and @pymonger):
/export/home/hysdsops/mozart/log
for EC2 logs: /var/log/messages
for "harikiri" logs
Just noting this down - maybe another ticket - but @jjmcnelis we should have a checklist procedure for modification of any ASG configuration - including follow-up checks to ensure proper scale down if machinery fails.
Potentially document new items within: https://aria.atlassian.net/wiki/spaces/ARIA/pages/230064129/Operations+Checklist
@marjo-luc - can you comment if this script will assist here? https://github.com/aria-jpl/ops_scripts2/blob/master/aws_scripts/zero_all_asgs.py
The 'zero_all_asgs' script will only address the symptoms of the issue and not the root cause. I do not recommend using it. It would be best to monitor the HySDS logs referred to above when this issue comes up again. I'd also recommend looking at the EC2 activity on the AWS console.
@jjmcnelis FYI for the above.
Suggestion from @marjo-luc for @jjmcnelis to scope and share harikiri logs here.
Some context:
Next steps:
Are there any instances still running at the moment? We can do a quick tagup to investigate.
@jjmcnelis - @pymonger had a quick question about whether there are any problematic EC2 instances still running above if you could comment, thanks!
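For the "are any instances still running" question, a read-only check along these lines might help. This is only a sketch: it assumes boto3 is available on the ops machine with credentials configured, and the function names `asgs_with_running_instances` and `scan_all_asgs` are made up for illustration (they are not part of ops_scripts2). It only reports groups that still hold instances or a non-zero desired capacity and does not change any ASG setting.

```python
from typing import Any, Dict, List


def asgs_with_running_instances(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return a summary of ASGs that still hold instances or desired capacity.

    `response` is assumed to have the shape of one page returned by the
    boto3 Auto Scaling client's `describe_auto_scaling_groups` call.
    """
    flagged = []
    for group in response.get("AutoScalingGroups", []):
        instance_ids = [inst["InstanceId"] for inst in group.get("Instances", [])]
        desired = group.get("DesiredCapacity", 0)
        # Flag the group if anything is attached or still requested.
        if instance_ids or desired > 0:
            flagged.append(
                {
                    "name": group["AutoScalingGroupName"],
                    "desired": desired,
                    "instances": instance_ids,
                }
            )
    return flagged


def scan_all_asgs(client: Any) -> List[Dict[str, Any]]:
    """Walk every page of ASGs visible to `client` and collect flagged groups.

    `client` is assumed to be a boto3 "autoscaling" client (or any object
    exposing a compatible `get_paginator`).
    """
    paginator = client.get_paginator("describe_auto_scaling_groups")
    flagged = []
    for page in paginator.paginate():
        flagged.extend(asgs_with_running_instances(page))
    return flagged


# Example usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   for entry in scan_all_asgs(boto3.client("autoscaling")):
#       print(entry)
```

Running this after the queues have drained would list the groups worth checking in the logs mentioned above. Since it is read-only, it would not mask the root cause the way zeroing the ASGs would.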
Describe the bug
This behavior has been observed with the Rise_Limonadi and Volcano_Lundgren ASGs over the last two weeks. (Those are the expensive ones.) Also the standard product enumerator ASG.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Auto scaling groups should terminate instances when the queues are complete/empty.