aria-jpl / ops_scripts2


Auto-scaling groups not scaling down with empty queues #2

Open riverma opened 3 years ago

riverma commented 3 years ago

Describe the bug

This behavior has been observed with the Rise_Limonadi and Volcano_Lundgren ASGs over the last two weeks. (Those are the expensive ones.) The standard product enumerator ASG is affected as well.

To Reproduce

Steps to reproduce the behavior:

  1. Go to the Auto Scaling Groups in the EC2 section of the AWS Management Console
  2. Wait for all jobs to finish (viewing from RabbitMQ)
  3. Note that the running instances do not terminate once the jobs complete.

Expected behavior

Auto scaling groups terminate their instances when the queues are complete/empty.
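
As a rough illustration of the check described in the steps above, the sketch below flags ASGs that still have in-service instances while their job queue is empty. It is not part of ops_scripts2: the function name `find_idle_asgs`, the ASG-to-queue mapping, and the queue names are hypothetical. The ASG dictionaries are assumed to have the shape returned by boto3's `describe_auto_scaling_groups`, and the queue depths are assumed to have been read separately from RabbitMQ.

```python
from typing import Dict, List


def find_idle_asgs(
    asgs: List[dict],
    queue_depths: Dict[str, int],
    asg_to_queue: Dict[str, str],
) -> List[str]:
    """Return names of ASGs with InService instances whose queue is empty.

    asgs: entries shaped like the "AutoScalingGroups" list from boto3's
        describe_auto_scaling_groups (assumed input, not fetched here).
    queue_depths: queue name -> number of messages (assumed to come from RabbitMQ).
    asg_to_queue: hypothetical mapping from ASG name to the queue it serves.
    """
    idle = []
    for asg in asgs:
        name = asg["AutoScalingGroupName"]
        queue = asg_to_queue.get(name)
        if queue is None or queue not in queue_depths:
            # No known queue or no depth reading: cannot judge this ASG.
            continue
        in_service = [
            inst for inst in asg.get("Instances", [])
            if inst.get("LifecycleState") == "InService"
        ]
        if in_service and queue_depths[queue] == 0:
            idle.append(name)
    return idle


if __name__ == "__main__":
    # Made-up example data; instance IDs and queue names are placeholders.
    example_asgs = [
        {"AutoScalingGroupName": "Rise_Limonadi",
         "Instances": [{"InstanceId": "i-0001", "LifecycleState": "InService"}]},
        {"AutoScalingGroupName": "Volcano_Lundgren", "Instances": []},
    ]
    example_depths = {"queue-a": 0, "queue-b": 0}
    example_mapping = {"Rise_Limonadi": "queue-a", "Volcano_Lundgren": "queue-b"}
    print(find_idle_asgs(example_asgs, example_depths, example_mapping))
```

Any ASG reported by such a check is a candidate for the log inspection discussed in this issue; the check itself says nothing about why the instances stayed up.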

riverma commented 3 years ago

@jjmcnelis please fill out the "steps to reproduce" to the best of your ability.

Questions for @marjo-luc @hookhua @pymonger:

riverma commented 3 years ago

Suggestions of logs to look further into (from @marjo-luc and @pymonger):

riverma commented 3 years ago

Just noting this down - maybe another ticket - but @jjmcnelis we should have a checklist procedure for modification of any ASG configuration - including follow-up checks to ensure proper scale down if machinery fails.

Potentially document new items within: https://aria.atlassian.net/wiki/spaces/ARIA/pages/230064129/Operations+Checklist

riverma commented 3 years ago

@marjo-luc - can you comment if this script will assist here? https://github.com/aria-jpl/ops_scripts2/blob/master/aws_scripts/zero_all_asgs.py

marjo-luc commented 3 years ago

The 'zero_all_asgs' script will only address the symptoms of the issue and not the root cause. I do not recommend using it. It would be best to monitor the hysds logs referred to above when this issue comes up again. I'd also recommend looking at the EC2 activity on the AWS console.

riverma commented 3 years ago

@jjmcnelis FYI for the above.

riverma commented 3 years ago

Suggestion from @marjo-luc for @jjmcnelis to scope and share harikiri logs here.

Some context:

Next steps:

pymonger commented 3 years ago

Are there any instances still running at the moment? We can do a quick tagup to investigate.

riverma commented 3 years ago

@jjmcnelis - @pymonger asked above whether any problematic EC2 instances are still running. Could you comment? Thanks!