Auto-scaling groups not scaling down with empty queues

riverma commented 3 years ago

Describe the bug

From @jjmcnelis: "Some auto scaling groups in AWS for standard product pipeline aren’t scaling down when the queues empty."

This behavior has been observed over the last two weeks with the Rise_Limonadi and Volcano_Lundgren ASGs (those are the expensive ones), and also with the standard product enumerator ASG.

To Reproduce

Steps to reproduce the behavior:

1. Go to the Auto Scaling Groups page in the EC2 section of the AWS Management Console.
2. Wait for all jobs to finish (check the queues in RabbitMQ).
3. Note that the ASGs do not terminate their running instances once the jobs complete.

Expected behavior

Auto Scaling groups terminate their instances once the queues are empty.
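
A minimal sketch of how the check in the steps above could be scripted instead of done by eye in the console. It assumes you already have the queue depths (for example, read from RabbitMQ) and a mapping from ASG name to queue name; both of those, and the function names `stuck_asgs` and `fetch_asgs`, are hypothetical and not part of HySDS. The dictionary keys follow the shape of the boto3 `describe_auto_scaling_groups` response.

```python
from typing import Dict, List


def stuck_asgs(
    asgs: List[dict],
    queue_depths: Dict[str, int],
    asg_to_queue: Dict[str, str],
) -> List[str]:
    """Return names of ASGs that still have in-service instances
    although the queue they serve is known to be empty."""
    stuck = []
    for asg in asgs:
        name = asg["AutoScalingGroupName"]
        queue = asg_to_queue.get(name)
        if queue is None:
            continue  # no known queue for this ASG; skip rather than guess
        in_service = [
            inst
            for inst in asg.get("Instances", [])
            if inst.get("LifecycleState") == "InService"
        ]
        # Only flag the ASG when the queue depth is explicitly reported as 0.
        if queue_depths.get(queue) == 0 and in_service:
            stuck.append(name)
    return stuck


def fetch_asgs() -> List[dict]:
    """Fetch all ASG descriptions from AWS (needs boto3 and credentials)."""
    import boto3  # imported lazily so stuck_asgs() works without AWS access

    client = boto3.client("autoscaling")
    groups: List[dict] = []
    paginator = client.get_paginator("describe_auto_scaling_groups")
    for page in paginator.paginate():
        groups.extend(page["AutoScalingGroups"])
    return groups
```

Calling `stuck_asgs(fetch_asgs(), queue_depths, asg_to_queue)` after the queues drain would list the groups that should have scaled down but did not.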

riverma commented 3 years ago

@jjmcnelis please fill out the "steps to reproduce" to the best of your ability.

Questions for @marjo-luc @hookhua @pymonger:

How long is the default idle time configured for ARIA before empty queues trigger an ASG scale-down?
Which configuration files set this trigger?
Where can we find background information on the tool called "harikiri", which was mentioned as being responsible for auto-scale-down actions?

riverma commented 3 years ago

Suggestions of logs to look further into (from @marjo-luc and @pymonger):

/export/home/hysdsops/mozart/log for EC2 logs
/var/log/messages for "harikiri" logs
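
A small sketch for pulling the relevant lines out of `/var/log/messages` so they can be pasted into this ticket. The pattern is deliberately permissive about the spelling of the tool name; the function names are hypothetical, and the actual log line format is not assumed here beyond containing the tool name.

```python
import re
from typing import Iterable, List

# Matches "harikiri" plus common misspellings ("harakiri", "harikari"),
# case-insensitively.
HARIKIRI_PATTERN = re.compile(r"har[ai]k[ia]ri", re.IGNORECASE)


def harikiri_lines(lines: Iterable[str]) -> List[str]:
    """Keep only the log lines that mention the tool, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if HARIKIRI_PATTERN.search(line)]


def read_harikiri_log(path: str = "/var/log/messages") -> List[str]:
    """Read a syslog-style file and return the matching lines.

    Reading /var/log/messages usually requires elevated permissions.
    """
    with open(path, errors="replace") as fh:
        return harikiri_lines(fh)
```

Running `read_harikiri_log()` on an affected worker instance should give a short excerpt to attach here.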

riverma commented 3 years ago

Just noting this down - maybe another ticket - but @jjmcnelis we should have a checklist procedure for modification of any ASG configuration - including follow-up checks to ensure proper scale down if machinery fails.

Potentially document new items within: https://aria.atlassian.net/wiki/spaces/ARIA/pages/230064129/Operations+Checklist

riverma commented 3 years ago

@marjo-luc - can you comment if this script will assist here? https://github.com/aria-jpl/ops_scripts2/blob/master/aws_scripts/zero_all_asgs.py

marjo-luc commented 3 years ago

The 'zero_all_asgs' script will only address the symptoms of the issue, not the root cause, so I do not recommend using it. It would be best to monitor the HySDS logs referred to above when this issue comes up again. I'd also recommend looking at the EC2 activity on the AWS console.
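
For reference, the same activity history shown in the console can be pulled with boto3, which may be easier to attach to this ticket. This is only a sketch: `fetch_activities` and `summarize_activities` are hypothetical helper names, and the ASG name you pass in must be the exact name shown in AWS, which is not confirmed in this thread.

```python
from typing import List


def summarize_activities(activities: List[dict]) -> List[str]:
    """Turn scaling-activity records into one readable line each."""
    return [
        f'{act.get("StartTime")} | {act.get("StatusCode")} | {act.get("Description")}'
        for act in activities
    ]


def fetch_activities(asg_name: str, max_records: int = 20) -> List[dict]:
    """Fetch recent scaling activities for one ASG (needs boto3 and credentials)."""
    import boto3  # imported lazily so summarize_activities() works offline

    client = boto3.client("autoscaling")
    resp = client.describe_scaling_activities(
        AutoScalingGroupName=asg_name,
        MaxRecords=max_records,
    )
    return resp["Activities"]
```

If no termination activities show up after the queues empty, that would suggest the scale-down was never requested rather than requested and failed.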

riverma commented 3 years ago

@jjmcnelis FYI for the above.

riverma commented 3 years ago

Suggestion from @marjo-luc for @jjmcnelis to scope and share harikiri logs here.

Some context:

Next steps:

If the harikiri logs indicate a HySDS core or adaptation issue, the dev team will reproduce and resolve it.
If the dev team finds that a configuration issue is to blame, they will suggest to ops what to change.

pymonger commented 3 years ago

Are there any instances still running at the moment? We can do a quick tagup to investigate.

riverma commented 3 years ago

@jjmcnelis - @pymonger had a quick question about whether there are any problematic EC2 instances still running above if you could comment, thanks!

aria-jpl / ops_scripts2

Auto-scaling groups not scaling down with empty queues #2