Audace opened this issue 7 years ago
@Audace, while setting the CloudWatch alarm threshold duration to the same value as VisibilityTimeout might work, you might run into a problem if you have multiple jobs running on multiple instances. Specifically, those instances will be subject to the "scale in" rules (see http://docs.aws.amazon.com/autoscaling/latest/userguide/as-instance-termination.html for details) and depending on the order of completion of the jobs, those rules may result in the wrong instance being terminated.
I am currently working on a solution to this problem that will skip the "Auto Scaling 'Down' Policy" and perform the "scale in" as the last step in my "user data" script. Something like this:
aws ec2 terminate-instances --instance-ids $(curl -s http://169.254.169.254/latest/meta-data/instance-id)
The downside is that this requires that the instance's IAM role support instance termination, which might be a bit dangerous, but I don't see a way around that. Any thoughts or advice would be appreciated.
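One way to limit the blast radius of that permission is to condition the ec2:TerminateInstances grant on the Auto Scaling group tag, so the instance role can only terminate members of its own group. A sketch, assuming hypothetical role and group names (batch-worker-role, batch-worker-asg):

```shell
# Hypothetical names -- adjust to your setup. Attach an inline policy that
# allows ec2:TerminateInstances only on instances tagged as members of the
# same Auto Scaling group (the aws:autoscaling:groupName tag is applied
# automatically to instances launched by the group).
aws iam put-role-policy \
  --role-name batch-worker-role \
  --policy-name self-terminate-only \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "ec2:TerminateInstances",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ec2:ResourceTag/aws:autoscaling:groupName": "batch-worker-asg"
        }
      }
    }]
  }'
```

This doesn't remove the risk entirely, but a compromised worker can then only kill its own siblings, not arbitrary instances in the account.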
The last bit to be worked out is some sort of watchdog to prevent run-away instances. Of course, this will all be a moot point as soon as AWS Batch is ready (https://aws.amazon.com/batch/).
@jens-ids, this is similar to the solution I've put in place as of late. It drops any scale-down alarms and runs shutdown -h now at the end of the user-data script (this doesn't require extra privileges, since the boot script already runs as root). For the time being, I'll manually terminate each night, although I can probably configure a CloudWatch alarm to clean up instances that have been stopped for more than about a day.
One quick reminder is to remove the while loop in get-jobs.py. Otherwise, the user-data script never gets to the shutdown command.
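The tail of the user-data script described above might look something like this (a sketch; the worker path is hypothetical, and it assumes get-jobs.py has been changed to exit after one batch rather than looping forever):

```shell
#!/bin/bash
# Hypothetical user-data tail for a batch worker.
# get-jobs.py must exit when the queue work is done (no "while" loop),
# otherwise control never reaches the shutdown below.
python /opt/worker/get-jobs.py

# Runs as root in user data, so no extra IAM permissions are needed.
# The instance is stopped, not terminated -- it still needs cleanup later.
shutdown -h now
```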
@jens-ids, one issue that I've found: if you shut down or terminate the instance manually, the Auto Scaling group's desired capacity doesn't change, so a replacement instance is launched shortly after the original is terminated. Any idea how to get around this?
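One way around the replacement-instance problem is to terminate through the Auto Scaling API instead of EC2 directly, since it can decrement the desired capacity in the same call. A sketch, run from inside the instance:

```shell
# Terminate this instance via Auto Scaling and decrement desired capacity
# so the group does not launch a replacement. Requires the instance role
# to allow autoscaling:TerminateInstanceInAutoScalingGroup.
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id "$INSTANCE_ID" \
  --should-decrement-desired-capacity
```

Compared with shutdown -h now, this also does the cleanup: the instance is actually terminated rather than left stopped.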
I believe this issue is more a lack of understanding on my part than a lack of clarity in the docs. I've configured a batch-processing setup similar to this. Here's what has been happening:
It looks like the alarm is firing 2 minutes after the message is no longer visible in the queue. This would mean any job would need to take less than 2 minutes to complete, or it's at risk of being killed by the alarm.
What do I need to change? Ideally, a message is pulled from the queue and picked up by an instance, and that instance isn't killed until two minutes after the message has been deleted, not two minutes after the message is retrieved.
For context, the messages can take anywhere from 4 to 6 hours to process, so I've set the VisibilityTimeout to 8 hours. Does this mean the CloudWatch alarm has to wait the same amount of time as the VisibilityTimeout?
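It shouldn't have to. The likely cause of the early firing is that the alarm only watches ApproximateNumberOfMessagesVisible, which drops to zero as soon as a message is received; messages being processed show up in ApproximateNumberOfMessagesNotVisible instead. One option is a scale-in alarm on the sum of the two, so it only fires when nothing is queued and nothing is in flight. A sketch using CloudWatch metric math (the queue and alarm names are hypothetical):

```shell
# Scale-in alarm that fires only when the queue has no visible AND no
# in-flight messages. Replace my-job-queue with your queue's name.
aws cloudwatch put-metric-alarm \
  --alarm-name my-job-queue-empty \
  --evaluation-periods 2 \
  --comparison-operator LessThanOrEqualToThreshold \
  --threshold 0 \
  --metrics '[
    {"Id": "visible",
     "MetricStat": {"Metric": {"Namespace": "AWS/SQS",
       "MetricName": "ApproximateNumberOfMessagesVisible",
       "Dimensions": [{"Name": "QueueName", "Value": "my-job-queue"}]},
       "Period": 300, "Stat": "Sum"},
     "ReturnData": false},
    {"Id": "inflight",
     "MetricStat": {"Metric": {"Namespace": "AWS/SQS",
       "MetricName": "ApproximateNumberOfMessagesNotVisible",
       "Dimensions": [{"Name": "QueueName", "Value": "my-job-queue"}]},
       "Period": 300, "Stat": "Sum"},
     "ReturnData": false},
    {"Id": "total", "Expression": "visible + inflight", "ReturnData": true}
  ]'
```

With this shape, a 4-to-6-hour job keeps the in-flight count above zero for its whole run, so the alarm waits for actual completion (message deletion) rather than for the VisibilityTimeout.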