danilop / SampleBatchProcessing

Sample Implementation of Batch Processing on Amazon Web Services (AWS)
http://danilop.github.io/SampleBatchProcessing

Alarm sounding before message deleted #3

Open Audace opened 7 years ago

Audace commented 7 years ago

I believe this issue is more a lack of understanding on my part than a lack of clarity in the docs. I've configured a batch processing setup similar to this one. Here's what has been happening:

  1. Message is sent to the queue.
  2. Alarm sounds and starts instance.
  3. Get_Jobs.py pulls down the message at t=0.
  4. The message changes from "Messages Available" to "Messages In Flight".
  5. Get_Jobs.py begins processing the message.
  6. At t=2 minutes, the message is still in "Messages In Flight" and is being processed on the EC2 instance, but the alarm sounds and kills the instance.

It looks like the alarm is sounding 2 minutes after the message is no longer available. This would mean any job would need to take less than 2 minutes to complete or it's at risk of being killed by the alarm.

What do I need to change? Ideally, a message is pulled from the queue and added to an instance. That instance isn't killed until two minutes after the message has been deleted, not two minutes after the message is retrieved.

For context, the messages can take anywhere from 4 to 6 hours to process, so I've set the VisibilityTimeout to 8 hours. Does this mean the CloudWatch alarm has to wait as long as the VisibilityTimeout?
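For illustration, one way to keep a scale-in alarm from firing while a job is still running is to alarm on the sum of visible and in-flight messages rather than on visible messages alone. A minimal boto3 sketch using CloudWatch metric math; the queue name, alarm name, and policy ARN below are placeholders, not names from this repo:

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

scale_in_policy_arn = 'arn:aws:autoscaling:...'  # placeholder: the real scale-in policy ARN

# Scale in only when visible + in-flight messages have stayed at zero.
cloudwatch.put_metric_alarm(
    AlarmName='jobs-queue-drained',  # hypothetical alarm name
    EvaluationPeriods=3,
    Threshold=0,
    ComparisonOperator='LessThanOrEqualToThreshold',
    AlarmActions=[scale_in_policy_arn],
    Metrics=[
        {
            'Id': 'backlog',
            'Expression': 'visible + inflight',
            'Label': 'Visible plus in-flight messages',
        },
        {
            'Id': 'visible',
            'ReturnData': False,
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/SQS',
                    'MetricName': 'ApproximateNumberOfMessagesVisible',
                    'Dimensions': [{'Name': 'QueueName', 'Value': 'jobs'}],  # hypothetical queue
                },
                'Period': 60,
                'Stat': 'Sum',
            },
        },
        {
            'Id': 'inflight',
            'ReturnData': False,
            'MetricStat': {
                'Metric': {
                    'Namespace': 'AWS/SQS',
                    'MetricName': 'ApproximateNumberOfMessagesNotVisible',
                    'Dimensions': [{'Name': 'QueueName', 'Value': 'jobs'}],
                },
                'Period': 60,
                'Stat': 'Sum',
            },
        },
    ],
)
```

With an alarm along these lines, a message sitting in "Messages In Flight" for 4 to 6 hours keeps the backlog above zero, so the instance is not killed two minutes after the message is retrieved.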

jens-ids commented 7 years ago

@Audace, while setting the CloudWatch alarm threshold duration to the same value as VisibilityTimeout might work, you may run into a problem if you have multiple jobs running on multiple instances. Specifically, those instances will be subject to the "scale in" rules (see http://docs.aws.amazon.com/autoscaling/latest/userguide/as-instance-termination.html for details), and depending on the order in which the jobs complete, those rules may result in the wrong instance being terminated.
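As an aside, instance scale-in protection is one way to steer those rules: a worker can mark itself protected while it holds a job, so the group terminates an idle instance instead. A minimal boto3 sketch, assuming IMDSv1-style metadata access and a hypothetical group name:

```python
import urllib.request

import boto3

# Discover this instance's ID from the metadata service.
instance_id = urllib.request.urlopen(
    'http://169.254.169.254/latest/meta-data/instance-id', timeout=2
).read().decode()

autoscaling = boto3.client('autoscaling')

def set_scale_in_protection(protected):
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName='batch-workers',  # hypothetical group name
        ProtectedFromScaleIn=protected,
    )

set_scale_in_protection(True)   # call before starting a 4-6 hour job
# ... process the job ...
set_scale_in_protection(False)  # call after deleting the message
```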

I am currently working on a solution to this problem that will skip the "Auto Scaling 'Down' Policy" and perform the "scale in" as the last step in my "user data" script. Something like this:

aws ec2 terminate-instances --instance-ids $(curl -s http://169.254.169.254/latest/meta-data/instance-id)

The downside is that this requires that the instance's IAM role support instance termination, which might be a bit dangerous, but I don't see a way around that. Any thoughts or advice would be appreciated.

The last bit to be worked out is some sort of watchdog to prevent runaway instances. Of course, this will all be a moot point as soon as AWS Batch is ready (https://aws.amazon.com/batch/).

Audace commented 7 years ago

@jens-ids, this is similar to the solution I've recently put in place. It drops any scale-down alarms and puts a shutdown -h now at the end of the user-data scripts (this doesn't require extra privileges, since the boot script runs as root anyway). For the time being, I'll terminate manually each night, although I can probably configure a CloudWatch alarm to clean up instances that have been stopped for more than about a day.

One quick reminder: remove the while loop in get-jobs.py; otherwise, the user-data script never reaches the shutdown command.
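For illustration, the drained-queue exit described above might look something like this, assuming boto3 and a hypothetical queue name:

```python
import boto3

sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName='jobs')  # hypothetical queue name

while True:
    # Long-poll for up to 20 seconds before concluding the queue is empty.
    messages = queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20)
    if not messages:
        break  # queue drained: exit so user-data reaches its shutdown command
    for message in messages:
        # ... process the job here ...
        message.delete()
```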

Audace commented 7 years ago

@jens-ids, one issue I've found: if you shut down or terminate the instance manually, the desired number of instances doesn't change, which leads to another instance being created shortly after the original is terminated. Any idea how to get around this?
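For what it's worth, one possible way around this is to let the Auto Scaling API terminate the instance and decrement the desired capacity in a single call, so no replacement is launched. A minimal sketch, assuming boto3 and IMDSv1-style metadata access:

```python
import urllib.request

import boto3

# Discover this instance's ID from the metadata service.
instance_id = urllib.request.urlopen(
    'http://169.254.169.254/latest/meta-data/instance-id', timeout=2
).read().decode()

# Terminate and decrement DesiredCapacity together, so the group
# does not launch a replacement for this instance.
boto3.client('autoscaling').terminate_instance_in_auto_scaling_group(
    InstanceId=instance_id,
    ShouldDecrementDesiredCapacity=True,
)
```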