PolicyStat / hubot


Build statuses sometimes lost when hubot crashes #3

Open stemchan opened 11 years ago

stemchan commented 11 years ago

Sometimes, Hubot crashes at just the right time so that when it restarts, it receives the status update from Jenkins before Redis has started. That build isn't recorded, so we end up with GitHub issue statuses stuck as pending even after the build has completed. We should find a way to ensure Redis is running before handling the posts from Jenkins.

winhamwr commented 11 years ago

One option would be to use the SNS plugin instead of the Jenkins notification plugin. Amazon SNS supports retrying with custom Delivery Policies. We could set it to retry every 20 seconds for the next 5 minutes, for example.

That would allow us to run a check at the notification endpoint and, if Redis isn't accessible, return a 503 so that SNS will retry shortly.
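A minimal sketch of that endpoint check, to make the idea concrete. Everything named here (`makeNotificationHandler`, `isStoreReady`, `recordBuild`) is hypothetical and not taken from the existing script; the only assumption about the environment is an Express-style `res` object, which is what Hubot's `robot.router` hands to route handlers.

```javascript
// Hypothetical helper: none of these names come from the existing script.
// `isStoreReady` should return true only once the Redis-backed brain is usable.
// `recordBuild` stands in for whatever currently stores the build status.
function makeNotificationHandler({ isStoreReady, recordBuild }) {
  return function handleNotification(req, res) {
    if (!isStoreReady()) {
      // Ask the sender to retry later instead of silently dropping the build.
      res.status(503).send("store not ready, retry later");
      return;
    }
    recordBuild(req.body);
    res.status(200).send("ok");
  };
}
```

This could be mounted with something like `robot.router.post(path, makeNotificationHandler({...}))`. How `isStoreReady` decides is left open; one possibility (an assumption, not something verified against our setup) is a flag that flips once the brain reports that its data has loaded.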

winhamwr commented 10 years ago

While hacking on #6, I think I figured out an easier way to get to a 90% solution to this. The TODO is here.

# TODO: We could recover here by:
# 1. Crawling to the parent job and then getting/storing the root job data
# 2. After that, kicking off something to crawl the downstream jobs and
# actually poll for their statuses, just to catch up with anything we might
# have missed. That ability would also go 80% of the way towards building
# something that we could run on start to handle any missed notifications
# while hubot was down

Basically, in handleFinishedDownstreamJob, if we encounter a notification for a job that we have no stored root data for, we hit the API for that job to get to its upstream (the root job). Then we gather all of the normal info and fire off async requests to look for any downstream job notifications we missed.
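Roughly, the recovery path could look like the sketch below. The `api` and `store` arguments are hypothetical stand-ins (not the real Jenkins client or the Hubot brain), and `fetchRoot`, `fetchDownstream`, and `fetchStatus` are made-up method names used only to show the shape of the crawl.

```javascript
// Sketch only: `api` and `store` are hypothetical stand-ins, not the real
// Jenkins client or Hubot brain used by the script.
async function recoverMissedJob(job, api, store) {
  // 1. Walk up to the root job and store its data if we have not seen it.
  const root = await api.fetchRoot(job);
  if (!store.has(root.id)) {
    store.set(root.id, { root, statuses: {} });
  }
  const entry = store.get(root.id);

  // 2. Crawl breadth-first from the root and poll each job's status,
  //    so anything we missed while down gets filled in.
  const queue = [root];
  const seen = new Set();
  while (queue.length > 0) {
    const current = queue.shift();
    if (seen.has(current.id)) continue;
    seen.add(current.id);
    entry.statuses[current.id] = await api.fetchStatus(current);
    queue.push(...(await api.fetchDownstream(current)));
  }
  return entry;
}
```

The `seen` set is just there so a job reachable through two upstream paths is only polled once.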

The second bit would be to add a robot.on("running", ...) call using the on method. Its job would be to look in the brain for any jobs that haven't yet completed, and then crawl all of their downstream jobs. Adding that might mean we need to do a little better at cleaning up after ourselves if we manually cancel things, but I can't actually think of a situation where the combination of those two additions would allow us to miss jobs.
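A sketch of that startup hook, under some loud assumptions: the brain layout (`{ rootId: { completed: ... } }`), `getRootJobs`, and `crawl` are all hypothetical, and `crawl` stands for whatever function ends up polling the downstream jobs.

```javascript
// Hypothetical brain layout: { [rootId]: { completed: boolean, ... } }.
function findUnfinishedRoots(rootJobs) {
  return Object.keys(rootJobs).filter((id) => !rootJobs[id].completed);
}

// Hypothetical wiring: `getRootJobs` reads the stored root jobs and `crawl`
// polls the downstream jobs of one root. Neither exists in the script today.
function registerStartupCatchUp(robot, getRootJobs, crawl) {
  robot.on("running", () => {
    for (const id of findUnfinishedRoots(getRootJobs())) {
      crawl(id);
    }
  });
}
```

Manually cancelled jobs would keep showing up as unfinished here, which is where the cleanup concern comes in.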