Closed suhlig closed 6 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/155697431
The labels on this github issue will be updated when the story is started.
@mattmcneeney
@suhlig Sorry for the delay getting back to you. I’ve spent a couple days investigating this, and I haven’t been able to figure out how this could happen.
The priority of the job does get doubled on each failure, and we should be defensive about not letting that grow indefinitely, but the DeleteOrphanedInstance job has a max_attempts
value of 11, so the priority of that job should be getting capped at 2^(11-1) (1024). After it hits 11 attempts, the job should simply fail. I'm really curious about why this isn't happening for your system. If do see this issue again and are able to grep the worker logs, I'd really like to see what attempt number you're getting up to. Or do you still have access to an environment where this happens? If so, would you be able to download the cloud_controller_worker logs for the cc-worker vms and run the following grep on the combined log file?
grep "Job VCAP::CloudController::Jobs::Services::DeleteOrphanedInstance (id=.*) (queue=cc-generic) FAILED (.* prior attempts) with VCAP::Services::ServiceBrokers::V2" <your_combined_log_file>
(Or you could look inside the delayed_jobs
table in postgres and look at the attempts
column, whichever way is easier for your situation.)
Of course we could just fix this by capping the priority at some reasonable value, but I feel somewhat uncomfortable doing that before actually understanding the scenario where this is possible (maybe whatever is causing max_attempts
to be ignored is a larger problem we need to look at).
You’re using cf-release v270, which corresponds to cloud_controller_ng SHA d2dbae2838ec0aa8d8b26e8918d4036405177716 and it seems like nothing relevant has changed in this area since that commit.
For reference, here are the steps I took:
The only code modification I made was adjusting reschedule_at
in the job so that I didn’t have to wait the full 34 hours.
If you’re able to grep those logs and see how many attempts are being made, or if you have any ideas of things we should be looking at, I’m very interested. At this point, I’m not able to repro the problem.
Thanks for looking into this. Unfortunately we don't have the logs anymore and we have not seen this issue since.
About the limit - could it be that an attempt to restart a failed job via rake
would not enforce the max of 11 attempts?
That's an interesting idea. I'll play around in my test environment and see what happens.
@suhlig @ablease and I tried the idea about running the a rake
task, but there are only tasks for working off a queue. There seem to be no task to re-trigger a failed job. In result, we still have the same 11 max_attempts
enforced on the DeleteOrphanedInstance
job.
The only way we we're able to reproduce this was by manually updating the database entry for that job and increasing the priority
value. Is there any chance for a script or a custom rake
task that is part of your cf deployment which could cause a change of the job priority?
BR, SAPI Team
Unable to reproduce. Will re-open if we see this again. Thanks for looking into this!
Issue
Failed jobs de-prioritise beyond the integer limit of the underlying PostgreSQL DB
Context
We had a few VCAP::CloudController::Jobs::Services::DeleteOrphanedInstance that failed multiple times. It looks as if the de-prioritization that happens after each failure does not check whether it reached the limit of the underlying database column (a PostgreSQL integer in our case):
This happened with cf-release v270.
Steps to Reproduce
Attempt to delete a service-binding managed by a broker that is not available for a long time
Expected result
A permanently failed job, or maybe just an upper bound of the priority
Current result
PG::NumericValueOutOfRange: ERROR: integer out of range