cloudfoundry / cloud_controller_ng

Cloud Foundry Cloud Controller
Apache License 2.0

api_worker priority grows beyond integer limit #1104

Closed suhlig closed 6 years ago

suhlig commented 6 years ago

Issue

Failed jobs are de-prioritised beyond the integer limit of the underlying PostgreSQL DB

Context

We had a few VCAP::CloudController::Jobs::Services::DeleteOrphanedInstance jobs that failed multiple times. It looks as if the de-prioritization that happens after each failure does not check whether it has reached the limit of the underlying database column (a PostgreSQL integer in our case):

PG::NumericValueOutOfRange: ERROR:  integer out of range: 
  UPDATE "delayed_jobs" SET ... "priority" = 2147483648, "handler" = '
    --- 
    !ruby/object:VCAP::CloudController::Jobs::LoggingContextJob
    handler: !ruby/object:VCAP::CloudController::Jobs::TimeoutJob
    handler: !ruby/object:VCAP::CloudController::Jobs::Services::DeleteOrphanedInstance
    name: service-instance-deprovision
    ...
    "last_error" = 'Service instance : Service broker error: {
      "description"=>"Error deprovisioning instance bacdc304-9137-43a2-9dc9-4daefe3b0e0d"
    }
    /var/vcap/data/packages/cloud_controller_ng/.../cloud_controller_ng/lib/services/service_brokers/v2/client.rb:186:in `rescue in deprovision'
    ...
  '

This happened with cf-release v270.
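For reference, PostgreSQL's integer type is a signed 32-bit column, so the failing write is exactly one past the largest representable value:

    # PostgreSQL's integer is signed 32-bit; the largest storable value is:
    2**31 - 1 # => 2147483647
    # The UPDATE above attempts to write 2147483648 (2**31), one past that limit.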

Steps to Reproduce

Attempt to delete a service binding managed by a broker that has been unavailable for a long time

Expected result

A permanently failed job, or at least an upper bound on the priority

Current result

PG::NumericValueOutOfRange: ERROR: integer out of range

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/155697431

The labels on this github issue will be updated when the story is started.

zrob commented 6 years ago

@mattmcneeney

jenspinney commented 6 years ago

@suhlig Sorry for the delay getting back to you. I’ve spent a couple of days investigating this, and I haven’t been able to figure out how this could happen.

The priority of the job does get doubled on each failure, and we should be defensive about not letting that grow indefinitely, but the DeleteOrphanedInstance job has a max_attempts value of 11, so the priority of that job should be capped at 2^(11-1) = 1024. After it hits 11 attempts, the job should simply fail. I'm really curious about why this isn't happening on your system. If you do see this issue again and are able to grep the worker logs, I'd really like to see what attempt number you're reaching. Or do you still have access to an environment where this happens? If so, would you be able to download the cloud_controller_worker logs from the cc-worker VMs and run the following grep on the combined log file?

grep "Job VCAP::CloudController::Jobs::Services::DeleteOrphanedInstance (id=.*) (queue=cc-generic) FAILED (.* prior attempts) with VCAP::Services::ServiceBrokers::V2" <your_combined_log_file>

(Or you could look inside the delayed_jobs table in postgres and look at the attempts column, whichever way is easier for your situation.)

Of course we could just fix this by capping the priority at some reasonable value, but I feel somewhat uncomfortable doing that before actually understanding the scenario where this is possible (maybe whatever is causing max_attempts to be ignored is a larger problem we need to look at).
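(For what it's worth, the failing value 2147483648 is exactly 2^31, i.e. 21 doublings past the expected 2^10 cap, so the overflow implies far more retries than max_attempts should ever allow.) A cap could be as simple as the following sketch; the constant and method here are hypothetical, not the actual cloud_controller_ng code:

    # Hypothetical defensive cap (illustrative only, not the shipped
    # implementation): double the priority on each failure, but never
    # past a chosen ceiling.
    MAX_PRIORITY = 2**20

    def next_priority(current_priority)
      [current_priority * 2, MAX_PRIORITY].min
    end

    next_priority(2**19) # => 1048576
    next_priority(2**20) # => 1048576 (capped)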

You’re using cf-release v270, which corresponds to cloud_controller_ng SHA d2dbae2838ec0aa8d8b26e8918d4036405177716 and it seems like nothing relevant has changed in this area since that commit.

For reference, here are the steps I took:

  1. Deployed and registered a service broker to a CF hosted on a bosh-lite. (I’m using cf-deployment 1.27).
  2. Modified the service broker to return status code 500 to all requests.
  3. Attempted to provision a new service instance via the CF CLI. This initiates orphan mitigation, kicking off the same DeleteOrphanedInstance job you're seeing problems with. That job fails every time it runs because of the 500s, causing it to keep retrying.
  4. After 11 attempts (approx. 34 hours in my environment, a bosh-lite with MySQL), the job is marked as failed and stops retrying. The maximum value the priority reaches is 1024.
  5. I redeployed with postgres in place of mysql, repeated the steps above, and got the same result.

The only code modification I made was adjusting reschedule_at in the job so that I didn’t have to wait the full 34 hours.
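In case it helps anyone reproduce this, that modification was along these lines (a sketch: delayed_job supports a per-job reschedule_at(time, attempts) hook, but the shortened backoff formula here is an assumption, not the production code):

    # Sketch only: shorten the backoff so all 11 attempts complete within
    # minutes instead of ~34 hours. The override formula is illustrative.
    module VCAP::CloudController::Jobs::Services
      class DeleteOrphanedInstance
        def reschedule_at(time, attempts)
          time + 2**attempts # seconds, instead of the much longer production backoff
        end
      end
    end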

If you’re able to grep those logs and see how many attempts are being made, or if you have any ideas of things we should be looking at, I’m very interested. At this point, I’m not able to repro the problem.

suhlig commented 6 years ago

Thanks for looking into this. Unfortunately we don't have the logs anymore and we have not seen this issue since.

About the limit - could it be that an attempt to restart a failed job via rake would not enforce the max of 11 attempts?

jenspinney commented 6 years ago

That's an interesting idea. I'll play around in my test environment and see what happens.

georgi-lozev commented 6 years ago

@suhlig @ablease and I tried the idea of re-running the job via a rake task, but there are only tasks for working off a queue; there seems to be no task to re-trigger a failed job. As a result, the same max_attempts of 11 is still enforced on the DeleteOrphanedInstance job.

The only way we were able to reproduce this was by manually updating the database entry for that job and increasing the priority value. Is there any chance that a script or a custom rake task in your CF deployment could be changing the job priority?
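For reference, our manual update was along the lines of this sketch (using Sequel, which Cloud Controller uses for database access; the connection URL is a placeholder):

    # Sketch of the manual reproduction: push the job's priority to the
    # signed 32-bit maximum so the next doubling overflows. Only the table
    # and column names come from the error above; everything else is assumed.
    require 'sequel'

    db = Sequel.connect(ENV.fetch('CCDB_URL')) # placeholder connection string
    db[:delayed_jobs]
      .where(Sequel.like(:handler, '%DeleteOrphanedInstance%'))
      .update(priority: 2_147_483_647)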

BR, SAPI Team

suhlig commented 6 years ago

Unable to reproduce. Will re-open if we see this again. Thanks for looking into this!