cloudfoundry / cloud_controller_ng

Cloud Foundry Cloud Controller
Apache License 2.0
191 stars 357 forks source link

V3 jobs may end up `FAILED` with an empty `errors` field #1959

Open reidmit opened 3 years ago

reidmit commented 3 years ago

Issue

It seems that there are edge cases where a job will fail (i.e. change state to FAILED), but its errors field will be empty.

Context

See this report on the CF CLI: https://github.com/cloudfoundry/cli/issues/2088

In that case, the user was deleting an org. The deletion failed, but the job ended up with an empty errors field.

The CLI did not handle this gracefully, since it assumed there would always be at least one error there. I believe the CLI is making a change to handle this edge case gracefully (i.e. without panicing), but it still feels like a bug in CC.

If you're curious, this is what will now be surfaced by the CLI in the case that a job fails without any real info to show.

Steps to Reproduce

Unsure, beyond what was written in that CLI issue :(

I think this is a tricky race condition.

Expected result

If a job fails, it should always have at least one error object listed in errors.

Current result

There is at least one case where errors ends up empty. I strongly suspect this will be hard to reproduce/track down. See my theories under "Possible Fix".

Possible Fix

I'm struggling to find solid proof here, but my theory is that there are race conditions where a job's state gets updated to FAILED before any errors are associated with it in the DB. Clients like the CLI are typically polling for state change, so as soon as it becomes FAILED, they will try to read the errors off it.

I remember some conversation around this sort of thing when we added the warnings field to V3 jobs: https://github.com/cloudfoundry/cloud_controller_ng/commit/8cb7300ba485716c7ec09ff09f3ebf8181788ff3#diff-2236f8b5d3eadf13a963e2da082b9b1b4be04c9a83b2d418a01616e7e7edd8a2

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/175753865

The labels on this github issue will be updated when the story is started.

sethboyles commented 3 years ago

I'm afraid we might be at the mercy of the DelayedJob implementation of callback hooks for this one.

The failure callback occurs when a job fails the number of times defined in max_attempts, in this case I believe it is one: https://github.com/cloudfoundry/cloud_controller_ng/blob/3f0c5f3b9be88f5ad16adf40828d5db2616f51fa/app/jobs/delete_action_job.rb#L35-L37

Looking at the code in DelayedJob I can't see why failed would have been called before error, but it's possible there is a race condition I am not seeing.

I would be curious to know if this failure was somehow a deserialization error. From my reading of the code, a deserialization error might always result in a failed PollableJob with nothing in the error column, as DelayedJob doesn't seem to call the error hook in that case at all.

Nevertheless we don't have control over the failure hook and can't guarantee that there will be an error at this point.