Open reidmit opened 3 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/175753865
The labels on this github issue will be updated when the story is started.
I'm afraid we might be at the mercy of the DelayedJob implementation of callback hooks for this one.
The failure
callback occurs when a job fails the number of times defined in max_attempts
, in this case I believe it is one: https://github.com/cloudfoundry/cloud_controller_ng/blob/3f0c5f3b9be88f5ad16adf40828d5db2616f51fa/app/jobs/delete_action_job.rb#L35-L37
Looking at the code in DelayedJob I can't see why failed
would have been called before error
, but it's possible there is a race condition I am not seeing.
I would be curious to know if this failure was somehow a deserialization error. From my reading of the code, a deserialization error might always result in a failed PollableJob with nothing in the error column, as DelayedJob doesn't seem to call the error hook in that case at all.
Nevertheless we don't have control over the failure
hook and can't guarantee that there will be an error at this point.
Issue
It seems that there are edge cases where a job will fail (i.e. change state to
FAILED
), but itserrors
field will be empty.Context
See this report on the CF CLI: https://github.com/cloudfoundry/cli/issues/2088
In that case, the user was deleting an org. The deletion failed, but the job ended up with an empty
errors
field.The CLI did not handle this gracefully, since it assumed there would always be at least one error there. I believe the CLI is making a change to handle this edge case gracefully (i.e. without panicing), but it still feels like a bug in CC.
If you're curious, this is what will now be surfaced by the CLI in the case that a job fails without any real info to show.
Steps to Reproduce
Unsure, beyond what was written in that CLI issue :(
I think this is a tricky race condition.
Expected result
If a job fails, it should always have at least one error object listed in
errors
.Current result
There is at least one case where
errors
ends up empty. I strongly suspect this will be hard to reproduce/track down. See my theories under "Possible Fix".Possible Fix
I'm struggling to find solid proof here, but my theory is that there are race conditions where a job's
state
gets updated toFAILED
before anyerrors
are associated with it in the DB. Clients like the CLI are typically polling for state change, so as soon as it becomesFAILED
, they will try to read the errors off it.I remember some conversation around this sort of thing when we added the
warnings
field to V3 jobs: https://github.com/cloudfoundry/cloud_controller_ng/commit/8cb7300ba485716c7ec09ff09f3ebf8181788ff3#diff-2236f8b5d3eadf13a963e2da082b9b1b4be04c9a83b2d418a01616e7e7edd8a2