Closed agrare closed 10 months ago
@agrare Just for my clarity, what causes each of those error states?
@Fryguy I reproduced by just typo'ing an image name, first you get ErrImagePull
then after a period of time the pod will be in ImagePullBackOff
so really the same root cause can be two different errors depending on when we check.
@agrare could you add a test that shows the output
will end up as State.Timeout
or what ever
Not sure I understand how this works...does the fact that it's no longer running?
trigger some next step to say it's an error?
Not sure I understand how this works...does the fact that it's no longer running? trigger some next step to say it's an error?
The problem with these "failed" containers is that the pod is Pending, we were considering Pending to be a running state since that is the same state a pod which just hasn't started yet is in. This was causing failed pods to "hang" since we were waiting for kubernetes to start them.
Once a pod is no longer running we then check success? and get the output, so by marking these failed pods as not running we allow it to go through the normal process for a completed pod.
The workflow will probably want to display a reasonable cause
.
Not sure how we can expose this unless we tweak output
to put in this error?
@kbrock I've updated output to return a properly formatted error with "Error"
and "Cause"
keys
Here's an example of a failed workflow's output
{"Error"=>"ErrImagePull", "Cause"=>"rpc error: code = Unknown desc = failed to pull and unpack image \"docker.io/agrare/hello-worl:latest\": failed to resolve reference \"docker.io/agrare/hello-worl:latest\": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed"}
And I can add Fail state matchers:
"NextState": {
"Type": "Task",
"Resource": "docker://docker.io/agrare/hello-worl:latest",
"Secrets": ["vmdb:aaa-bbb-ccc"],
"End": true,
"Catch": [
{
"ErrorEquals": [ "ErrImagePull", "ImagePullBackOff" ],
"Next": "ImageFailState"
}
]
},
"ImageFailState": {
"Type": "Fail",
"Error": "ImageFailStateError",
"Cause": "ImageFailed"
}
Running state: [ImageFailState]
{"Error"=>"ImageFailStateError", "Cause"=>"ImageFailed"}
This is looking great
Did you want to try and find an Error
from https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html or are you thinking custom Error
is good for now?
This is so much better than the timeout PR I put together
I didn't want to just use the generic "States.TaskFailed" or "States.Runtime" since that is pretty broad and wouldn't allow people to handle these errors specifically, also we're good to define our own errors as long as they don't collide with the States.
namespace
States can report errors with other names. However, these error names can't begin with the States. prefix.
Thought Cause
has the purpose of helping people understand what is wrong.
Thought Cause has the purpose of helping people understand what is wrong.
It does, Error is equivalent to the exception class and Cause is the message. Did I screw that up somewhere? https://github.com/ManageIQ/floe/pull/135/files#diff-15bec43924080ff9b4e1850b9f3e3864476abea254de8b717a98816f856cb5a0R89
When a kubernetes pod fails to pull an image the pod stays in "Pending" indefinitely. This causes floe to "hang" waiting for the pod to start but it never will.