ManageIQ / floe

Floe is a runner for Amazon States Language workflows
Apache License 2.0
0 stars 5 forks source link

Handle ImagePullErr/ImagePullBackOff as errors #135

Closed agrare closed 10 months ago

agrare commented 11 months ago

When a kubernetes pod fails to pull an image the pod stays in "Pending" indefinitely. This causes floe to "hang" waiting for the pod to start but it never will.

Fryguy commented 11 months ago

@agrare Just for my clarity, what causes each of those error states?

agrare commented 11 months ago

@Fryguy I reproduced by just typo'ing an image name, first you get ErrImagePull then after a period of time the pod will be in ImagePullBackOff so really the same root cause can be two different errors depending on when we check.

kbrock commented 11 months ago

@agrare could you add a test that shows the output will end up as State.Timeout or what ever

Fryguy commented 11 months ago

Not sure I understand how this works...does the fact that it's no longer running? trigger some next step to say it's an error?

agrare commented 11 months ago

Not sure I understand how this works...does the fact that it's no longer running? trigger some next step to say it's an error?

The problem with these "failed" containers is that the pod is Pending, we were considering Pending to be a running state since that is the same state a pod which just hasn't started yet is in. This was causing failed pods to "hang" since we were waiting for kubernetes to start them.

Once a pod is no longer running we then check success? and get the output, so by marking these failed pods as not running we allow it to go through the normal process for a completed pod.

kbrock commented 11 months ago

The workflow will probably want to display a reasonable cause. Not sure how we can expose this unless we tweak output to put in this error?

agrare commented 11 months ago

@kbrock I've updated output to return a properly formatted error with "Error" and "Cause" keys

Here's an example of a failed workflow's output

{"Error"=>"ErrImagePull", "Cause"=>"rpc error: code = Unknown desc = failed to pull and unpack image \"docker.io/agrare/hello-worl:latest\": failed to resolve reference \"docker.io/agrare/hello-worl:latest\": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed"}

And I can add Fail state matchers:

    "NextState": {
      "Type": "Task",
      "Resource": "docker://docker.io/agrare/hello-worl:latest",
      "Secrets": ["vmdb:aaa-bbb-ccc"],
      "End": true,
      "Catch": [
        {
          "ErrorEquals": [ "ErrImagePull", "ImagePullBackOff" ],
          "Next": "ImageFailState"
        }
      ]
    },

    "ImageFailState": {
      "Type": "Fail",
      "Error": "ImageFailStateError",
      "Cause": "ImageFailed"
    }
Running state: [ImageFailState]
{"Error"=>"ImageFailStateError", "Cause"=>"ImageFailed"}
kbrock commented 11 months ago

This is looking great

Did you want to try and find an Error from https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html or are you thinking custom Error is good for now?

This is so much better than the timeout PR I put together

agrare commented 11 months ago

I didn't want to just use the generic "States.TaskFailed" or "States.Runtime" since that is pretty broad and wouldn't allow people to handle these errors specifically, also we're good to define our own errors as long as they don't collide with the States. namespace

States can report errors with other names. However, these error names can't begin with the States. prefix.

kbrock commented 10 months ago

Thought Cause has the purpose of helping people understand what is wrong.

miq-bot commented 10 months ago

Checked commits https://github.com/agrare/floe/compare/8caa310111b085ca1d73fbc960d1b84c2138f5ac~...5848f159f51df71b79ef86caaaa59dbc12f44218 with ruby 2.6.10, rubocop 1.28.2, haml-lint 0.35.0, and yamllint 3 files checked, 0 offenses detected Everything looks fine. :cake:

agrare commented 10 months ago

Thought Cause has the purpose of helping people understand what is wrong.

It does, Error is equivalent to the exception class and Cause is the message. Did I screw that up somewhere? https://github.com/ManageIQ/floe/pull/135/files#diff-15bec43924080ff9b4e1850b9f3e3864476abea254de8b717a98816f856cb5a0R89