digitalocean / clusterlint

A best practices checker for Kubernetes clusters. 🤠
Apache License 2.0

Runtime error over latest tag on cluster #71

Closed joannelynch92 closed 4 years ago

joannelynch92 commented 4 years ago

I got the following error:

$ clusterlint run

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x11862b5]

goroutine 193 [running]:
github.com/digitalocean/clusterlint/vendor/github.com/docker/distribution/reference.WithTag(0x0, 0x0, 0x13f96c7, 0x6, 0x0, 0x1340600, 0x1, 0xc0010ae000)
    /home/joanne/go/src/github.com/digitalocean/clusterlint/vendor/github.com/docker/distribution/reference/reference.go:280 +0x3f5
github.com/digitalocean/clusterlint/vendor/github.com/docker/distribution/reference.TagNameOnly(0x0, 0x0, 0x0, 0x0)
    /home/joanne/go/src/github.com/digitalocean/clusterlint/vendor/github.com/docker/distribution/reference/normalize.go:130 +0xa5
github.com/digitalocean/clusterlint/checks/basic.(*latestTagCheck).checkTags(0x2285958, 0xc000bf5e40, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0xc000ca83e0, 0x18, ...)
    /home/joanne/go/src/github.com/digitalocean/clusterlint/checks/basic/latest_tag.go:70 +0x164
github.com/digitalocean/clusterlint/checks/basic.(*latestTagCheck).Run(0x2285958, 0xc000517280, 0x2268080, 0x1010000015c58c0, 0x226d180, 0xc0000e86e8, 0x44a86b)
    /home/joanne/go/src/github.com/digitalocean/clusterlint/checks/basic/latest_tag.go:57 +0x14f
github.com/digitalocean/clusterlint/checks.Run.func1(0x8, 0x148ace0)
    /home/joanne/go/src/github.com/digitalocean/clusterlint/checks/run_checks.go:51 +0xc3
github.com/digitalocean/clusterlint/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1(0xc0004c9c20, 0xc000519180)
    /home/joanne/go/src/github.com/digitalocean/clusterlint/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x57
created by github.com/digitalocean/clusterlint/vendor/golang.org/x/sync/errgroup.(*Group).Go
    /home/joanne/go/src/github.com/digitalocean/clusterlint/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x66

It looks like clusterlint crashed in the latest-tag check, because clusterlint run ignore-checks latest-tag ran successfully.

The problem seems to be caused by a pod on my cluster that uses a latest tag:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2019-10-24T08:47:41Z"
  generateName: jaeger-698f8b8cf4-
  labels:
    app: jaeger
    app.kubernetes.io/component: all-in-one
    app.kubernetes.io/name: jaeger
    pod-template-hash: 698f8b8cf4
  name: jaeger-698f8b8cf4-nmjcg
...
...
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2019-10-24T08:47:41Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2019-10-24T08:47:59Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2019-10-24T08:47:59Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2019-10-24T08:47:41Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://1e012a3f85a056a0674877ce93fdb4ad54bc6a6151e58611f7058739f270cab0
    image: jaegertracing/all-in-one:latest
    imageID: docker-pullable://jaegertracing/all-in-one@sha256:4cb2598b80d4f37b1d66fbe35b2f7488fa04f4d269e301919e8c45526f2d73c3
    lastState: {}
    name: jaeger
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2019-10-24T08:47:45Z"
  hostIP: 192.168.176.31
  phase: Running
  podIP: 192.168.181.143
  qosClass: Burstable
  startTime: "2019-10-24T08:47:41Z"

See the image field under status.containerStatuses for where the problem is.

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.8-eks-b7174d", GitCommit:"b7174db5ee0e30c94a0b9899c20ac980c0850fc8", GitTreeState:"clean", BuildDate:"2019-10-18T17:56:01Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

adamwg commented 4 years ago

Thanks for the report! Looks like ParseNormalizedNamed is returning an error, which we're ignoring. I'll have to dig a bit to see why we're ignoring the error and what exactly it's returning. I'll try to reproduce the problem and then fix it.
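
As an illustration of the failure mode described above, here is a minimal Go sketch of how discarding a parser's error can end in the same panic message as in the trace. The `named` interface, `parse`, `describe`, and `tryDescribe` are hypothetical stand-ins invented for this example; they are not the clusterlint or docker/distribution code.

```go
package main

import (
	"errors"
	"fmt"
)

// named is a hypothetical stand-in for an interface-typed image reference.
type named interface {
	Name() string
}

// parse is a hypothetical parser that always fails. Like many Go parsers,
// it returns a nil value together with a non-nil error.
func parse(s string) (named, error) {
	return nil, errors.New("invalid reference format")
}

// describe calls a method on ref. If ref is a nil interface value,
// the method call panics.
func describe(ref named) string {
	return "image " + ref.Name()
}

// tryDescribe discards the parse error, then recovers from the resulting
// panic so that the panic message can be inspected instead of crashing.
func tryDescribe(s string) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	ref, _ := parse(s) // error ignored, so ref is nil here
	return describe(ref)
}

func main() {
	fmt.Println(tryDescribe("example/app:"))
}
```

Without the `recover`, the program would terminate with a panic and a goroutine trace, much like the report above.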

adamwg commented 4 years ago

@joannelynch92 I created a pod in my own cluster using the jaegertracing/all-in-one:latest image, and was able to run clusterlint successfully:

% clusterlint run -c latest-tag
[warning] default/pod/jaeger: Avoid using latest tag for container 'jaeger'

This makes me think that pod/container is not actually the one causing trouble in your case, but clearly some pod is causing us to crash. I've created PR #72 to add a check for the error we're ignoring. If you're able to build a clusterlint binary from that PR and run it on your cluster, I'd be interested to see the output of kubectl get -o yaml for any pod that gets the new "Image name for container '<name>' could not be parsed" warning.

If you're not able to build your own clusterlint from that branch, feel free to wait until we merge the PR and do a release, then give it a try.

As a bit of background: we were ignoring the error from reference.ParseNormalizedNamed because that's the same function k8s itself calls to parse image names when you deploy a workload, so we expect images in a running workload would always have a name the function can parse. It seems like there's something running in your cluster that has an image name that doesn't parse successfully; I'm very curious what that might be :-).
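
A rough sketch of the general pattern, checking the parse error and reporting it rather than passing a nil reference onward, might look like the following. It is not the code from PR #72: `imageRef`, `parseImage`, and `checkLatestTag` are hypothetical names, and `parseImage` is a deliberately simplified stand-in for the real parser (it does not handle digests, for example). Only the two warning strings are taken from this thread.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// imageRef is a hypothetical, simplified parsed image reference.
type imageRef struct {
	name string
	tag  string
}

var errInvalidReference = errors.New("invalid reference format")

// parseImage is a simplified stand-in parser (assumption: no digest support).
// It returns a nil reference and an error when the input is malformed.
func parseImage(s string) (*imageRef, error) {
	if s == "" {
		return nil, errInvalidReference
	}
	name, tag := s, ""
	// A colon after the last slash separates name and tag; a colon before
	// the last slash belongs to a registry port such as localhost:5000.
	if i := strings.LastIndex(s, ":"); i > strings.LastIndex(s, "/") {
		name, tag = s[:i], s[i+1:]
		if name == "" || tag == "" {
			return nil, errInvalidReference
		}
	}
	return &imageRef{name: name, tag: tag}, nil
}

// checkLatestTag returns a warning for one container image,
// or an empty string if there is nothing to report.
func checkLatestTag(container, image string) string {
	ref, err := parseImage(image)
	if err != nil {
		// Surface the parse failure instead of using a nil reference.
		return fmt.Sprintf("Image name for container '%s' could not be parsed", container)
	}
	if ref.tag == "" || ref.tag == "latest" {
		return fmt.Sprintf("Avoid using latest tag for container '%s'", container)
	}
	return ""
}

func main() {
	fmt.Println(checkLatestTag("jaeger", "jaegertracing/all-in-one:latest"))
	fmt.Println(checkLatestTag("hello-world", "example.com/team/app:"))
}
```

The design choice here is to turn an unparseable image name into a warning of its own, so one malformed workload does not abort the whole run.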

joannelynch92 commented 4 years ago

Someone fixed the problem pod on the cluster over the weekend, but I had saved the output for all of the cluster's pods and noticed that one image was missing its tag.

$ kubectl get pod hello-release-hello-world-76bc67557d-4g565 -o yaml

apiVersion: v1
kind: Pod
spec:
  containers:
  - image: 'redacted.dkr.ecr.us-east-1.amazonaws.com/redacted/redacted:'
status:
  containerStatuses:
  - image: 'redacted.dkr.ecr.us-east-1.amazonaws.com/redacted/redacted:'
    imageID: ""
    lastState: {}
    name: hello-world
    ready: false
    restartCount: 0
    state:
      waiting:
        message: 'Failed to apply default image tag "redacted.dkr.ecr.us-east-1.amazonaws.com/redacted/redacted:":
          couldn''t parse image reference "redacted.dkr.ecr.us-east-1.amazonaws.com/redacted/redacted:":
          invalid reference format'
        reason: InvalidImageName

I put the broken image back in as a test and clusterlint crashed again (v0.1.3, without the changes from #72), so it looks like that was the image name at fault. Thanks for your help!

adamwg commented 4 years ago

@joannelynch92 Ah, thanks for the update - I am able to reproduce the problem with a missing tag. #72 fixes the issue.

adamwg commented 4 years ago

Closing this since it was fixed by #72.