aquasecurity / starboard

Moved to https://github.com/aquasecurity/trivy-operator
https://aquasecurity.github.io/starboard/
Apache License 2.0

Operator CrashLoop in EKS 1.21 #856

Closed · DaemonDude23 closed this issue 2 years ago

DaemonDude23 commented 2 years ago

What steps did you take and what happened:

Deployed Helm chart version 0.8.1 with default values, with Starboard operator versions 0.13.0 and 0.13.1, among a few previous versions. On any AWS EKS 1.21 cluster I try to run the operator on, it CrashLoops and throws this error:

{"level":"info","ts":1639588607.0615091,"logger":"main","msg":"Starting operator","buildInfo":{"Version":"0.13.1","Commit":"e9cd6e1467f942ce114468f4d30012bd4256fa9c","Date":"2021-12-01T14:31:52Z","Executable":""}}
{"level":"info","ts":1639588607.0643575,"logger":"operator","msg":"Resolved install mode","install mode":"OwnNamespace","operator namespace":"starboard","target namespaces":}
{"level":"info","ts":1639588607.0653288,"logger":"operator","msg":"Constructing client cache","namespace":"starboard"}
{"level":"error","ts":1639588607.0655043,"logger":"main","msg":"Unable to run starboard operator","error":"getting kube client config: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable"}

What did you expect to happen:

The Operator to start, become healthy, and begin scans.

Anything else you would like to add:

Environment:

nublarsec commented 2 years ago

As another data point: we've got the Starboard Operator working successfully on EKS 1.21, using default settings, but with slightly later versions of Starboard and the Helm chart.

This is a known working combination:

@DaemonDude23 have you tried it with more recent versions?

DaemonDude23 commented 2 years ago

I've tested 5-10 releases since filing this issue, all with the same result, including today after I saw your message: I updated to the latest chart and Starboard version, but the same error persists. I diffed the latest default values side by side against mine; the only differences are in resources and podAnnotations, so nothing that should cause this kind of error. I've hit the same error on a bare-metal Kubernetes homelab as well as a k3s cluster. Surely others would run into this problem, but it seems not, and I'm the only outlier.
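
To be concrete, the overrides are along these lines (the values shown here are illustrative, not my exact ones):

# illustrative Helm values overrides; actual numbers and annotations differ
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
podAnnotations:
  example.com/some-annotation: "value"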

nublarsec commented 2 years ago

Digging a bit further based on your error message:

getting kube client config: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

This comes from pkg/operator/operator.go:

kubeConfig, err := ctrl.GetConfig()
if err != nil {
    return fmt.Errorf("getting kube client config: %w", err)
}

So it's using controller-runtime to find a kubeconfig it can use to connect to the Kubernetes API. Usually, when running inside the cluster, you don't have to set anything; it will use the in-cluster config and the service account token.

You can see the order of precedence in controller-runtime/config.go:

// GetConfig creates a *rest.Config for talking to a Kubernetes API server.
// If --kubeconfig is set, will use the kubeconfig file at that location.  Otherwise will assume running
// in cluster and use the cluster provided kubeconfig.
//
// It also applies saner defaults for QPS and burst based on the Kubernetes
// controller manager defaults (20 QPS, 30 burst)
//
// Config precedence
//
// * --kubeconfig flag pointing at a file
//
// * KUBECONFIG environment variable pointing at a file
//
// * In-cluster config if running in cluster
//
// * $HOME/.kube/config if exists.
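
To spell out the in-cluster case: the in-cluster config only works when the pod actually has a service account token mounted. Normally the kubelet projects the token and CA bundle into the container at /var/run/secrets/kubernetes.io/serviceaccount, based on pod spec fields roughly like these (a sketch, names approximate, not copied from the chart):

spec:
  serviceAccountName: starboard-operator
  # defaults to true; if set to false, no token is mounted
  # and the in-cluster config cannot be built
  automountServiceAccountToken: true
  containers:
    - name: operator
      image: aquasec/starboard-operator:0.13.1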

So a couple of possible thoughts:

nublarsec commented 2 years ago

A few more references down this line of thinking:

DaemonDude23 commented 2 years ago

Thanks for all the info. I found my problem (100% user error). In a manifest, I was using kustomize to patch automountServiceAccountToken: false onto the deployment. I thought that would only disable auto-mounting of the default service account token (which wouldn't matter, since we're not using the default service account anyway), not the token for the service account explicitly assigned to the pod(s). I'll do some testing with it tomorrow and will likely close this issue then.
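
For reference, the problematic patch was roughly this strategic-merge patch applied through kustomize (names approximate):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: starboard-operator
spec:
  template:
    spec:
      # this was the culprit: the pod-level setting overrides the
      # ServiceAccount's own automount setting, so even the token for the
      # explicitly assigned service account never gets mounted
      automountServiceAccountToken: false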

Thanks for jogging my brain to look at service account token mounting! I completely forgot I was using a patch.