GoogleCloudPlatform / testgrid

Add alerts to catch Knative TestGrid pods not running #1066

Open michelle192837 opened 1 year ago

michelle192837 commented 1 year ago

The Knative TestGrid pods were stuck in CrashLoopBackOff due to a permissions issue reading the config, e.g.:

jsonPayload: {
error: "observe config: can't read "gs://knative-own-testgrid/config": open: Get "https://storage.googleapis.com/knative-own-testgrid/config": compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden: The caller does not have permission
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
    https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

`"
file: "cmd/summarizer/main.go:151"
func: "main.main"
level: "error"
msg: "Could not summarize"
}
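
The error points at a missing IAM policy binding on the target service account. One way to confirm that before re-binding is to look for the pod's Workload Identity member in the service account's IAM policy. A minimal sketch, assuming the policy is printed in gcloud's default YAML format; `has_member` is a hypothetical helper name, and it only checks that the member appears somewhere in the policy, not which role it is bound to:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the testgrid repo): succeeds if the IAM
# policy text on stdin lists the given member as a YAML list item.
has_member() {
  local member="$1"
  # -F matches the brackets in the member string literally; -q suppresses output.
  grep -qF -- "- ${member}"
}

# Example usage (requires gcloud and read access to the project):
#   gcloud iam service-accounts get-iam-policy \
#     testgrid-updater@knative-tests.iam.gserviceaccount.com \
#     --project knative-tests \
#     | has_member 'serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer]' \
#     || echo "binding missing"
```

If the check fails for a component, that component's pod would be expected to hit the 403 above when it tries to obtain an access token.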

I ran https://github.com/GoogleCloudPlatform/testgrid/blob/master/cluster/bind-service-accounts.sh to see whether any of the service accounts (SAs) needed to be re-bound, and the answer was 'yes':

./bind-service-accounts.sh
Service accounts:
./canary/api.yaml:    iam.gke.io/gcp-service-account: testgrid-canary-api@k8s-testgrid.iam.gserviceaccount.com
./canary/api.yaml:  namespace: testgrid-canary
./canary/api.yaml:      serviceAccountName: api
./canary/config_merger.yaml:    iam.gke.io/gcp-service-account: testgrid-canary@k8s-testgrid.iam.gserviceaccount.com
./canary/config_merger.yaml:  namespace: testgrid-canary
./canary/config_merger.yaml:      serviceAccountName: config-merger
./canary/monitoring.yaml:  namespace: testgrid-canary
./canary/summarizer.yaml:    iam.gke.io/gcp-service-account: testgrid-canary@k8s-testgrid.iam.gserviceaccount.com
./canary/summarizer.yaml:  namespace: testgrid-canary
./canary/summarizer.yaml:      serviceAccountName: summarizer
./canary/tabulator.yaml:    iam.gke.io/gcp-service-account: testgrid-canary@k8s-testgrid.iam.gserviceaccount.com
./canary/tabulator.yaml:  namespace: testgrid-canary
./canary/tabulator.yaml:      serviceAccountName: tabulator
./canary/updater.yaml:    iam.gke.io/gcp-service-account: testgrid-canary@k8s-testgrid.iam.gserviceaccount.com
./canary/updater.yaml:  namespace: testgrid-canary
./canary/updater.yaml:      serviceAccountName: updater
./prod/config_merger.yaml:    iam.gke.io/gcp-service-account: updater@k8s-testgrid.iam.gserviceaccount.com
./prod/config_merger.yaml:  namespace: testgrid
./prod/config_merger.yaml:      serviceAccountName: config-merger
./prod/knative/summarizer.yaml:    iam.gke.io/gcp-service-account: testgrid-updater@knative-tests.iam.gserviceaccount.com
./prod/knative/summarizer.yaml:  namespace: knative
./prod/knative/summarizer.yaml:      serviceAccountName: summarizer
./prod/knative/tabulator.yaml:    iam.gke.io/gcp-service-account: testgrid-updater@knative-tests.iam.gserviceaccount.com
./prod/knative/tabulator.yaml:  namespace: knative
./prod/knative/tabulator.yaml:      serviceAccountName: tabulator
./prod/knative/updater.yaml:    iam.gke.io/gcp-service-account: testgrid-updater@knative-tests.iam.gserviceaccount.com
./prod/knative/updater.yaml:  namespace: knative
./prod/knative/updater.yaml:      serviceAccountName: updater
./prod/monitoring.yaml:  namespace: testgrid
./prod/README.md:1. Bind the service account(s) for the component in the `testgrid-canary` namespace:
./prod/README.md:1. Bind the service account(s) for the component in the `testgrid` namespace:
./prod/summarizer.yaml:    iam.gke.io/gcp-service-account: updater@k8s-testgrid.iam.gserviceaccount.com
./prod/summarizer.yaml:  namespace: testgrid
./prod/summarizer.yaml:      serviceAccountName: summarizer
./prod/tabulator.yaml:    iam.gke.io/gcp-service-account: updater@k8s-testgrid.iam.gserviceaccount.com
./prod/tabulator.yaml:  namespace: testgrid
./prod/tabulator.yaml:      serviceAccountName: tabulator
./prod/updater.yaml:    iam.gke.io/gcp-service-account: updater@k8s-testgrid.iam.gserviceaccount.com
./prod/updater.yaml:  namespace: testgrid
./prod/updater.yaml:      serviceAccountName: updater
./setup.sh:echo -n 'testgrid namespace: ' >&2
NOOP: testgrid-canary/config-merger has workloadIdentityUser access to testgrid-canary@k8s-testgrid.iam.gserviceaccount.com
NOOP: testgrid-canary/summarizer has workloadIdentityUser access to testgrid-canary@k8s-testgrid.iam.gserviceaccount.com
NOOP: testgrid-canary/tabulator has workloadIdentityUser access to testgrid-canary@k8s-testgrid.iam.gserviceaccount.com
NOOP: testgrid-canary/updater has workloadIdentityUser access to testgrid-canary@k8s-testgrid.iam.gserviceaccount.com
serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater] in serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer]
Grant serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer] roles/iam.workloadIdentityUser access to testgrid-updater@knative-tests.iam.gserviceaccount.com? [y/N] y
+ /usr/bin/gcloud iam service-accounts --project knative-tests add-iam-policy-binding testgrid-updater@knative-tests.iam.gserviceaccount.com --role roles/iam.workloadIdentityUser --member 'serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer]'
Updated IAM policy for serviceAccount [testgrid-updater@knative-tests.iam.gserviceaccount.com].
bindings:
- members:
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer]
  - serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater]
  role: roles/iam.workloadIdentityUser
etag: BwXq2u1cNwo=
version: 1
DONE: gave knative/summarizer workloadIdentityUser access to testgrid-updater@knative-tests.iam.gserviceaccount.com
serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater] in serviceAccount:k8s-testgrid.svc.id.goog[knative/tabulator]
Grant serviceAccount:k8s-testgrid.svc.id.goog[knative/tabulator] roles/iam.workloadIdentityUser access to testgrid-updater@knative-tests.iam.gserviceaccount.com? [y/N] y
+ /usr/bin/gcloud iam service-accounts --project knative-tests add-iam-policy-binding testgrid-updater@knative-tests.iam.gserviceaccount.com --role roles/iam.workloadIdentityUser --member 'serviceAccount:k8s-testgrid.svc.id.goog[knative/tabulator]'
Updated IAM policy for serviceAccount [testgrid-updater@knative-tests.iam.gserviceaccount.com].
bindings:
- members:
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer]
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/tabulator]
  - serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater]
  role: roles/iam.workloadIdentityUser
etag: BwXq2u2Rpkc=
version: 1
DONE: gave knative/tabulator workloadIdentityUser access to testgrid-updater@knative-tests.iam.gserviceaccount.com
serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater] in serviceAccount:k8s-testgrid.svc.id.goog[knative/updater]
Grant serviceAccount:k8s-testgrid.svc.id.goog[knative/updater] roles/iam.workloadIdentityUser access to testgrid-updater@knative-tests.iam.gserviceaccount.com? [y/N] y
+ /usr/bin/gcloud iam service-accounts --project knative-tests add-iam-policy-binding testgrid-updater@knative-tests.iam.gserviceaccount.com --role roles/iam.workloadIdentityUser --member 'serviceAccount:k8s-testgrid.svc.id.goog[knative/updater]'
Updated IAM policy for serviceAccount [testgrid-updater@knative-tests.iam.gserviceaccount.com].
bindings:
- members:
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/summarizer]
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/tabulator]
  - serviceAccount:k8s-testgrid.svc.id.goog[knative/updater]
  - serviceAccount:knative-tests.svc.id.goog[test-pods/testgrid-updater]
  role: roles/iam.workloadIdentityUser
etag: BwXq2u4Lseg=
version: 1
DONE: gave knative/updater workloadIdentityUser access to testgrid-updater@knative-tests.iam.gserviceaccount.com
NOOP: testgrid/config-merger has workloadIdentityUser access to updater@k8s-testgrid.iam.gserviceaccount.com
NOOP: testgrid/summarizer has workloadIdentityUser access to updater@k8s-testgrid.iam.gserviceaccount.com
NOOP: testgrid/tabulator has workloadIdentityUser access to updater@k8s-testgrid.iam.gserviceaccount.com
NOOP: testgrid/updater has workloadIdentityUser access to updater@k8s-testgrid.iam.gserviceaccount.com
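
For reference, the three bindings granted in the output above all follow the same pattern, so the command for a single component can be sketched as follows. This is a rough sketch, not the script's actual implementation: `wi_member` and `bind_cmd` are hypothetical helper names, the project is derived on the assumption that the service account email has the form `NAME@PROJECT.iam.gserviceaccount.com`, and the command is only printed, not executed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; these names do not come from bind-service-accounts.sh.

# Build the Workload Identity member string for a Kubernetes service account.
# Args: cluster project, Kubernetes namespace, Kubernetes service account name.
wi_member() {
  local project="$1" namespace="$2" ksa="$3"
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$project" "$namespace" "$ksa"
}

# Print (do not run) the gcloud command that grants workloadIdentityUser
# on a GCP service account to the given member.
bind_cmd() {
  local gsa="$1" member="$2"
  local gsa_project="${gsa#*@}"      # drop everything up to and including '@'
  gsa_project="${gsa_project%%.*}"   # keep the project ID before the first '.'
  printf "gcloud iam service-accounts --project %s add-iam-policy-binding %s --role roles/iam.workloadIdentityUser --member '%s'\n" \
    "$gsa_project" "$gsa" "$member"
}

# Example: print the command for the knative/updater pod.
bind_cmd testgrid-updater@knative-tests.iam.gserviceaccount.com \
  "$(wi_member k8s-testgrid knative updater)"
```

Printing the command first makes it possible to review the member and target service account before changing the IAM policy, similar to the script's [y/N] prompt.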

michelle192837 commented 1 year ago

It looks like the pods are able to start now! Remaining tasks: