GoogleCloudPlatform / k8s-stackdriver

Apache License 2.0

Support workload identity #315

Open matthias-froomle opened 4 years ago

matthias-froomle commented 4 years ago

When deploying the Stackdriver custom metrics adapter inside a GKE cluster with Workload Identity enabled, the adapter (v0.10.2) fails to start.

Steps taken:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

Adapter deployment log:

"unable to construct client config: unable to construct lister client config to initialize provider: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory" 
source: "adapter.go:55" 
pdecat commented 4 years ago

Hi,

I too had trouble with CMSA and GKE Workload Identity on GKE v1.15.7-gke.23.

The error messages at startup differed though:

I0310 16:01:28.490640       1 serving.go:312] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0310 16:01:30.116329       1 secure_serving.go:116] Serving securely on [::]:443
E0310 16:01:32.987569       1 provider.go:241] Failed request to stackdriver api: Get https://monitoring.googleapis.com/v3/projects/myproject-preprod/metricDescriptors?alt=json&filter=resource.labels.project_id+%3D+%22myproject-preprod%22+AND+resource.labels.cluster_name+%3D+%22myproject-preprod-europe-west1-gke1%22+AND+resource.labels.location+%3D+%22europe-west1%22+AND+resource.type+%3D+one_of%28%22k8s_pod%22%2C%22k8s_node%22%29&prettyPrint=false: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fmonitoring.read: net/http: timeout awaiting response headers

The log would then be spammed by:

E0310 16:01:35.432807       1 provider.go:241] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

I've managed to make it work with GKE Workload Identity by adding hostNetwork: true to the deployment's spec.

It works because of the following documented limitation:

Workload Identity can't be used with Pods running in the host network.

See https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#limitations
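
The workaround above amounts to a one-line change in the adapter's Deployment. A minimal sketch of the relevant excerpt (resource names follow the upstream manifest, but verify against the version you deployed):

```yaml
# Excerpt of the adapter Deployment with the hostNetwork workaround applied.
# Only the hostNetwork line is added relative to the upstream manifest;
# it makes metadata requests hit the GCE metadata server instead of the
# GKE (Workload Identity) metadata server.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-metrics-stackdriver-adapter
  namespace: custom-metrics
spec:
  template:
    spec:
      hostNetwork: true
```

Note that this trades Workload Identity for the node's service account, per the limitation quoted above.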

serathius commented 4 years ago

/cc @kawych

davidxia commented 4 years ago

@pdecat, in your case, running CMSA on a node with Workload Identity (WI) enabled broke it probably because the Google Service Account (GSA) associated with WI that CMSA was running as didn't have the roles/monitoring.viewer role on the relevant GCP projects that hold the metrics CMSA is trying to query.

When you changed CMSA to run on host network mode, it probably worked because now CMSA is using the GKE node's default GSA which is different than the WI-related GSA. This GKE node default GSA probably has roles/monitoring.viewer role or at least those permissions to query the metrics.
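
If that theory holds, the WI-native fix would be to grant the monitoring viewer role to the GSA the adapter runs as, rather than switching to host networking. A sketch, with placeholder project and account names (not taken from this thread):

```shell
# Placeholders -- substitute your own project ID and GSA name.
PROJECT_ID="my-project"
GSA="custom-metrics-sd-adapter@${PROJECT_ID}.iam.gserviceaccount.com"

# Grant read access to Cloud Monitoring metrics on the project
# that holds the metrics CMSA queries.
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member "serviceAccount:${GSA}" \
  --role "roles/monitoring.viewer"
```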

davidxia commented 4 years ago

Workload Identity with CMSA 0.10.2 seems to work for me. I'm seeing these logs, which are the same as the ones from when it wasn't using WI.

I0404 01:50:34.583939       1 serving.go:312] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0404 01:50:37.188933       1 secure_serving.go:116] Serving securely on [::]:443
E0404 01:50:41.273390       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0404 01:50:41.273463       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
pdecat commented 4 years ago

@davidxia do you have Horizontal Pod Autoscalers based on external Stackdriver metrics?

davidxia commented 4 years ago

Yes

varungbt commented 4 years ago

Seeing the same issue.

JacobSMoller commented 4 years ago

Seems to work for me as well using workload identity.

Getting a never-ending stream of these logs, though:

1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0430 15:12:25.531660       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}

davidxia commented 4 years ago

Same, would be great if these could be silenced or moved to a lower logging level.

cromaniuc commented 4 years ago

@davidxia, @JacobSMoller What role are you using for the Google Service Account associated with the Workload Identity that CMSA is running as? I'm using roles/monitoring.admin and it fails with 403. When I use hostNetwork: true, it works. Thanks!

davidxia commented 4 years ago

roles/monitoring.viewer

JacobSMoller commented 4 years ago

roles/monitoring.admin

LouisTrezzini commented 4 years ago

We're facing the same issue. Workload Identity works fine for us in every single deployment except this one, so we're guessing there's something going on here.

aubm commented 4 years ago

I managed to make it work with WI using the following approach:

gcloud iam service-accounts create custom-metrics-sd-adapter --project "$GCP_PROJECT_ID"

gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.editor"

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  "iam.gke.io/gcp-service-account=custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --namespace custom-metrics
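
After running the steps above, a quick way to verify the wiring (a sketch, assuming the namespace and service-account names used above):

```shell
# Confirm the KSA annotation points at the intended GSA.
kubectl get serviceaccount custom-metrics-stackdriver-adapter \
  -n custom-metrics \
  -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'

# Restart the adapter pods so they pick up the annotated identity.
kubectl delete pods --all -n custom-metrics

# Query the custom metrics API; a JSON APIResourceList response
# means the adapter is up and serving.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
```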
viniciusccarvalho commented 4 years ago

I have the same issue. @aubm's steps do not work either. With WI on the adapter, it fails with errors:

2020-08-03 17:26:32.524 EDT Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

The annotated service account does point to the right GSA, but this simply will not work as expected.

aubm commented 4 years ago

@viniciusccarvalho did you try kubectl delete pods --all -n custom-metrics after running my previous commands?

viniciusccarvalho commented 4 years ago

Yes, I deleted everything, even the namespace, and it still won't work. Running 1.16.11-gke.5 on my cluster. Still no luck.

apurvc commented 4 years ago

I managed to make it work with WI using the following approach:

gcloud iam service-accounts create custom-metrics-sd-adapter --project "$GCP_PROJECT_ID"

gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.editor"

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  "iam.gke.io/gcp-service-account=custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --namespace custom-metrics

This worked for me. I am using 1.15.12-gke.2.

stevenarvar commented 3 years ago

Running 1.17.14-gke.1600. Ran into this issue. I followed the steps described in the README: https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/custom-metrics-stackdriver-adapter/README.md

The instructions are the same as @aubm's. Actually, the first time I configured it, CMSA worked with WI. Then I needed to replace the GSA, so I re-annotated the K8s service account. My new GSA has the same roles as the original working GSA, and I don't see any misconfiguration. The new config is outputting these errors:

E0209 19:19:11.124054       1 provider.go:270] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden
E0209 19:19:11.220395       1 provider.go:270] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden
E0209 19:19:11.316925       1 provider.go:270] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden
stevenarvar commented 3 years ago

After waiting a while, I do see my CMSA and WI working fine. Not sure why GSA/WI/CMSA does not work right away. Maybe it took some time for GCP to sync up the IAM bindings.

jharshman commented 3 years ago

There appear to be a few issues here.

One is that the GKE metadata server does not support all of the endpoints that the GCE metadata server does. So if you run this workload without host networking enabled in a cluster with Workload Identity enabled, it fails immediately and gets thrown into a crash loop with the following error: Failed to get GCE config: error while getting instance (node) name: metadata: GCE metadata "instance/name" not defined

This makes sense, since there is no instance/name endpoint on the GKE metadata server. Supported endpoints are:

attributes/
hostname
id
service-accounts/
zone
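
The difference can be observed from inside a pod on a Workload Identity node pool (a sketch; run it in-cluster, since metadata.google.internal only resolves there):

```shell
# Supported on both the GCE and GKE metadata servers:
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/zone"

# Not in the list above, so the GKE metadata server returns 404,
# which is what trips up the adapter's GCE-config lookup:
curl -s -o /dev/null -w "%{http_code}\n" -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/name"
```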

The second issue is that when not using Workload Identity directly, and instead setting the GOOGLE_APPLICATION_CREDENTIALS environment variable with a service account JSON mounted into the pod, authentication starts to fail.

W0210 00:09:53.360350       1 stackdriver.go:91] Error while fetching metric descriptors for kube-state-metrics: Get https://monitoring.googleapis.com/v3/projects/REDACTED/metricDescriptors?alt=json&filter=metric.type+%3D+starts_with%28%22custom.googleapis.com%2Fkube-state-metrics%22%29&prettyPrint=false: oauth2: cannot fetch token: 400 Bad Request
Response: {"error":"invalid_scope","error_description":"Invalid OAuth scope or ID token audience provided."}

The prometheus-to-sd DaemonSet that comes with GKE as part of the core tooling appears to use host networking to bypass the GKE metadata server and use the GCE metadata server.

If there are issues with this project functioning with Workload Identity enabled or with host networking turned off, perhaps some documentation would help.

AnthonMS commented 2 years ago
gcloud iam service-accounts create custom-metrics-sd-adapter --project "$GCP_PROJECT_ID"

gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.editor"

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  "iam.gke.io/gcp-service-account=custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --namespace custom-metrics

I have finally gotten past the stage where it says permission denied, by cleaning up all the services and other resources the other adapter YAML config creates. I ran the commands above to create the service account, bind the correct roles, and create the services and deployment.

It does, however, look like I am getting a new error, and I will post the logs below. Ideally I would like to scale based on FPM metrics like in the example here, but I got the permission denied error in the logs after applying that adapter.yaml, and I was also getting a NaN error in the prometheus-to-sd container. But that's for another day.

I thought that if I applied this adapter to my Google project/GKE cluster, I would be able to scale based on request_per_second to the pods, something like the first example in this custom metrics adapter.

E0707 08:13:38.145404       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="c85dd773-5209-4482-b579-453fb45609cc"
E0707 08:13:38.145544       1 timeout.go:135] post-timeout activity - time-elapsed: 4.57µs, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:13:38.145943       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:13:38.146055       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="1f67ea8f-7156-4051-a592-eb2e1d6f784a"
E0707 08:13:38.146170       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="22365708-84de-47d9-83c4-f3a65e1ec541"
E0707 08:13:38.146237       1 timeout.go:135] post-timeout activity - time-elapsed: 3.409µs, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:13:38.147755       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:13:38.151491       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:13:38.157277       1 timeout.go:135] post-timeout activity - time-elapsed: 105.423502ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:13:38.158473       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:13:38.159764       1 timeout.go:135] post-timeout activity - time-elapsed: 13.63787ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:07.944406       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.944539       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:07.944614       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="5b892969-1535-4b4c-9bcf-098b41715a3f"
E0707 08:14:07.950119       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:07.950496       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.951429       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.951550       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="81a6e59c-c9fd-4928-9970-3139892733df"
E0707 08:14:07.951884       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:07.955355       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="51b15586-d512-46ab-8043-96f15ba83db0"
E0707 08:14:07.955692       1 timeout.go:135] post-timeout activity - time-elapsed: 10.996794ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:07.956311       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="b3f23a0f-7a84-4fe0-93ed-11c74277106e"
E0707 08:14:07.956580       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:07.957598       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:08.046439       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:08.052644       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.053480       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.054590       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.057312       1 timeout.go:135] post-timeout activity - time-elapsed: 101.863564ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:08.144599       1 timeout.go:135] post-timeout activity - time-elapsed: 192.981022ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:08.145462       1 timeout.go:135] post-timeout activity - time-elapsed: 189.04884ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:08.146271       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:08.146469       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="33c48cb9-9f1d-4ed3-99f7-bb6b56f21a33"
E0707 08:14:08.150336       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:08.152789       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:08.154572       1 timeout.go:135] post-timeout activity - time-elapsed: 7.988662ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:37.747766       1 writers.go:111] apiserver was unable to close cleanly the response writer: http2: stream closed
E0707 08:14:37.852388       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="4ea88186-56b4-47a4-b73d-6a680fa8ce2f"
E0707 08:14:37.853284       1 timeout.go:135] post-timeout activity - time-elapsed: 13.497µs, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:38.046613       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:38.047320       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:38.047827       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="d12d9ec7-92c8-459d-ab49-f9953abda738"
E0707 08:14:38.054379       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0707 08:14:38.054542       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="9d28f636-0bb0-495a-9526-c81ebb4d0c98"
E0707 08:14:38.145771       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:38.147100       1 timeout.go:135] post-timeout activity - time-elapsed: 98.88436ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0707 08:14:38.152042       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0707 08:14:38.155348       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0707 08:14:38.156855       1 timeout.go:135] post-timeout activity - time-elapsed: 102.231415ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0707 08:14:38.158736       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="3bf9eb85-5fb1-4d18-8ec8-988ce1eca65b"
E0707 08:14:38.158967       1 writers.go:111] apiserver was unable to close cleanly the response writer: http: Handler timeout
E0707 08:14:38.161342       1 timeout.go:135] post-timeout activity - time-elapsed: 2.336539ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
msathe-tech commented 2 years ago

I tried all the known tricks in the book: created the namespace ahead of time, created the needed K8s SA (custom-metrics-stackdriver-adapter), annotated the K8s SA with the GCP SA, the GCP SA already has the Monitoring Editor role, and then used https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

I still get the following error:
E0822 20:17:27.104881       1 provider.go:271] Failed request to stackdriver api: Get "https://monitoring.googleapis.com/v3/projects/<project>/metricDescriptors?alt=json&filter=resource.labels.project_id+%3D+%22prj-gke-mt-spike%22+AND+resource.labels.cluster_name+%3D+%22<cluster>%22+AND+resource.labels.location+%3D+%22<c cluster-zone>%22+AND+resource.type+%3D+one_of%28%22k8s_pod%22%2C%22k8s_node%22%2C%22k8s_container%22%29&prettyPrint=false": compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden: The caller does not have permission
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
        https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to
msathe-tech commented 2 years ago

My GKE version is 1.24.3-gke.200

Without this working, the HPA simply doesn't work with custom metrics. You need to rely on downloading an SA key, which is against security best practice.

eahrend commented 2 years ago

Hey, I'm getting this on 1.22.12-gke.300 as well, trying to use WIF.

I'm also getting this error:

E0927 18:27:46.752161       1 timeout.go:135] post-timeout activity - time-elapsed: 22.306873ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0927 18:27:46.755417       1 timeout.go:135] post-timeout activity - time-elapsed: 25.534497ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>

However, when I run $ kubectl proxy --port=8080 and go to http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta2 and http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta1, the response is not nil and arrives almost instantaneously.

red8888 commented 1 year ago

Can you confirm what service account is used by default by the metrics adapter? Is it the node's GSA?

I assign my own GSA to the node pools:

resource "google_container_node_pool" "mypool" {
  name       = "sdfsdfsdf"
  cluster    = google_container_cluster.cluster.name
  .....
  node_config {
    machine_type = "e2-highmem-4"
    // Assign a service account
    service_account = google_service_account.node-pool.email

Can you confirm that this GSA will need access? I don't need the adapter deployment to use Workload Identity itself. I'm OK with giving the node pool account this access.

I granted my node pool GSA the Monitoring Viewer role but still seeing this error in the deployment: Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

@pdecat, in your case, running CMSA on a node with Workload Identity (WI) enabled broke it probably because the Google Service Account (GSA) associated with WI that CMSA was running as didn't have the roles/monitoring.viewer role on the relevant GCP projects that hold the metrics CMSA is trying to query.

When you changed CMSA to run on host network mode, it probably worked because now CMSA is using the GKE node's default GSA which is different than the WI-related GSA. This GKE node default GSA probably has roles/monitoring.viewer role or at least those permissions to query the metrics.

red8888 commented 1 year ago

I tried all the known tricks in the book: created the namespace ahead of time, created the needed K8s SA (custom-metrics-stackdriver-adapter), annotated the K8s SA with the GCP SA, the GCP SA already has the Monitoring Editor role, and then used https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

I still get the following error:
E0822 20:17:27.104881       1 provider.go:271] Failed request to stackdriver api: Get "https://monitoring.googleapis.com/v3/projects/<project>/metricDescriptors?alt=json&filter=resource.labels.project_id+%3D+%22prj-gke-mt-spike%22+AND+resource.labels.cluster_name+%3D+%22<cluster>%22+AND+resource.labels.location+%3D+%22<c cluster-zone>%22+AND+resource.type+%3D+one_of%28%22k8s_pod%22%2C%22k8s_node%22%2C%22k8s_container%22%29&prettyPrint=false": compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden: The caller does not have permission
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
        https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

Sounds like you're missing this piece:

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"
iamhritik commented 1 year ago

I'm also getting the same error in the custom-stackdriver pod. Any new updates?

E0214 16:25:51.059563       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="de9aad6d-13d9-4f88-86fc-ef73c7eb568f"
E0214 16:25:51.059907       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="9ac55f68-c490-4953-b32e-d775adfb056d"
E0214 16:25:51.060088       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="5f791624-f01f-45c7-8d8f-76e43bf56a9c"

However when I run $ kubectl proxy --port=8080 and go to http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta2 and http://127.0.0.1:8080/apis/custom.metrics.k8s.io/v1beta1 the response is not nil and happens almost instantaneously.

nguyen-viet-hung commented 1 year ago

I have the same error as above:

E0221 03:19:29.283344       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0221 03:19:29.283406       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="927bf416-71ba-4845-880a-8ba80c0b044d"
E0221 03:19:29.283529       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="898079ad-6b51-45e9-b3ec-b723312ba8ff"

When I do kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq the response is not nil and happens almost instantaneously.

Does anybody have solutions? My cluster version is: v1.24.9-gke.2000

perrornet commented 1 year ago

I have the same error as above:

E0221 03:19:29.283344       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0221 03:19:29.283406       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="927bf416-71ba-4845-880a-8ba80c0b044d"
E0221 03:19:29.283529       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="898079ad-6b51-45e9-b3ec-b723312ba8ff"

When I do kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq the response is not nil and happens almost instantaneously.

Does anybody have solutions? My cluster version is: v1.24.9-gke.2000

I have also encountered this situation.

E0227 08:50:42.495351       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="b63c2246-27c1-4ece-a779-e552782f1dcd"
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
E0227 08:50:42.523800       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="67d82f5f-5360-404b-b688-9639d9a89a88"
E0227 08:50:42.523807       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0227 08:50:42.526275       1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0227 08:50:42.527535       1 timeout.go:135] post-timeout activity - time-elapsed: 32.12511ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>

My cluster version is: v1.25.5-gke.2000

Wazbat commented 1 year ago

Shame this adapter isn't included by default. Struggling to resolve this on my end.

Permission errors with Workload Identity, and even with hostNetwork: true I start to get these errors:

post-timeout activity - time-elapsed: 12.553843ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
maxpain commented 1 year ago

Any updates?

MikSFG commented 1 year ago

I have the same error as above:

E0221 03:19:29.283344       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0221 03:19:29.283406       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="927bf416-71ba-4845-880a-8ba80c0b044d"
E0221 03:19:29.283529       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="898079ad-6b51-45e9-b3ec-b723312ba8ff"

When I do kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq the response is not nil and happens almost instantaneously. Does anybody have solutions? My cluster version is: v1.24.9-gke.2000

I have also encountered this situation.

E0227 08:50:42.495351       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="b63c2246-27c1-4ece-a779-e552782f1dcd"
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
E0227 08:50:42.523800       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="67d82f5f-5360-404b-b688-9639d9a89a88"
E0227 08:50:42.523807       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0227 08:50:42.523764       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0227 08:50:42.526275       1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0227 08:50:42.527535       1 timeout.go:135] post-timeout activity - time-elapsed: 32.12511ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>

My cluster version is: v1.25.5-gke.2000

Happens to me as well, kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq works ok, and gives normal output.

IanKnighton commented 1 year ago

We're trying to set up the custom metrics so we can scale pods off of messages in a Pub/Sub queue.

Currently can't get past this error in the logs for the custom-metrics-stackdriver-adapter pod.

E0912 19:48:53.983876       1 provider.go:320] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden
E0912 19:48:53.984071       1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
E0912 19:48:53.984139       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0912 19:48:53.985476       1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0912 19:48:53.986869       1 timeout.go:135] post-timeout activity - time-elapsed: 9m10.432424127s, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>

I've tried every combination of service account I can think of, and as far as I can tell all of our nodes and pods have at least the monitoring.metricDescriptors.list permission.

It's kind of wild to me that this problem appears to be all over the place. Also kind of annoying that I just followed the documentation and now we're here.
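For anyone hitting the same 403, the usual Workload Identity wiring looks roughly like the following. This is a sketch, not an official recipe: the project ID, Google service account name, namespace, and Kubernetes service account name are placeholders you must substitute with your own.

```shell
# Placeholders -- replace with your own values.
PROJECT_ID=my-project
GSA=cmsa-sa@${PROJECT_ID}.iam.gserviceaccount.com

# 1. Let the Google service account read monitoring data.
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:${GSA}" \
  --role roles/monitoring.viewer

# 2. Let the adapter's Kubernetes service account impersonate the GSA.
gcloud iam service-accounts add-iam-policy-binding "${GSA}" \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:${PROJECT_ID}.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]"

# 3. Annotate the Kubernetes service account with the GSA it maps to.
kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  --namespace custom-metrics \
  iam.gke.io/gcp-service-account="${GSA}"
```

If any of the three pieces is missing, the adapter falls back to (or fails to get) the node/metadata-server identity, which is one way to end up with the `monitoring.metricDescriptors.list denied` error above.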

markhc commented 12 months ago

Same issue on our end. I followed every step in the GCP README and the alternative methods here as well, and I still get 403 errors.

I can get the 403 errors resolved by setting hostNetwork: true on the deployment, but then other issues pop up and the pod enters a crash loop every couple of seconds with these GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil> errors.

GKE Cluster Version 1.24.15-gke.1700


EDIT: Finally managed to get it working. The crash loop I mentioned above was an OOMKilled, so I had to increase the resources for the custom-metrics Deployment.

Final working steps:

  1. Follow @aubm's steps and install the adapter.yaml resources, NOT adapter_new_resource_model.yaml. The new adapter entered a different crash loop for me that I was not able to solve.
  2. Modify the adapter Deployment, adding hostNetwork: true to the pod spec. This resolves the "403 Forbidden" errors.
  3. Increase the request/limit resources. See what works for you, but my adapter frequently reaches ~350Mi, and it originally had only a 200Mi limit.
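For reference, steps 2 and 3 amount to a patch like the fragment below. The container name and the numbers are what worked for me, not official recommendations; check the container name in your copy of adapter.yaml before applying.

```
# Fragment of the adapter Deployment pod spec (adapter.yaml).
spec:
  template:
    spec:
      hostNetwork: true            # uses the node's network identity; works around the 403s
      containers:
      - name: pod-custom-metrics-stackdriver-adapter
        resources:
          requests:
            cpu: 250m
            memory: 200Mi
          limits:
            memory: 400Mi          # raised from 200Mi to stop the OOMKills
```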
ltieman commented 11 months ago

I was still getting 403 exceptions using @aubm 's instructions until I added this:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-metrics-permissions
rules:
- apiGroups: [""]
  resources: ["subjectaccessreviews"]
  verbs: ["create"]

---

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-metrics-binding
subjects:
- kind: ServiceAccount
  name: custom-metrics-stackdriver-adapter
  namespace: custom-metrics  # Replace with the appropriate namespace
roleRef:
  kind: ClusterRole
  name: custom-metrics-permissions
  apiGroup: rbac.authorization.k8s.io

once the IAM permissions propagated, it started working

PaulRudin commented 9 months ago

Just to be clear: is monitoring.editor necessary? You'd have thought monitoring.viewer would be enough, and that's what the README says.

Although it's somewhat academic in my case as I'm getting:

E1219 16:02:17.832184       1 provider.go:320] Failed request to stackdriver api: Get "https://monitoring.go
2023-12-19T16:02:17.832315130Z This error could be caused by a missing IAM policy binding on the target IAM service account.
2023-12-19T16:02:17.832320455Z For more information, refer to the Workload Identity documentation:
2023-12-19T16:02:17.832323944Z     https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

either way.

PaulRudin commented 9 months ago

After applying the suggestion here, the permissions issue is fixed, but I still get loads of this sort of thing in the logs:

E1220 07:24:32.869016       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="9c852465-5ffa-4d77-a441-f801886e29e3"
E1220 07:24:32.869108       1 writers.go:111] apiserver was unable to close cleanly the response writer: http2: stream closed
E1220 07:24:32.871209       1 timeout.go:135] post-timeout activity - time-elapsed: 102.731383ms, GET "/api/custom.metrics.k8s.io/v1beta1" result: <nil>
E1220 07:24:32.873362       1 timeout.go:135] post-timeout activity - time-elapsed: 103.207074ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E1220 07:24:32.874493       1 timeout.go:135] post-timeout activity - time-elapsed: 109.99919ms, GET "/api/custom.metrics.k8s.io/v1beta2" result: <nil>
E1220 07:24:32.876604       1 timeout.go:135] post-timeout activity - time-elapsed: 7.572859ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
PaulRudin commented 9 months ago

... and either I was mistaken, or the issue has resurfaced; I still see this sort of thing:

apiserver received an error that is not an metav1.Status: &googleapi.Error{Code:403, Message:"Permission monitoring.timeSeries.list denied (or the resource may not exist).", Details:[]interface {}(nil), Body:"{\n  \"error\": {\n    \"code\": 403,\n    \"message\": \"Permission monitoring.timeSeries.list denied (or the resource may not exist).\",\n    \"errors\": [\n      {\n        \"message\": \"Permission monitoring.timeSeries.list denied (or the resource may not exist).\",\n        \"domain\": \"global\",\n        \"reason\": \"forbidden\"\n      }\n    ],\n    \"status\": \"PERMISSION_DENIED\"\n  }\n}\n", Header:http.Header(nil), Errors:[]googleapi.ErrorItem{googleapi.ErrorItem{Reason:"forbidden", Message:"Permission monitoring.timeSeries.list denied (or the resource may not exist)."}}}: googleapi: Error 403: Permission monitoring.timeSeries.list denied (or the resource may not exist)., forbidden

even though the relevant service account has the monitoring.viewer role.
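When the role looks right but the 403s persist, it's worth checking which identity the pod is actually getting from the metadata server. One way (a sketch; the namespace and service-account name are assumptions matching the manifests above, and `wi-debug` is just a throwaway pod name) is to launch a debug pod under the adapter's Kubernetes service account and ask gcloud who it is:

```shell
kubectl run wi-debug -n custom-metrics --rm -it --restart=Never \
  --image=google/cloud-sdk:slim \
  --overrides='{"spec":{"serviceAccountName":"custom-metrics-stackdriver-adapter"}}' \
  -- gcloud auth list
```

If this prints your Google service account email, the Workload Identity mapping works and the problem is the IAM role (or propagation delay). If it prints the node's default compute service account instead, the `iam.gke.io/gcp-service-account` annotation or the `roles/iam.workloadIdentityUser` binding is missing or wrong.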