GoogleCloudPlatform / k8s-stackdriver

custom-metrics-stackdriver-adapter not working - auth problems? #498

Open rbarberop opened 1 year ago

rbarberop commented 1 year ago

Hi, I have created a new GCP project and a GKE cluster inside it.

I've followed the instructions in the README...

However, Logs Explorer is complaining about it... it looks like an authentication problem?

  {
    "insertId": "0zumoy6o5n39r4ul",
    "jsonPayload": {
      "pid": "1",
      "message": "apiserver received an error that is not an metav1.Status: &errors.errorString{s:\"http2: stream closed\"}: http2: stream closed"
    },
    "resource": {
      "type": "k8s_container",
      "labels": {
        "pod_name": "custom-metrics-stackdriver-adapter-848d58c7c7-h4s5d",
        "cluster_name": "scheduled-autoscaler",
        "namespace_name": "custom-metrics",
        "container_name": "pod-custom-metrics-stackdriver-adapter",
        "location": "us-central1-f",
        "project_id": "rbarbero-gke-freshness-3"
      }
    },
    "timestamp": "2022-11-23T20:10:54.385935713Z",
    "severity": "ERROR",
    "labels": {
      "compute.googleapis.com/resource_name": "gke-scheduled-autoscaler-default-pool-28d0994f-85hh",
      "k8s-pod/run": "custom-metrics-stackdriver-adapter",
      "k8s-pod/k8s-app": "custom-metrics-stackdriver-adapter",
      "k8s-pod/pod-template-hash": "848d58c7c7",
      "k8s-pod/kubernetes_io/cluster-service": "true"
    },
    "logName": "projects/rbarbero-gke-freshness-3/logs/stderr",
    "sourceLocation": {
      "file": "status.go",
      "line": "71"
    },
    "receiveTimestamp": "2022-11-23T20:10:57.325326887Z"
  },
  {
    "insertId": "7crfx9uy4t13qxty",
    "jsonPayload": {
      "pid": "1",
      "message": "post-timeout activity - time-elapsed: 108.975022ms, GET \"/apis/custom.metrics.k8s.io/v1beta1\" result: <nil>"
    },
    "resource": {
      "type": "k8s_container",
      "labels": {
        "container_name": "pod-custom-metrics-stackdriver-adapter",
        "cluster_name": "scheduled-autoscaler",
        "pod_name": "custom-metrics-stackdriver-adapter-848d58c7c7-h4s5d",
        "location": "us-central1-f",
        "project_id": "rbarbero-gke-freshness-3",
        "namespace_name": "custom-metrics"
      }
    },
    "timestamp": "2022-11-23T20:10:54.389245350Z",
    "severity": "ERROR",
    "labels": {
      "k8s-pod/run": "custom-metrics-stackdriver-adapter",
      "k8s-pod/k8s-app": "custom-metrics-stackdriver-adapter",
      "k8s-pod/pod-template-hash": "848d58c7c7",
      "k8s-pod/kubernetes_io/cluster-service": "true",
      "compute.googleapis.com/resource_name": "gke-scheduled-autoscaler-default-pool-28d0994f-85hh"
    },
    "logName": "projects/rbarbero-gke-freshness-3/logs/stderr",
    "sourceLocation": {
      "file": "timeout.go",
      "line": "135"
    },
    "receiveTimestamp": "2022-11-23T20:10:57.325326887Z"
  },
  {
    "insertId": "484cm7yu3ea24ke2",
    "jsonPayload": {
      "message": "apiserver was unable to write a fallback JSON response: http2: stream closed",
      "pid": "1"
    },
    "resource": {
      "type": "k8s_container",
      "labels": {
        "pod_name": "custom-metrics-stackdriver-adapter-848d58c7c7-h4s5d",
        "location": "us-central1-f",
        "namespace_name": "custom-metrics",
        "container_name": "pod-custom-metrics-stackdriver-adapter",
        "project_id": "rbarbero-gke-freshness-3",
        "cluster_name": "scheduled-autoscaler"
      }
    },
    "timestamp": "2022-11-23T20:10:54.391067637Z",
    "severity": "ERROR",
    "labels": {
      "k8s-pod/pod-template-hash": "848d58c7c7",
      "k8s-pod/run": "custom-metrics-stackdriver-adapter",
      "k8s-pod/k8s-app": "custom-metrics-stackdriver-adapter",
      "k8s-pod/kubernetes_io/cluster-service": "true",
      "compute.googleapis.com/resource_name": "gke-scheduled-autoscaler-default-pool-28d0994f-85hh"
    },
    "logName": "projects/rbarbero-gke-freshness-3/logs/stderr",
    "sourceLocation": {
      "file": "writers.go",
      "line": "130"
    },
    "receiveTimestamp": "2022-11-23T20:10:57.325326887Z"
  },
  {
    "insertId": "vku7ud5xhug364zm",
    "jsonPayload": {
      "message": "post-timeout activity - time-elapsed: 193.853136ms, GET \"/apis/custom.metrics.k8s.io/v1beta1\" result: <nil>",
      "pid": "1"
    },
    "resource": {
      "type": "k8s_container",
      "labels": {
        "namespace_name": "custom-metrics",
        "location": "us-central1-f",
        "container_name": "pod-custom-metrics-stackdriver-adapter",
        "cluster_name": "scheduled-autoscaler",
        "pod_name": "custom-metrics-stackdriver-adapter-848d58c7c7-h4s5d",
        "project_id": "rbarbero-gke-freshness-3"
      }
    },
    "timestamp": "2022-11-23T20:10:54.486030132Z",
    "severity": "ERROR",
    "labels": {
      "k8s-pod/kubernetes_io/cluster-service": "true",
      "k8s-pod/k8s-app": "custom-metrics-stackdriver-adapter",
      "k8s-pod/run": "custom-metrics-stackdriver-adapter",
      "compute.googleapis.com/resource_name": "gke-scheduled-autoscaler-default-pool-28d0994f-85hh",
      "k8s-pod/pod-template-hash": "848d58c7c7"
    },
    "logName": "projects/rbarbero-gke-freshness-3/logs/stderr",
    "sourceLocation": {
      "file": "timeout.go",
      "line": "135"
    },
    "receiveTimestamp": "2022-11-23T20:10:57.325326887Z"
  },
  {
    "insertId": "u55i6pz9rfhil94j",
    "jsonPayload": {
      "message": "timeout or abort while handling: method=GET URI=\"/apis/custom.metrics.k8s.io/v1beta1\" audit-ID=\"3e30bad8-85bc-46e4-9a9b-0680dbfe47f8\"",
      "pid": "1"
    },
    "resource": {
      "type": "k8s_container",
      "labels": {
        "container_name": "pod-custom-metrics-stackdriver-adapter",
        "location": "us-central1-f",
        "cluster_name": "scheduled-autoscaler",
        "namespace_name": "custom-metrics",
        "project_id": "rbarbero-gke-freshness-3",
        "pod_name": "custom-metrics-stackdriver-adapter-848d58c7c7-h4s5d"
      }
    },
    "timestamp": "2022-11-23T20:10:54.502749250Z",
    "severity": "ERROR",
    "labels": {
      "k8s-pod/run": "custom-metrics-stackdriver-adapter",
      "k8s-pod/pod-template-hash": "848d58c7c7",
      "k8s-pod/kubernetes_io/cluster-service": "true",
      "compute.googleapis.com/resource_name": "gke-scheduled-autoscaler-default-pool-28d0994f-85hh",
      "k8s-pod/k8s-app": "custom-metrics-stackdriver-adapter"
    },
    "logName": "projects/rbarbero-gke-freshness-3/logs/stderr",
    "sourceLocation": {
      "file": "wrap.go",
      "line": "54"
    },
    "receiveTimestamp": "2022-11-23T20:10:57.325326887Z"
  },
  {
    "insertId": "foryseqqz48yqfbg",
    "jsonPayload": {
      "pid": "1",
      "message": "apiserver was unable to close cleanly the response writer: http: Handler timeout"
    },
    "resource": {
      "type": "k8s_container",
      "labels": {
        "project_id": "rbarbero-gke-freshness-3",
        "pod_name": "custom-metrics-stackdriver-adapter-848d58c7c7-h4s5d",
        "namespace_name": "custom-metrics",
        "cluster_name": "scheduled-autoscaler",
        "location": "us-central1-f",
        "container_name": "pod-custom-metrics-stackdriver-adapter"
      }
    },
    "timestamp": "2022-11-23T20:10:54.502797138Z",
    "severity": "ERROR",
    "labels": {
      "compute.googleapis.com/resource_name": "gke-scheduled-autoscaler-default-pool-28d0994f-85hh",
      "k8s-pod/kubernetes_io/cluster-service": "true",
      "k8s-pod/run": "custom-metrics-stackdriver-adapter",
      "k8s-pod/k8s-app": "custom-metrics-stackdriver-adapter",
      "k8s-pod/pod-template-hash": "848d58c7c7"
    },
    "logName": "projects/rbarbero-gke-freshness-3/logs/stderr",
    "sourceLocation": {
      "file": "writers.go",
      "line": "111"
    },
    "receiveTimestamp": "2022-11-23T20:10:57.325326887Z"
  },
  {
    "insertId": "ic9b2l6ofwbvvdx0",
    "jsonPayload": {
      "message": "post-timeout activity - time-elapsed: 4.947211ms, GET \"/apis/custom.metrics.k8s.io/v1beta1\" result: <nil>",
      "pid": "1"
    },
    "resource": {
      "type": "k8s_container",
      "labels": {
        "pod_name": "custom-metrics-stackdriver-adapter-848d58c7c7-h4s5d",
        "project_id": "rbarbero-gke-freshness-3",
        "namespace_name": "custom-metrics",
        "location": "us-central1-f",
        "cluster_name": "scheduled-autoscaler",
        "container_name": "pod-custom-metrics-stackdriver-adapter"
      }
    },
    "timestamp": "2022-11-23T20:10:54.502805932Z",
    "severity": "ERROR",
    "labels": {
      "k8s-pod/run": "custom-metrics-stackdriver-adapter",
      "compute.googleapis.com/resource_name": "gke-scheduled-autoscaler-default-pool-28d0994f-85hh",
      "k8s-pod/pod-template-hash": "848d58c7c7",
      "k8s-pod/k8s-app": "custom-metrics-stackdriver-adapter",
      "k8s-pod/kubernetes_io/cluster-service": "true"
    },
    "logName": "projects/rbarbero-gke-freshness-3/logs/stderr",
    "sourceLocation": {
      "file": "timeout.go",
      "line": "135"
    },
    "receiveTimestamp": "2022-11-23T20:10:57.325326887Z"
  }
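
(Side note for anyone triaging: the request these entries show timing out can be reproduced directly; nothing beyond a standard kubectl with access to the cluster is assumed here.)

# Probe the adapter's API group directly - this is the same
# GET /apis/custom.metrics.k8s.io/v1beta1 request the logs show timing out.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"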
vschettino-asimov commented 1 year ago

Looks like I have the same issue:

E1205 21:46:49.429171       1 timeout.go:135] post-timeout activity - time-elapsed: 3.304368ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E1205 21:46:49.523578       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="71f72960-a980-4afb-ad1d-03e1b6bec66f"
E1205 21:46:49.523649       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E1205 21:46:49.523716       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="79f73bf9-dc54-4676-936a-aa819a77194e"
E1205 21:46:49.525281       1 writers.go:111] apiserver was unable to close cleanly the response writer: http: Handler timeout
E1205 21:46:49.525327       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E1205 21:46:49.526374       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E1205 21:46:49.528673       1 timeout.go:135] post-timeout activity - time-elapsed: 4.796544ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
sethTrustle commented 1 year ago

I have the same issue: post-timeout activity - time-elapsed: 109.781784ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>

wyardley commented 1 year ago

I have a high volume of the same / similar errors. That said, the adapter itself works, AFAIK.

clarkmcc commented 1 year ago

Same issue; also, I had to add this resource to my cluster in order to get it to start up:

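# Binds the adapter's ServiceAccount to the built-in
# extension-apiserver-authentication-reader Role, so the adapter can read the
# extension-apiserver-authentication ConfigMap in kube-system.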
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: custom-metrics-extension-apiserver-authentication-reader
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: custom-metrics-stackdriver-adapter
    namespace: custom-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
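
(Applying and verifying the binding, for anyone reproducing this - the filename is just whatever you saved the manifest above as:)

# Apply the RoleBinding above, then confirm it landed in kube-system.
kubectl apply -f custom-metrics-rolebinding.yaml
kubectl get rolebinding custom-metrics-extension-apiserver-authentication-reader -n kube-system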
PaulRudin commented 1 year ago

Same issue; also, I had to add this resource to my cluster in order to get it to start up:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: custom-metrics-extension-apiserver-authentication-reader
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: custom-metrics-stackdriver-adapter
    namespace: custom-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader

I tried this but nothing changed. When you say you "got it to start up", what exactly does "it" refer to? The Stackdriver adapter pod? In my case the pod is running, but it has loads of log entries like:

E0208 17:56:12.201768 1 provider.go:320] Failed request to stackdriver api: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden

Maybe it's a different issue?

clarkmcc commented 1 year ago

@PaulRudin just spitballing here, but that error looks like your cluster's service account doesn't have permission to call the GCP monitoring API.
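
If that's the case, granting read access would look roughly like this (PROJECT_ID and SA_EMAIL are placeholders for whichever service account the adapter's nodes, or its Workload Identity binding, actually use):

# Grant Cloud Monitoring read access to the adapter's service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/monitoring.viewer"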

PaulRudin commented 1 year ago

Yeah, but nothing has changed recently as far as I know. Then again, maybe I inadvertently modified something while changing something unrelated.

morphalus commented 1 year ago

Hi!

I have exactly the same issue, and it appeared suddenly around the same time.

I too suspected a permission or scope issue, but the node where the adapter runs has the cloud-platform scope, and the service account has the Monitoring Viewer role.

I also think it is not a permission or scope issue, since what we are seeing are timeout errors.

I am still investigating. Maybe something has changed on the GCP side.
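
(For anyone double-checking the same things, the node-pool scopes can be read back like this - the pool, cluster, and zone names are the ones from the logs earlier in this thread, so adjust them to your setup:)

# Print the OAuth scopes of the node pool the adapter runs on.
gcloud container node-pools describe default-pool \
  --cluster=scheduled-autoscaler \
  --zone=us-central1-f \
  --format="value(config.oauthScopes)"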

PaulRudin commented 1 year ago

OK - so in my case I had inadvertently changed the service account, and the permission-denied problem is now fixed. But I do still see messages similar to those reported by others:

E0209 10:59:56.474287       1 timeout.go:135] post-timeout activity - time-elapsed: 15.70529ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0209 10:59:56.554654       1 writers.go:111] apiserver was unable to close cleanly the response writer: http: Handler timeout
E0209 10:59:56.555141       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0209 10:59:56.556250       1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0209 10:59:56.558605       1 timeout.go:135] post-timeout activity - time-elapsed: 93.570353ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0209 10:59:56.559753       1 timeout.go:135] post-timeout activity - time-elapsed: 102.936145ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0209 10:59:56.560900       1 timeout.go:135] post-timeout activity - time-elapsed: 102.266556ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
wyardley commented 1 year ago

Sorry to slightly hijack, but I'm also curious what folks who are using Workload Identity do - do you create a GCP service account for the custom metrics adapter and bind it to the Kube service account? I assume the cluster role bindings don't provide Google API-level access, and with Workload Identity, I don't think the pod will implicitly have the creds of the node pool's service account either.

I guess https://github.com/GoogleCloudPlatform/k8s-stackdriver/issues/315 kind of covers this. For what it's worth, the usual wiring is sketched below.
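
A rough sketch of that wiring, using the adapter's ServiceAccount name and namespace from this thread; the Google service-account name and PROJECT_ID are placeholders:

# 1. Create a dedicated Google service account for the adapter.
gcloud iam service-accounts create custom-metrics-adapter --project=PROJECT_ID

# 2. Let it read Cloud Monitoring.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:custom-metrics-adapter@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/monitoring.viewer"

# 3. Allow the Kubernetes ServiceAccount to impersonate it via Workload Identity.
gcloud iam service-accounts add-iam-policy-binding \
  custom-metrics-adapter@PROJECT_ID.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]"

# 4. Annotate the Kubernetes ServiceAccount so its pods pick up the Google identity.
kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  --namespace=custom-metrics \
  iam.gke.io/gcp-service-account=custom-metrics-adapter@PROJECT_ID.iam.gserviceaccount.com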

sosimon commented 1 year ago

We're seeing the same error messages, but the adapter appears to be functional. It would be nice to understand what the errors mean and what changes we need to make to reduce the noise.

matiasah commented 1 year ago

@sosimon how do you test that it is functional? Are you reading a metric with kubectl?
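
(For reference, a direct check against the adapter looks something like this - the namespace and metric name are placeholders, and this adapter expects "/" in Stackdriver metric names to be written as "|":)

# List the metrics the adapter currently exposes.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'

# Read one metric for all pods in a namespace.
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/custom.googleapis.com|my-metric"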

sosimon commented 1 year ago

@matiasah I think our HPA is working. Some of the logs shown here are the same as the ones in https://github.com/GoogleCloudPlatform/k8s-stackdriver/issues/510 - not the auth errors, though; those should be, and need to be, resolved by giving the service account the right permissions.