Failed to grab CNI endpoint: the server is currently unable to handle the request

hiteshghia commented 2 years ago

I followed the instructions here to set up cni-metrics-helper, except that I created the resources via our automation tool rather than eksctl.

On inspecting the logs of cni-helper:


{"level":"info","ts":"2022-09-21T01:44:38.471Z","caller":"cni-metrics-helper/main.go:45","msg":"Constructed new logger instance"}
{"level":"info","ts":"2022-09-21T01:44:38.471Z","caller":"runtime/proc.go:250","msg":"Starting CNIMetricsHelper. Sending metrics to CloudWatch: true, LogLevel Debug"}
{"level":"info","ts":"2022-09-21T01:44:38.495Z","caller":"cni-metrics-helper/main.go:119","msg":"Using REGION=us-east-1 and CLUSTER_ID=us-east-1-k8s-cloud"}
{"level":"info","ts":"2022-09-21T01:45:08.495Z","caller":"runtime/proc.go:250","msg":"Collecting metrics ..."}
{"level":"info","ts":"2022-09-21T01:45:08.596Z","caller":"metrics/cni_metrics.go:195","msg":"Total aws-node pod count:- %!(EXTRA int=6)"}
{"level":"error","ts":"2022-09-21T01:47:18.033Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-sx47t:61678)"}
{"level":"error","ts":"2022-09-21T01:49:29.105Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-gdtcw:61678)"}
{"level":"error","ts":"2022-09-21T01:51:40.177Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-xbc9b:61678)"}
{"level":"error","ts":"2022-09-21T01:53:51.249Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-ktpbn:61678)"}
{"level":"error","ts":"2022-09-21T01:56:02.321Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-4tjv5:61678)"}

I followed [this](https://github.com/aws/amazon-vpc-cni-k8s/issues/1912) thread and added the region ENV variables.

I cannot curl from the cni-metrics-helper pod to any of the aws-node (vpc-cni) pods.
Security group allows all node to node communication on all ports.

**Environment**:
- Kubernetes version 1.22.3
- CNI Version 1.11.3
- OS (e.g: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`): Linux IP .amzn2.x86_64 #1 SMP 2022 x86_64 x86_64 x86_64 GNU/Linux

jayanthvn commented 2 years ago

@hiteshghia - Are you using IRSA? Can you also send your clusterARN to k8s-awscni-triage@amazon.com?

hiteshghia commented 2 years ago

Yes using IRSA. Will email, thanks!

hiteshghia commented 2 years ago

Since the VPC CNI daemonset pods use hostNetwork, they would use the host/node DNS resolver and not the cluster DNS (coredns). In that case, wouldn't the cni-metrics-helper pod be unable to reach aws-node:61678? That looks like what cni-metrics-helper is trying to do. Should that work?

hiteshghia commented 2 years ago

Never mind, I see it is using the REST client from client-go. I have already emailed k8s-awscni-triage@amazon.com as well; please let me know what other info you need. Thanks.

hiteshghia commented 2 years ago

I ran a small Go script locally that does the same thing as metrics.go:

res := clientset.CoreV1().RESTClient().Get().
        Namespace("kube-system").
        Resource("pods").
        Name("aws-node-ksvqx:61678").
        SubResource("proxy").
        Suffix("metrics").
        Do(ctx)

And get the same response back: panic: the server is currently unable to handle the request (get pods aws-node-ksvqx:61678)
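
For readers following along: the request chain above goes through the pod `proxy` subresource, so the API server, not the client, opens the connection to the pod on the given port. A minimal sketch of the URL path such a request resolves to, using only the standard library; the helper name `podProxyPath` is hypothetical and is not part of client-go or cni-metrics-helper:

```go
package main

import "fmt"

// podProxyPath builds the API server path for proxying an HTTP request to a
// port on a pod. The "name:port" form in the pod segment selects the port.
// Hypothetical helper for illustration; client-go assembles this internally.
func podProxyPath(namespace, pod string, port int, suffix string) string {
	return fmt.Sprintf("/api/v1/namespaces/%s/pods/%s:%d/proxy/%s",
		namespace, pod, port, suffix)
}

func main() {
	// Mirrors the values used in the snippet above.
	fmt.Println(podProxyPath("kube-system", "aws-node-ksvqx", 61678, "metrics"))
	// prints: /api/v1/namespaces/kube-system/pods/aws-node-ksvqx:61678/proxy/metrics
}
```

Because the last hop is made by the API server, a direct curl from another pod or node exercises a different network path than this request does.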

hiteshghia commented 2 years ago

Tried with version 1.10.2 of the VPC CNI and got the same issue.

jayanthvn commented 2 years ago

We tried curl from the cni-metrics-helper pod to aws-node on the same node. There is no connectivity issue, and we were able to query the metrics. So the problem seems to be a permission issue when going via the API server. Can you please double-check the IRSA role/permissions for cni-metrics-helper?

curl 10.6.12.236:61678/metrics
# HELP awscni_add_ip_req_count The number of add IP address requests
# TYPE awscni_add_ip_req_count counter
awscni_add_ip_req_count 2
# HELP awscni_assigned_ip_addresses The number of IP addresses assigned to pods
# TYPE awscni_assigned_ip_addresses gauge
awscni_assigned_ip_addresses 1
# HELP awscni_assigned_ip_per_cidr The total number of IP addresses assigned per cidr
# TYPE awscni_assigned_ip_per_cidr gauge
awscni_assigned_ip_per_cidr{cidr="10.6.13.11/32"} 1
awscni_assigned_ip_per_cidr{cidr="10.6.14.193/32"} 0
# HELP awscni_aws_api_latency_ms AWS API call latency in ms
# TYPE awscni_aws_api_latency_ms summary
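
The output above is in the Prometheus text format: `# HELP` and `# TYPE` comment lines followed by `name value` samples. As a rough sketch of how such output can be read, assuming samples carry no trailing timestamp; `parseSamples` is a hypothetical helper and is not how cni-metrics-helper itself parses metrics:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseSamples maps each sample name (including any {labels}) to its value.
// Comment lines, blank lines, and lines without a numeric value are skipped.
func parseSamples(text string) map[string]float64 {
	samples := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The value is the last space-separated field on the line.
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			continue
		}
		samples[strings.TrimSpace(line[:i])] = v
	}
	return samples
}

func main() {
	out := "# TYPE awscni_add_ip_req_count counter\nawscni_add_ip_req_count 2\n"
	fmt.Println(parseSamples(out)["awscni_add_ip_req_count"]) // prints 2
}
```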

hiteshghia commented 2 years ago

I switched all the add-ons (coredns, kube-proxy, and the CNI) from self-managed to EKS-managed, and the same issue persists. I also ran the exact request that metrics.go makes (see here) and it gave the same error. For reference, we have another, older cluster in the same account, and there the script worked just fine, talking directly to the aws-node pods using the client-go REST client. What exactly should I be checking for with IRSA? These are the env vars for the cni-metrics-helper pod -

  - env:
    - name: AWS_CLUSTER_ID
      value: us-east-1-k8s-cloud
    - name: USE_CLOUDWATCH
      value: "true"
    - name: AWS_REGION
      value: us-east-1
    - name: AWS_DEFAULT_REGION
      value: us-east-1
    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::xxxxxxxxxx:role/eks-vpc-cni-metrics-helper-cloud
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token

This is the cni metrics helper service account -

kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxx:role/eks-vpc-cni-metrics-helper-cloud
  creationTimestamp: "2022-09-21T01:10:19Z"
  labels:
    app.kubernetes.io/instance: cni-metrics-helper
    app.kubernetes.io/name: cni-metrics-helper
    app.kubernetes.io/version: v1.11.3
    environment: cloud
  name: cni-metrics-helper
  namespace: kube-system
secrets:
- name: cni-metrics-helper-token-pvz2c

This is the policy attached to that role:

{
    "Statement": [
        {
            "Action": "cloudwatch:PutMetricData",
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "eksvpccnimetricshelper"
        }
    ],
    "Version": "2012-10-17"
}

And the following trust relationship:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "eksvpccnimetricshelpertrustpolicy",
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::xxxxxxxxxx:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXX"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.us-east-1.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXX:aud": "sts.amazonaws.com",
                    "oidc.eks.us-east-1.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXX:sub": "system:serviceaccount:kube-system:cni-metrics-helper"
                }
            }
        }
    ]
}

This is the ClusterRoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
  creationTimestamp: "2022-09-21T01:10:22Z"
  labels:
    app.kubernetes.io/instance: cni-metrics-helper
    app.kubernetes.io/name: cni-metrics-helper
    app.kubernetes.io/version: v1.11.3
    environment: cloud
  name: cni-metrics-helper
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cni-metrics-helper
subjects:
- kind: ServiceAccount
  name: cni-metrics-helper
  namespace: kube-system

And this is the ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
  creationTimestamp: "2022-09-21T01:10:20Z"
  labels:
    environment: cloud
  name: cni-metrics-helper
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - pods/proxy
  verbs:
  - get
  - watch
  - list

Also note that the cni-metrics-helper pod is using IRSA, but the VPC CNI itself relies on the node IAM role.

hiteshghia commented 2 years ago

We found the issue on our side. cni-metrics-helper reaches the aws-node pods' metrics endpoint, which is served on port 61678, through the API server's pod proxy, so we had to open that port from the cluster (control plane) security group to the worker nodes' security group for EKS. Closing this issue. Thanks.
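
To illustrate the fix described above, here is a small sketch that models security group ingress as a list of rules and checks whether the control plane may reach port 61678 on the workers. Everything in it is an assumption for illustration: the `IngressRule` type, the `allows` function, and the group names `sg-workers` and `sg-controlplane` are hypothetical and are not the AWS API.

```go
package main

import "fmt"

// IngressRule is a simplified, hypothetical model of a security group
// ingress rule that admits TCP traffic from another security group.
type IngressRule struct {
	SourceGroup string
	FromPort    int
	ToPort      int
}

// allows reports whether any rule admits traffic from sourceGroup on port.
func allows(rules []IngressRule, sourceGroup string, port int) bool {
	for _, r := range rules {
		if r.SourceGroup == sourceGroup && port >= r.FromPort && port <= r.ToPort {
			return true
		}
	}
	return false
}

func main() {
	const metricsPort = 61678

	// Before: the worker group only admits node-to-node traffic.
	workerRules := []IngressRule{
		{SourceGroup: "sg-workers", FromPort: 0, ToPort: 65535},
	}
	fmt.Println(allows(workerRules, "sg-controlplane", metricsPort)) // false

	// After: the control plane group may reach the metrics port.
	workerRules = append(workerRules, IngressRule{
		SourceGroup: "sg-controlplane", FromPort: metricsPort, ToPort: metricsPort,
	})
	fmt.Println(allows(workerRules, "sg-controlplane", metricsPort)) // true
}
```

This also shows why a rule allowing all node-to-node traffic was not enough here: the proxied request originates from the control plane's security group, not from a worker node.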


aws / amazon-vpc-cni-k8s

Failed to grab CNI endpoint: the server is currently unable to handle the request #2091
