Closed hiteshghia closed 2 years ago
@hiteshghia - Are you using IRSA? Can you also send your clusterARN to k8s-awscni-triage@amazon.com?
Yes using IRSA. Will email, thanks!
Since the vpc-cni daemonset pods use hostNetwork, they would use the host/node DNS resolver and not the cluster DNS (CoreDNS). In that case, wouldn't the cni-metrics-helper pod be unable to reach aws-node:61678? That looks like what the cni-metrics-helper is trying to do; should that work?
Never mind, I see it is using the REST client from client-go. I have already emailed k8s-awscni-triage@amazon.com as well; please let me know what other info you need. Thanks.
Ran a small Go script locally that does the same thing metrics.go does:

    // clientset is a configured *kubernetes.Clientset; ctx is a context.Context
    res := clientset.CoreV1().RESTClient().Get().
        Namespace("kube-system").
        Resource("pods").
        Name("aws-node-ksvqx:61678").
        SubResource("proxy").
        Suffix("metrics").
        Do(ctx)
And get the same response back:
panic: the server is currently unable to handle the request (get pods aws-node-ksvqx:61678)
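For context, those chained builder calls just assemble a single API-server proxy URL. A minimal, self-contained sketch of the path being requested (the `podProxyPath` helper below is hypothetical, for illustration; it is not code from metrics.go):

```go
package main

import "fmt"

// podProxyPath assembles the API-server path that the client-go chain
// Namespace(...).Resource("pods").Name(...).SubResource("proxy").Suffix(...)
// resolves to. Hypothetical helper for illustration only.
func podProxyPath(namespace, podWithPort, suffix string) string {
	return fmt.Sprintf("/api/v1/namespaces/%s/pods/%s/proxy/%s",
		namespace, podWithPort, suffix)
}

func main() {
	fmt.Println(podProxyPath("kube-system", "aws-node-ksvqx:61678", "metrics"))
	// /api/v1/namespaces/kube-system/pods/aws-node-ksvqx:61678/proxy/metrics
}
```

Requesting the same path with `kubectl get --raw` can help tell an RBAC or API-server failure apart from a plain network one, since it goes through the API server with your own credentials rather than the pod's.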
Tried with version 1.10.2 of the VPC CNI and got the same issue.
We tried curl from the cni-metrics-helper pod to aws-node on the same node. There is no connectivity issue and we were able to query the metrics. So the failure when going via the API server looks like a permissions issue; can you please double-check the IRSA role/permissions for cni-metrics-helper?
curl 10.6.12.236:61678/metrics
# HELP awscni_add_ip_req_count The number of add IP address requests
# TYPE awscni_add_ip_req_count counter
awscni_add_ip_req_count 2
# HELP awscni_assigned_ip_addresses The number of IP addresses assigned to pods
# TYPE awscni_assigned_ip_addresses gauge
awscni_assigned_ip_addresses 1
# HELP awscni_assigned_ip_per_cidr The total number of IP addresses assigned per cidr
# TYPE awscni_assigned_ip_per_cidr gauge
awscni_assigned_ip_per_cidr{cidr="10.6.13.11/32"} 1
awscni_assigned_ip_per_cidr{cidr="10.6.14.193/32"} 0
# HELP awscni_aws_api_latency_ms AWS API call latency in ms
# TYPE awscni_aws_api_latency_ms summary
Tried making all the addons (CoreDNS, kube-proxy, and the CNI) EKS-managed instead of self-managed, and the same issue persists. I also ran the exact command that metrics.go runs, see here, and it gave the same error. For reference, we have another, older cluster in the same account where that script worked just fine, talking directly to the aws-node pods using the client-go REST client. What exactly should I be checking for with IRSA? These are the env vars for the cni-metrics-helper pod -
- env:
  - name: AWS_CLUSTER_ID
    value: us-east-1-k8s-cloud
  - name: USE_CLOUDWATCH
    value: "true"
  - name: AWS_REGION
    value: us-east-1
  - name: AWS_DEFAULT_REGION
    value: us-east-1
  - name: AWS_STS_REGIONAL_ENDPOINTS
    value: regional
  - name: AWS_ROLE_ARN
    value: arn:aws:iam::xxxxxxxxxx:role/eks-vpc-cni-metrics-helper-cloud
  - name: AWS_WEB_IDENTITY_TOKEN_FILE
    value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
This is the cni-metrics-helper ServiceAccount -
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxx:role/eks-vpc-cni-metrics-helper-cloud
  creationTimestamp: "2022-09-21T01:10:19Z"
  labels:
    app.kubernetes.io/instance: cni-metrics-helper
    app.kubernetes.io/name: cni-metrics-helper
    app.kubernetes.io/version: v1.11.3
    environment: cloud
  name: cni-metrics-helper
  namespace: kube-system
secrets:
- name: cni-metrics-helper-token-pvz2c
This is the policy attached to that role:
{
  "Statement": [
    {
      "Action": "cloudwatch:PutMetricData",
      "Effect": "Allow",
      "Resource": "*",
      "Sid": "eksvpccnimetricshelper"
    }
  ],
  "Version": "2012-10-17"
}
And the following trust relationship:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "eksvpccnimetricshelpertrustpolicy",
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::xxxxxxxxxx:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXX"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXX:aud": "sts.amazonaws.com",
          "oidc.eks.us-east-1.amazonaws.com/id/XXXXXXXXXXXXXXXXXXXX:sub": "system:serviceaccount:kube-system:cni-metrics-helper"
        }
      }
    }
  ]
}
This is the ClusterRoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
  creationTimestamp: "2022-09-21T01:10:22Z"
  labels:
    app.kubernetes.io/instance: cni-metrics-helper
    app.kubernetes.io/name: cni-metrics-helper
    app.kubernetes.io/version: v1.11.3
    environment: cloud
  name: cni-metrics-helper
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cni-metrics-helper
subjects:
- kind: ServiceAccount
  name: cni-metrics-helper
  namespace: kube-system
And this is the ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
  creationTimestamp: "2022-09-21T01:10:20Z"
  labels:
    environment: cloud
  name: cni-metrics-helper
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - pods/proxy
  verbs:
  - get
  - watch
  - list
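With that ClusterRole bound, the RBAC side can be checked in isolation via impersonation. A sketch, assuming kubectl is pointed at the affected cluster (it requires a live cluster, so run it yourself rather than taking the result on faith):

```shell
# Ask the API server whether the cni-metrics-helper service account may
# use the pods/proxy subresource, which is what the metrics scrape needs.
kubectl auth can-i get pods --subresource=proxy \
  --as=system:serviceaccount:kube-system:cni-metrics-helper \
  -n kube-system
```

If this prints "yes" while the scrape still fails, the problem is downstream of RBAC (e.g. the API server cannot reach the pod's port).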
Also note that the cni-metrics-helper pod is using IRSA, but the VPC CNI itself relies on the node IAM role.
We found the issue on our side. Since the cni-metrics-helper uses the API server's pod proxy to reach the aws-node pods' metrics endpoint, which is served on port 61678, we had to open up that port from the cluster (control plane) security group to the worker node security group for EKS. Closing this issue. Thanks.
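For anyone hitting the same symptom, that fix translates to an ingress rule on the worker node security group allowing TCP 61678 from the control plane security group. A sketch with placeholder security-group IDs (substitute your own; both IDs here are assumptions, not values from this issue):

```shell
# Allow the EKS control plane (cluster SG) to reach the aws-node
# metrics port on the worker nodes.
aws ec2 authorize-security-group-ingress \
  --group-id <worker-node-sg-id> \
  --protocol tcp \
  --port 61678 \
  --source-group <control-plane-sg-id>
```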
Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.
Followed the instructions here to set up cni-metrics-helper, except that we created the resources via our automation tool rather than eksctl.
On inspecting the logs of cni-metrics-helper: