aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

CNI metrics helper v1.10.2 is unable to scrape metrics from aws-node #1912

Closed youwalther65 closed 2 years ago

youwalther65 commented 2 years ago

What happened: The CNI metrics helper pod is running but is not able to scrape metrics from the aws-node pods.

$ k get clusterrole cni-metrics-helper
NAME                 CREATED AT
cni-metrics-helper   2022-03-07T18:37:14Z

$ k get clusterrolebinding cni-metrics-helper
NAME                 ROLE                             AGE
cni-metrics-helper   ClusterRole/cni-metrics-helper   23m

$ k get deploy -n kube-system cni-metrics-helper
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
cni-metrics-helper   1/1     1            1           21m
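A quick way to confirm that this RBAC grants the scrape path is to impersonate the ServiceAccount; a sketch that assumes the helper reads metrics through the pods/proxy subresource listed in the ClusterRole:

# Should print "yes" when the ClusterRole/ClusterRoleBinding are wired up
kubectl auth can-i get pods --subresource=proxy -n kube-system \
    --as=system:serviceaccount:kube-system:cni-metrics-helper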

$ k get deploy -n kube-system cni-metrics-helper -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-03-07T18:37:14Z"
  generation: 1
  labels:
    k8s-app: cni-metrics-helper
    kustomize.toolkit.fluxcd.io/name: flux-infrastructure
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: cni-metrics-helper
  namespace: kube-system
  resourceVersion: "119599"
  uid: 1589a869-2cbc-439d-9b8d-6f7d9ee693f8
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: cni-metrics-helper
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: cni-metrics-helper
    spec:
      containers:

Attach logs:

$ k logs -n kube-system cni-metrics-helper-5dff487d97-q2n6d
...
{"level":"debug","ts":"2022-03-07T18:38:01.245Z","caller":"metrics/metrics.go:261","msg":"Reset detected resetDetected: false, noPreviousDataPoint: true, noCurrentDataPoint: false"}
{"level":"error","ts":"2022-03-07T18:40:11.293Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-ns559:61678)"}
{"level":"error","ts":"2022-03-07T18:42:22.365Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-w9qnr:61678)"}
{"level":"error","ts":"2022-03-07T18:44:33.437Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-jqq92:61678)"}
{"level":"info","ts":"2022-03-07T18:44:33.437Z","caller":"runtime/proc.go:255","msg":"Collecting metrics ..."}
{"level":"info","ts":"2022-03-07T18:44:33.437Z","caller":"metrics/cni_metrics.go:195","msg":"Total aws-node pod count:- %!(EXTRA int=4)"}
{"level":"error","ts":"2022-03-07T18:46:44.508Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-ns559:61678)"}
{"level":"debug","ts":"2022-03-07T18:46:44.519Z","caller":"metrics/metrics.go:382","msg":"cni-metrics text output: # HELP awscni_add_ip_req_count The number of add IP address requests\n# TYPE awscni_add_ip_req_count counter\nawscni_add_ip_req_count 0\n# HELP awscni_assigned_ip_addresses The number of IP addresses assigned to pods\n# TYPE awscni_assigned_ip_addresses gauge\nawscni_assigned_ip_addresses 0\n# HELP awscni_aws_api_latency_ms AWS API call latency in ms\n# TYPE awscni_aws_api_latency_ms summary\nawscni_aws_api_latency_ms_sum{api=\"DescribeNetworkInterfaces\",error=\"false\",status=\"200\"} 278\nawscni_aws_api_latency_ms_count{api=\"DescribeNetworkInterfaces\",error=\"false\",status=\"200\"} 1\nawscni_aws_api_latency_ms_sum{api=\"GetMetadata\",error=\"false\",status=\"200\"} 640\nawscni_aws_api_latency_ms_count{api=\"GetMetadata\",error=\"false\",status=\"200\"} 3191\nawscni_aws_api_latency_ms_sum{api=\"GetMetadata\",error=\"true\",status=\"404\"} 53\nawscni_aws_api_latency_ms_count{api=\"GetMetadata\",error=\"true\",status=\"404\"} 319\nawscni_aws_api_latency_ms_sum{api=\"ModifyNetworkInterfaceAttribute\",error=\"false\",status=\"200\"} 380\nawscni_aws_api_latency_ms_count{api=\"ModifyNetworkInterfaceAttribute\",error=\"false\",status=\"200\"} 1\n# HELP awscni_build_info A metric with a constant '1' value labeled by version, revision, and goversion from which amazon-vpc-cni-k8s was built.\n# TYPE awscni_build_info gauge\nawscni_build_info{goversion=\"go1.16.10\",version=\"\"} 1\n# HELP awscni_eni_allocated The number of ENIs allocated\n# TYPE awscni_eni_allocated gauge\nawscni_eni_allocated 1\n# HELP awscni_eni_max The maximum number of ENIs that can be attached to the instance, accounting for unmanaged ENIs\n# TYPE awscni_eni_max gauge\nawscni_eni_max 3\n# HELP awscni_force_removed_enis The number of ENIs force removed while they had assigned pods\n# TYPE awscni_force_removed_enis counter\nawscni_force_removed_enis 0\n# HELP awscni_force_removed_ips The number of IPs force removed while they had assigned pods\n# TYPE awscni_force_removed_ips counter\nawscni_force_removed_ips 0\n# 
HELP awscni_ip_max The maximum number of IP addresses that can be allocated to the instance\n# TYPE awscni_ip_max gauge\nawscni_ip_max 15\n# HELP awscni_ipamd_action_inprogress The number of ipamd actions in progress\n# TYPE awscni_ipamd_action_inprogress gauge\nawscni_ipamd_action_inprogress{fn=\"nodeIPPoolReconcile\"} 0\nawscni_ipamd_action_inprogress{fn=\"nodeInit\"} 0\n# HELP awscni_reconcile_count The number of times ipamd reconciles on ENIs and IP/Prefix addresses\n# TYPE awscni_reconcile_count counter\nawscni_reconcile_count{fn=\"eniDataStorePoolReconcileAdd\"} 1585\n# HELP awscni_total_ip_addresses The total number of IP addresses\n# TYPE awscni_total_ip_addresses gauge\nawscni_total_ip_addresses 5\n# HELP awscni_total_ipv4_prefixes The total number of IPv4 prefixes\n# TYPE awscni_total_ipv4_prefixes gauge\nawscni_total_ipv4_prefixes 0\n# HELP go_gc_duration_seconds A summary of the GC invocation durations.\n# TYPE go_gc_duration_seconds summary\ngo_gc_duration_seconds{quantile=\"0\"} 3.2051e-05\ngo_gc_duration_seconds{quantile=\"0.25\"} 4.746e-05\ngo_gc_duration_seconds{quantile=\"0.5\"} 5.3798e-05\ngo_gc_duration_seconds{quantile=\"0.75\"} 7.3225e-05\ngo_gc_duration_seconds{quantile=\"1\"} 0.001240274\ngo_gc_duration_seconds_sum 0.011943986\ngo_gc_duration_seconds_count 163\n# HELP go_goroutines Number of goroutines that currently exist.\n# TYPE go_goroutines gauge\ngo_goroutines 37\n# HELP go_info Information about the Go environment.\n# TYPE go_info gauge\ngo_info{version=\"go1.16.10\"} 1\n# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.\n# TYPE go_memstats_alloc_bytes gauge\ngo_memstats_alloc_bytes 5.749584e+06\n# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.\n# TYPE go_memstats_alloc_bytes_total counter\ngo_memstats_alloc_bytes_total 5.1863536e+08\n# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.\n# TYPE go_memstats_buck_hash_sys_bytes gauge\ngo_memstats_buck_hash_sys_bytes 1.490576e+06\n# HELP go_memstats_frees_total Total number of frees.\n# TYPE go_memstats_frees_total counter\ngo_memstats_frees_total 1.529438e+06\n# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.\n# TYPE go_memstats_gc_cpu_fraction gauge\ngo_memstats_gc_cpu_fraction 2.9108407286280238e-06\n# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.\n# TYPE go_memstats_gc_sys_bytes gauge\ngo_memstats_gc_sys_bytes 5.616304e+06\n# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.\n# TYPE go_memstats_heap_alloc_bytes gauge\ngo_memstats_heap_alloc_bytes 5.749584e+06\n# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.\n# TYPE go_memstats_heap_idle_bytes gauge\ngo_memstats_heap_idle_bytes 5.8212352e+07\n# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.\n# TYPE go_memstats_heap_inuse_bytes gauge\ngo_memstats_heap_inuse_bytes 8.208384e+06\n# HELP go_memstats_heap_objects Number of allocated objects.\n# TYPE go_memstats_heap_objects gauge\ngo_memstats_heap_objects 29446\n# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.\n# TYPE go_memstats_heap_released_bytes gauge\ngo_memstats_heap_released_bytes 5.5681024e+07\n# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.\n# TYPE go_memstats_heap_sys_bytes gauge\ngo_memstats_heap_sys_bytes 6.6420736e+07\n# HELP 
go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.\n# TYPE go_memstats_last_gc_time_seconds gauge\ngo_memstats_last_gc_time_seconds 1.6466787616502016e+09\n# HELP go_memstats_lookups_total Total number of pointer lookups.\n# TYPE go_memstats_lookups_total counter\ngo_memstats_lookups_total 0\n# HELP go_memstats_mallocs_total Total number of mallocs.\n# TYPE go_memstats_mallocs_total counter\ngo_memstats_mallocs_total 1.558884e+06\n# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.\n# TYPE go_memstats_mcache_inuse_bytes gauge\ngo_memstats_mcache_inuse_bytes 2400\n# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.\n# TYPE go_memstats_mcache_sys_bytes gauge\ngo_memstats_mcache_sys_bytes 16384\n# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.\n# TYPE go_memstats_mspan_inuse_bytes gauge\ngo_memstats_mspan_inuse_bytes 119952\n# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.\n# TYPE go_memstats_mspan_sys_bytes gauge\ngo_memstats_mspan_sys_bytes 147456\n# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.\n# TYPE go_memstats_next_gc_bytes gauge\ngo_memstats_next_gc_bytes 8.87776e+06\n# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.\n# TYPE go_memstats_other_sys_bytes gauge\ngo_memstats_other_sys_bytes 676552\n# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.\n# TYPE go_memstats_stack_inuse_bytes gauge\ngo_memstats_stack_inuse_bytes 688128\n# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.\n# TYPE go_memstats_stack_sys_bytes gauge\ngo_memstats_stack_sys_bytes 688128\n# HELP go_memstats_sys_bytes Number of bytes obtained from system.\n# TYPE go_memstats_sys_bytes gauge\ngo_memstats_sys_bytes 7.5056136e+07\n# HELP go_threads Number of OS threads created.\n# TYPE go_threads gauge\ngo_threads 8\n# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.\n# TYPE process_cpu_seconds_total counter\nprocess_cpu_seconds_total 7.82\n# HELP process_max_fds Maximum number of open file descriptors.\n# TYPE process_max_fds gauge\nprocess_max_fds 1.048576e+06\n# HELP process_open_fds Number of open file descriptors.\n# TYPE process_open_fds gauge\nprocess_open_fds 20\n# HELP process_resident_memory_bytes Resident memory size in bytes.\n# TYPE process_resident_memory_bytes gauge\nprocess_resident_memory_bytes 5.7962496e+07\n# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.\n# TYPE process_start_time_seconds gauge\nprocess_start_time_seconds 1.64665979817e+09\n# HELP process_virtual_memory_bytes Virtual memory size in bytes.\n# TYPE process_virtual_memory_bytes gauge\nprocess_virtual_memory_bytes 7.78473472e+08\n# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.\n# TYPE process_virtual_memory_max_bytes gauge\nprocess_virtual_memory_max_bytes -1\n# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.\n# TYPE promhttp_metric_handler_requests_in_flight gauge\npromhttp_metric_handler_requests_in_flight 1\n# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.\n# TYPE promhttp_metric_handler_requests_total counter\npromhttp_metric_handler_requests_total{code=\"200\"} 
1\npromhttp_metric_handler_requests_total{code=\"500\"} 0\npromhttp_metric_handler_requests_total{code=\"503\"} 0\n"}

The ServiceAccount is using IRSA:

$ k get sa -n kube-system cni-metrics-helper -o yaml | head -6
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam:::role/AmazonEKSVPCCNIMetricsHelperRole-git-eks-demo-ipv4
  creationTimestamp: "2022-03-07T18:37:14Z"

$ aws iam get-role --role-name AmazonEKSVPCCNIMetricsHelperRole-git-eks-demo-ipv4
{
    "Role": {
        "Path": "/",
        "RoleName": "AmazonEKSVPCCNIMetricsHelperRole-git-eks-demo-ipv4",
        "RoleId": "AROAZAC4CGT7ZTEGU53VD",
        "Arn": "arn:aws:iam:::role/AmazonEKSVPCCNIMetricsHelperRole-git-eks-demo-ipv4",
        "CreateDate": "2022-03-07T17:27:29Z",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Sid": "",
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "arn:aws:iam:::oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "oidc.eks.eu-west-1.amazonaws.com/id/:sub": "system:serviceaccount:kube-system:cni-metrics-helper"
                        }
                    }
                }
            ]

The proper policy is attached:

$ aws iam list-attached-role-policies --role-name AmazonEKSVPCCNIMetricsHelperRole-git-eks-demo-ipv4
{
    "AttachedPolicies": [
        {
            "PolicyName": "AmazonEKSVPCCNIMetricsHelperPolicy-git-eks-demo-ipv4",
            "PolicyArn": "arn:aws:iam:::policy/AmazonEKSVPCCNIMetricsHelperPolicy-git-eks-demo-ipv4"
        }
    ]
}

Instances have the following IMDS settings:

"MetadataOptions": {
    "State": "applied",
    "HttpTokens": "required",
    "HttpPutResponseHopLimit": 2,
    "HttpEndpoint": "enabled",
    "HttpProtocolIpv6": "disabled"
},
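The hop limit matters here because pods in their own network namespace reach IMDSv2 through an extra hop, so with HttpTokens required a HttpPutResponseHopLimit of at least 2 is needed. A sketch of how to inspect and set these options (the instance ID is a placeholder):

# Inspect the current IMDS options
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
    --query 'Reservations[].Instances[].MetadataOptions'

# Keep IMDSv2 enforced but allow one extra hop for pods
aws ec2 modify-instance-metadata-options --instance-id i-0123456789abcdef0 \
    --http-tokens required --http-put-response-hop-limit 2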

What you expected to happen: Scrape CNI metrics from the aws-node pods and publish them to CloudWatch.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

cgchinmay commented 2 years ago

Will check and get back to you soon.

cgchinmay commented 2 years ago

If you were using IRSA, then the region field should have been auto-injected. I am not sure why you don't see it in your deployment spec for cni-metrics-helper; I will check your cluster setup. Could you share your cluster ARN with k8s-awscni-triage@amazon.com? Meanwhile:

  1. Could you check cni-metrics-helper logs, it should display at the top what region and cluster id values are being used
  2. Could you manually try to add AWS_REGION as an env var in your cni-metrics helper deployment spec
  - name: AWS_REGION
    value: <your region>
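For reference, a minimal sketch of where that variable sits in the Deployment; the container name is taken from the manifest and the region value is a placeholder:

spec:
  template:
    spec:
      containers:
        - name: cni-metrics-helper   # container name assumed from the manifest
          env:
            - name: AWS_REGION
              value: eu-west-1       # placeholder; use your cluster's region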

Thanks

youwalther65 commented 2 years ago

I followed installation instructions from: https://docs.aws.amazon.com/eks/latest/userguide/cni-metrics-helper.html

This points to the following YAML for the SA, RBAC, and deployment:

$ curl -o cni-metrics-helper.yaml https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.10/config/master/cni-metrics-helper.yaml

This already has some duplicate entries in the deployment manifest:

Source: cni-metrics-helper/templates/deployment.yaml

kind: Deployment
apiVersion: apps/v1
metadata:
  name: cni-metrics-helper
  namespace: kube-system
  labels:
    k8s-app: cni-metrics-helper
spec:
  selector:
    matchLabels:
      k8s-app: cni-metrics-helper
  template:
    metadata:
      labels:
        k8s-app: cni-metrics-helper
    spec:
      containers:

I will substitute AWS_REGION for the second AWS_CLUSTER_ID here and check.
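Roughly, the substitution looks like this; a sketch only, using this cluster's values and the duplicate AWS_CLUSTER_ID entry reported later in the thread:

env:
  - name: AWS_CLUSTER_ID
    value: git-eks-demo-ipv4
  # the second, duplicated AWS_CLUSTER_ID entry becomes:
  - name: AWS_REGION
    value: eu-west-1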

cgchinmay commented 2 years ago

Could you check this README: https://github.com/aws/amazon-vpc-cni-k8s/tree/master/cmd/cni-metrics-helper. Also, share your cluster ARN with k8s-awscni-triage@amazon.com; this will help me inspect your deployment spec and service accounts. Thanks

youwalther65 commented 2 years ago

Same HTTP 503 messages, but I can confirm that the region is now used in the deployment.

$ k logs -n kube-system cni-metrics-helper-75cb84c9f8-r2wgn
{"level":"info","ts":"2022-03-08T07:15:30.793Z","caller":"runtime/proc.go:255","msg":"Starting CNIMetricsHelper. Sending metrics to CloudWatch: true, LogLevel Debug"}
I0308 07:15:31.850529 1 request.go:621] Throttling request took 1.037923334s, request: GET:https://172.20.0.1:443/apis/kustomize.toolkit.fluxcd.io/v1beta1?timeout=32s
{"level":"info","ts":"2022-03-08T07:15:38.815Z","caller":"cni-metrics-helper/main.go:113","msg":"Using REGION=eu-west-1 and CLUSTER_ID=git-eks-demo-ipv4"}
{"level":"info","ts":"2022-03-08T07:16:08.816Z","caller":"runtime/proc.go:255","msg":"Collecting metrics ..."}
{"level":"info","ts":"2022-03-08T07:16:08.916Z","caller":"metrics/cni_metrics.go:195","msg":"Total aws-node pod count:- %!(EXTRA int=4)"}
{"level":"debug","ts":"2022-03-08T07:16:08.922Z","caller":"metrics/metrics.go:382","msg":"cni-metrics text output: # HELP awscni_add_ip_req_count The number of add IP address requests\n# TYPE awscni_add_ip_req_count counter\nawscni_add_ip_req_count 0\n# HELP awscni_assigned_ip_addresses The number of IP addresses assigned to pods\n# TYPE awscni_assigned_ip_addresses gauge\nawscni_assigned_ip_addresses 0\n# HELP awscni_aws_api_latency_ms AWS API call latency in ms\n# TYPE awscni_aws_api_latency_ms summary\nawscni_aws_api_latency_ms_sum{api=\"DescribeNetworkInterfaces\",error=\"false\",status=\"200\"} 278\nawscni_aws_api_latency_ms_count{api=\"DescribeNetworkInterfaces\",error=\"false\",status=\"200\"} 1\nawscni_aws_api_latency_ms_sum{api=\"GetMetadata\",error=\"false\",status=\"200\"} 1789\nawscni_aws_api_latency_ms_count{api=\"GetMetadata\",error=\"false\",status=\"200\"} 10683\nawscni_aws_api_latency_ms_sum{api=\"GetMetadata\",error=\"true\",status=\"404\"} 166\nawscni_aws_api_latency_ms_count{api=\"GetMetadata\",error=\"true\",status=\"404\"} 1068\nawscni_aws_api_latency_ms_sum{api=\"ModifyNetworkInterfaceAttribute\",error=\"false\",status=\"200\"} 380\nawscni_aws_api_latency_ms_count{api=\"ModifyNetworkInterfaceAttribute\",error=\"false\",status=\"200\"} 1\n# HELP awscni_build_info A metric with a constant '1' value labeled by version, revision, and goversion from which amazon-vpc-cni-k8s was built.\n# TYPE awscni_build_info gauge\nawscni_build_info{goversion=\"go1.16.10\",version=\"\"} 1\n# HELP awscni_eni_allocated The number of ENIs allocated\n# TYPE awscni_eni_allocated gauge\nawscni_eni_allocated 1\n# HELP awscni_eni_max The maximum number of ENIs that can be attached to the instance, accounting for unmanaged ENIs\n# TYPE awscni_eni_max gauge\nawscni_eni_max 3\n# HELP awscni_force_removed_enis The number of ENIs force removed while they had assigned pods\n# TYPE awscni_force_removed_enis counter\nawscni_force_removed_enis 0\n# HELP awscni_force_removed_ips The number of IPs force removed while they had assigned pods\n# TYPE awscni_force_removed_ips counter\nawscni_force_removed_ips 0\n# HELP awscni_ip_max The maximum number of IP addresses that can be allocated to the instance\n# TYPE awscni_ip_max gauge\nawscni_ip_max 15\n# HELP awscni_ipamd_action_inprogress The number of ipamd actions in progress\n# TYPE awscni_ipamd_action_inprogress gauge\nawscni_ipamd_action_inprogress{fn=\"nodeIPPoolReconcile\"} 0\nawscni_ipamd_action_inprogress{fn=\"nodeInit\"} 0\n# HELP awscni_reconcile_count The number of times ipamd reconciles on ENIs and IP/Prefix addresses\n# TYPE awscni_reconcile_count 
counter\nawscni_reconcile_count{fn=\"eniDataStorePoolReconcileAdd\"} 5330\n# HELP awscni_total_ip_addresses The total number of IP addresses\n# TYPE awscni_total_ip_addresses gauge\nawscni_total_ip_addresses 5\n# HELP awscni_total_ipv4_prefixes The total number of IPv4 prefixes\n# TYPE awscni_total_ipv4_prefixes gauge\nawscni_total_ipv4_prefixes 0\n# HELP go_gc_duration_seconds A summary of the GC invocation durations.\n# TYPE go_gc_duration_seconds summary\ngo_gc_duration_seconds{quantile=\"0\"} 4.0646e-05\ngo_gc_duration_seconds{quantile=\"0.25\"} 5.2208e-05\ngo_gc_duration_seconds{quantile=\"0.5\"} 7.3721e-05\ngo_gc_duration_seconds{quantile=\"0.75\"} 0.000102977\ngo_gc_duration_seconds{quantile=\"1\"} 0.001945569\ngo_gc_duration_seconds_sum 0.061231686\ngo_gc_duration_seconds_count 544\n# HELP go_goroutines Number of goroutines that currently exist.\n# TYPE go_goroutines gauge\ngo_goroutines 37\n# HELP go_info Information about the Go environment.\n# TYPE go_info gauge\ngo_info{version=\"go1.16.10\"} 1\n# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.\n# TYPE go_memstats_alloc_bytes gauge\ngo_memstats_alloc_bytes 4.73988e+06\n# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.\n# TYPE go_memstats_alloc_bytes_total counter\ngo_memstats_alloc_bytes_total 1.817476792e+09\n# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.\n# TYPE go_memstats_buck_hash_sys_bytes gauge\ngo_memstats_buck_hash_sys_bytes 1.54928e+06\n# HELP go_memstats_frees_total Total number of frees.\n# TYPE go_memstats_frees_total counter\ngo_memstats_frees_total 4.886522e+06\n# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.\n# TYPE go_memstats_gc_cpu_fraction gauge\ngo_memstats_gc_cpu_fraction 5.9174801745196725e-06\n# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.\n# TYPE go_memstats_gc_sys_bytes gauge\ngo_memstats_gc_sys_bytes 5.626544e+06\n# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.\n# TYPE go_memstats_heap_alloc_bytes gauge\ngo_memstats_heap_alloc_bytes 4.73988e+06\n# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.\n# TYPE go_memstats_heap_idle_bytes gauge\ngo_memstats_heap_idle_bytes 5.9006976e+07\n# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.\n# TYPE go_memstats_heap_inuse_bytes gauge\ngo_memstats_heap_inuse_bytes 7.479296e+06\n# HELP go_memstats_heap_objects Number of allocated objects.\n# TYPE go_memstats_heap_objects gauge\ngo_memstats_heap_objects 26010\n# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.\n# TYPE go_memstats_heap_released_bytes gauge\ngo_memstats_heap_released_bytes 5.636096e+07\n# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.\n# TYPE go_memstats_heap_sys_bytes gauge\ngo_memstats_heap_sys_bytes 6.6486272e+07\n# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.\n# TYPE go_memstats_last_gc_time_seconds gauge\ngo_memstats_last_gc_time_seconds 1.6467237672096214e+09\n# HELP go_memstats_lookups_total Total number of pointer lookups.\n# TYPE go_memstats_lookups_total counter\ngo_memstats_lookups_total 0\n# HELP go_memstats_mallocs_total Total number of mallocs.\n# TYPE go_memstats_mallocs_total counter\ngo_memstats_mallocs_total 4.912532e+06\n# HELP go_memstats_mcache_inuse_bytes Number of 
bytes in use by mcache structures.\n# TYPE go_memstats_mcache_inuse_bytes gauge\ngo_memstats_mcache_inuse_bytes 2400\n# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.\n# TYPE go_memstats_mcache_sys_bytes gauge\ngo_memstats_mcache_sys_bytes 16384\n# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.\n# TYPE go_memstats_mspan_inuse_bytes gauge\ngo_memstats_mspan_inuse_bytes 119544\n# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.\n# TYPE go_memstats_mspan_sys_bytes gauge\ngo_memstats_mspan_sys_bytes 147456\n# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.\n# TYPE go_memstats_next_gc_bytes gauge\ngo_memstats_next_gc_bytes 9.173072e+06\n# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.\n# TYPE go_memstats_other_sys_bytes gauge\ngo_memstats_other_sys_bytes 607608\n# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.\n# TYPE go_memstats_stack_inuse_bytes gauge\ngo_memstats_stack_inuse_bytes 622592\n# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.\n# TYPE go_memstats_stack_sys_bytes gauge\ngo_memstats_stack_sys_bytes 622592\n# HELP go_memstats_sys_bytes Number of bytes obtained from system.\n# TYPE go_memstats_sys_bytes gauge\ngo_memstats_sys_bytes 7.5056136e+07\n# HELP go_threads Number of OS threads created.\n# TYPE go_threads gauge\ngo_threads 8\n# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.\n# TYPE process_cpu_seconds_total counter\nprocess_cpu_seconds_total 26.97\n# HELP process_max_fds Maximum number of open file descriptors.\n# TYPE process_max_fds gauge\nprocess_max_fds 1.048576e+06\n# HELP process_open_fds Number of open file descriptors.\n# TYPE process_open_fds gauge\nprocess_open_fds 20\n# HELP process_resident_memory_bytes Resident memory size in bytes.\n# TYPE process_resident_memory_bytes gauge\nprocess_resident_memory_bytes 5.7884672e+07\n# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.\n# TYPE process_start_time_seconds gauge\nprocess_start_time_seconds 1.64665979817e+09\n# HELP process_virtual_memory_bytes Virtual memory size in bytes.\n# TYPE process_virtual_memory_bytes gauge\nprocess_virtual_memory_bytes 7.78473472e+08\n# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.\n# TYPE process_virtual_memory_max_bytes gauge\nprocess_virtual_memory_max_bytes -1\n# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.\n# TYPE promhttp_metric_handler_requests_in_flight gauge\npromhttp_metric_handler_requests_in_flight 1\n# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.\n# TYPE promhttp_metric_handler_requests_total counter\npromhttp_metric_handler_requests_total{code=\"200\"} 116\npromhttp_metric_handler_requests_total{code=\"500\"} 0\npromhttp_metric_handler_requests_total{code=\"503\"} 0\n"} {"level":"debug","ts":"2022-03-08T07:16:08.923Z","caller":"metrics/metrics.go:261","msg":"Reset detected resetDetected: false, noPreviousDataPoint: true, noCurrentDataPoint: false"}

Interesting to note that the AWS_DEFAULT_REGION env var is now not injected:

spec:
  containers:
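Worth noting: when IRSA injection does happen, it is done at pod admission by the EKS pod identity mutating webhook, so the variables appear on the Pod object rather than in the Deployment spec. A hedged way to check the running pod directly, assuming it carries the k8s-app=cni-metrics-helper label from the manifest:

# List the env vars actually present on the running pod
kubectl -n kube-system get pod -l k8s-app=cni-metrics-helper \
    -o jsonpath='{.items[0].spec.containers[0].env}'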

I will check the GitHub docs and send the email in a minute.

youwalther65 commented 2 years ago

Interesting to see why IRSA does not inject both AWS_REGION and AWS_DEFAULT_REGION. I added them manually in the deployment, and now NTH works:

$ k logs -n kube-system aws-node-termination-handler-6f846dcb79-rm6hl
2022/03/07 15:43:17 INF Starting to serve handler /healthz, port 8080
2022/03/07 15:43:17 INF Startup Metadata Retrieved metadata={"accountId":"","availabilityZone":"eu-west-1a","instanceId":"i-0xxx","instanceLifeCycle":"on-demand","instanceType":"t3.large","localHostname":"ip-xxx.eu-west-1.compute.internal","privateIp":"10.0.x.y","publicHostname":"","publicIp":"","region":"eu-west-1"}
2022/03/07 15:43:17 INF aws-node-termination-handler arguments: dry-run: false, node-name: ip-10-0-1-200.eu-west-1.compute.internal, metadata-url: http://169.254.169.254, kubernetes-service-host: 172.20.0.1, kubernetes-service-port: 443, delete-local-data: true, ignore-daemon-sets: true, pod-termination-grace-period: -1, node-termination-grace-period: 120, enable-scheduled-event-draining: false, enable-spot-interruption-draining: false, enable-sqs-termination-draining: true, enable-rebalance-monitoring: false, enable-rebalance-draining: false, metadata-tries: 3, cordon-only: false, taint-node: false, taint-effect: NoSchedule, json-logging: false, log-level: info, webhook-proxy: , webhook-headers: , webhook-url: , webhook-template: , uptime-from-file: , enable-prometheus-server: false, prometheus-server-port: 9092, emit-kubernetes-events: false, kubernetes-events-extra-annotations: , aws-region: eu-west-1, queue-url: https://sqs.eu-west-1.queue.amazonaws.com//git-eks-demo-ipv4-karpenter, check-asg-tag-before-draining: false, managed-asg-tag: aws-node-termination-handler/managed, assume-asg-tag-propagation: false, aws-endpoint: ,

2022/03/07 15:43:17 INF Started watching for interruption events
2022/03/07 15:43:17 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/03/07 15:43:17 INF Started watching for event cancellations
2022/03/07 15:43:17 INF Started monitoring for events event_type=SQS_TERMINATE

Now I see metrics in CloudWatch.
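To double-check from the CLI, something like the following can list what arrived; a sketch that assumes the helper publishes under the CloudWatch namespace Kubernetes with a CLUSTER_ID dimension:

# Namespace and dimension names are assumptions, not confirmed in this thread
aws cloudwatch list-metrics --namespace Kubernetes --region eu-west-1 \
    --dimensions Name=CLUSTER_ID,Value=git-eks-demo-ipv4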

cgchinmay commented 2 years ago

Hi @youwalther65, I followed the steps mentioned here: https://docs.aws.amazon.com/eks/latest/userguide/cni-metrics-helper.html. I tried in the ap-east-1 region and found the cni-metrics-helper pod injected with both the AWS_REGION and AWS_DEFAULT_REGION fields. I created the service account using eksctl (just FYI). I am not sure what went wrong in your case, but you can give it one more try with a fresh install. I also verified the metrics being published.
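A minimal eksctl sketch of that IRSA service-account setup (cluster name, account ID, and policy name are placeholders):

eksctl create iamserviceaccount \
    --name cni-metrics-helper \
    --namespace kube-system \
    --cluster my-cluster \
    --attach-policy-arn arn:aws:iam::111122223333:policy/AmazonEKSVPCCNIMetricsHelperPolicy \
    --approve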

Note: there was a duplicate AWS_CLUSTER_ID field in the manifest file. I am not sure whether that could have affected anything, but I am fixing it.

AWS_CLUSTER_ID:               test
USE_CLOUDWATCH:               true
AWS_DEFAULT_REGION:           ap-east-1
AWS_REGION:                   ap-east-1

Will wait for your response before we can close the issue.

youwalther65 commented 2 years ago

It works now, but I didn't find a real root cause.

github-actions[bot] commented 2 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

hiteshghia commented 2 years ago

Hi, I am facing this same issue right now. I added the AWS_REGION and AWS_DEFAULT_REGION manually but still see this in the cni-metrics-helper pod:


{"level":"info","ts":"2022-09-21T01:45:08.596Z","caller":"metrics/cni_metrics.go:195","msg":"Total aws-node pod count:- %!(EXTRA int=6)"}
{"level":"error","ts":"2022-09-21T01:47:18.033Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-sx47t:61678)"}
{"level":"error","ts":"2022-09-21T01:49:29.105Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-gdtcw:61678)"}
{"level":"error","ts":"2022-09-21T01:51:40.177Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-xbc9b:61678)"}
{"level":"error","ts":"2022-09-21T01:53:51.249Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-ktpbn:61678)"}```

Any help greatly appreciated.
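One way to exercise the same call path the helper uses is the pods/proxy endpoint on the API server; a sketch, with the pod name taken from the log above:

# Ask the API server to proxy a request to the aws-node metrics port;
# a 503 here points at the proxy path rather than at the helper itself
kubectl get --raw "/api/v1/namespaces/kube-system/pods/aws-node-sx47t:61678/proxy/metrics" | head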

jdn5126 commented 2 years ago

Hi @hiteshghia , are you still facing this issue?

taer commented 3 weeks ago

If this project is still around, I seem to be having the same issue.

{"level":"info","ts":"2024-10-14T20:06:11.547Z","caller":"cni-metrics-helper/main.go:69","msg":"Constructed new logger instance"}                                                          
{"level":"info","ts":"2024-10-14T20:06:11.548Z","caller":"runtime/proc.go:271","msg":"Starting CNIMetricsHelper. Sending metrics to CloudWatch: false, Prometheus: true, LogLevel DEBUG, me
tricUpdateInterval 30"}                                                                                                                                                                    
{"level":"info","ts":"2024-10-14T20:06:41.588Z","caller":"runtime/proc.go:271","msg":"Collecting metrics ..."}                                                                             
{"level":"info","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/cni_metrics.go:211","msg":"Total aws-node pod count: 5"}                                                                 
{"level":"debug","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/metrics.go:439","msg":"Total TargetList pod count: 5"}                                                                  
{"level":"error","ts":"2024-10-14T20:08:51.287Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the 
request (get pods aws-node-n929t:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:11:02.359Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the 
request (get pods aws-node-xlz6m:61678)"}    

Installed the cni-metrics-helper via the Helm chart, with the intent to scrape via Prometheus.
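As with the kubectl get --raw check suggested earlier in the thread, it can help to test the target endpoint directly to rule out the API-server proxy path; a sketch where the pod name comes from the log above and <pod-ip> is a placeholder:

# Find the pod IP of one failing target
kubectl -n kube-system get pod aws-node-n929t -o wide

# Hit the metrics endpoint from inside the cluster, bypassing the API server
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
    curl -s http://<pod-ip>:61678/metrics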