aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

CNI metrics helper v1.10.2 is unable to scrape metrics from aws-node #1912

Closed youwalther65 closed 2 years ago

youwalther65 commented 2 years ago

What happened: The CNI metrics helper pod is running but is not able to scrape metrics from the aws-node pods.

$ k get clusterrole cni-metrics-helper
NAME                 CREATED AT
cni-metrics-helper   2022-03-07T18:37:14Z

$ k get clusterrolebinding cni-metrics-helper
NAME                 ROLE                             AGE
cni-metrics-helper   ClusterRole/cni-metrics-helper   23m

$ k get deploy -n kube-system cni-metrics-helper
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
cni-metrics-helper   1/1     1            1           21m
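A quick way to confirm that this RBAC grants the scrape path is to impersonate the ServiceAccount; a sketch that assumes the helper reads metrics through the pods/proxy subresource listed in the ClusterRole:

# Should print "yes" when the ClusterRole/ClusterRoleBinding are wired up
kubectl auth can-i get pods --subresource=proxy -n kube-system \
    --as=system:serviceaccount:kube-system:cni-metrics-helper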

$ k get deploy -n kube-system cni-metrics-helper -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-03-07T18:37:14Z"
  generation: 1
  labels:
    k8s-app: cni-metrics-helper
    kustomize.toolkit.fluxcd.io/name: flux-infrastructure
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: cni-metrics-helper
  namespace: kube-system
  resourceVersion: "119599"
  uid: 1589a869-2cbc-439d-9b8d-6f7d9ee693f8
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: cni-metrics-helper
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: cni-metrics-helper
    spec:
      containers:

Attach logs:

$ k logs -n kube-system cni-metrics-helper-5dff487d97-q2n6d
...
{"level":"debug","ts":"2022-03-07T18:38:01.245Z","caller":"metrics/metrics.go:261","msg":"Reset detected resetDetected: false, noPreviousDataPoint: true, noCurrentDataPoint: false"}
{"level":"error","ts":"2022-03-07T18:40:11.293Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-ns559:61678)"}
{"level":"error","ts":"2022-03-07T18:42:22.365Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-w9qnr:61678)"}
{"level":"error","ts":"2022-03-07T18:44:33.437Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-jqq92:61678)"}
{"level":"info","ts":"2022-03-07T18:44:33.437Z","caller":"runtime/proc.go:255","msg":"Collecting metrics ..."}
{"level":"info","ts":"2022-03-07T18:44:33.437Z","caller":"metrics/cni_metrics.go:195","msg":"Total aws-node pod count:- %!(EXTRA int=4)"}
{"level":"error","ts":"2022-03-07T18:46:44.508Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-ns559:61678)"}
{"level":"debug","ts":"2022-03-07T18:46:44.519Z","caller":"metrics/metrics.go:382","msg":"cni-metrics text output: # HELP awscni_add_ip_req_count The number of add IP address requests\n# TYPE awscni_add_ip_req_count counter\nawscni_add_ip_req_count 0\n# HELP awscni_assigned_ip_addresses The number of IP addresses assigned to pods\n# TYPE awscni_assigned_ip_addresses gauge\nawscni_assigned_ip_addresses 0\n# HELP awscni_aws_api_latency_ms AWS API call latency in ms\n# TYPE awscni_aws_api_latency_ms summary\nawscni_aws_api_latency_ms_sum{api=\"DescribeNetworkInterfaces\",error=\"false\",status=\"200\"} 278\nawscni_aws_api_latency_ms_count{api=\"DescribeNetworkInterfaces\",error=\"false\",status=\"200\"} 1\nawscni_aws_api_latency_ms_sum{api=\"GetMetadata\",error=\"false\",status=\"200\"} 640\nawscni_aws_api_latency_ms_count{api=\"GetMetadata\",error=\"false\",status=\"200\"} 3191\nawscni_aws_api_latency_ms_sum{api=\"GetMetadata\",error=\"true\",status=\"404\"} 53\nawscni_aws_api_latency_ms_count{api=\"GetMetadata\",error=\"true\",status=\"404\"} 319\nawscni_aws_api_latency_ms_sum{api=\"ModifyNetworkInterfaceAttribute\",error=\"false\",status=\"200\"} 380\nawscni_aws_api_latency_ms_count{api=\"ModifyNetworkInterfaceAttribute\",error=\"false\",status=\"200\"} 1\n# HELP awscni_build_info A metric with a constant '1' value labeled by version, revision, and goversion from which amazon-vpc-cni-k8s was built.\n# TYPE awscni_build_info gauge\nawscni_build_info{goversion=\"go1.16.10\",version=\"\"} 1\n# HELP awscni_eni_allocated The number of ENIs allocated\n# TYPE awscni_eni_allocated gauge\nawscni_eni_allocated 1\n# HELP awscni_eni_max The maximum number of ENIs that can be attached to the instance, accounting for unmanaged ENIs\n# TYPE awscni_eni_max gauge\nawscni_eni_max 3\n# HELP awscni_force_removed_enis The number of ENIs force removed while they had assigned pods\n# TYPE awscni_force_removed_enis counter\nawscni_force_removed_enis 0\n# HELP awscni_force_removed_ips The number of IPs force removed while they had assigned pods\n# TYPE awscni_force_removed_ips counter\nawscni_force_removed_ips 0\n# 
HELP awscni_ip_max The maximum number of IP addresses that can be allocated to the instance\n# TYPE awscni_ip_max gauge\nawscni_ip_max 15\n# HELP awscni_ipamd_action_inprogress The number of ipamd actions in progress\n# TYPE awscni_ipamd_action_inprogress gauge\nawscni_ipamd_action_inprogress{fn=\"nodeIPPoolReconcile\"} 0\nawscni_ipamd_action_inprogress{fn=\"nodeInit\"} 0\n# HELP awscni_reconcile_count The number of times ipamd reconciles on ENIs and IP/Prefix addresses\n# TYPE awscni_reconcile_count counter\nawscni_reconcile_count{fn=\"eniDataStorePoolReconcileAdd\"} 1585\n# HELP awscni_total_ip_addresses The total number of IP addresses\n# TYPE awscni_total_ip_addresses gauge\nawscni_total_ip_addresses 5\n# HELP awscni_total_ipv4_prefixes The total number of IPv4 prefixes\n# TYPE awscni_total_ipv4_prefixes gauge\nawscni_total_ipv4_prefixes 0\n# HELP go_gc_duration_seconds A summary of the GC invocation durations.\n# TYPE go_gc_duration_seconds summary\ngo_gc_duration_seconds{quantile=\"0\"} 3.2051e-05\ngo_gc_duration_seconds{quantile=\"0.25\"} 4.746e-05\ngo_gc_duration_seconds{quantile=\"0.5\"} 5.3798e-05\ngo_gc_duration_seconds{quantile=\"0.75\"} 7.3225e-05\ngo_gc_duration_seconds{quantile=\"1\"} 0.001240274\ngo_gc_duration_seconds_sum 0.011943986\ngo_gc_duration_seconds_count 163\n# HELP go_goroutines Number of goroutines that currently exist.\n# TYPE go_goroutines gauge\ngo_goroutines 37\n# HELP go_info Information about the Go environment.\n# TYPE go_info gauge\ngo_info{version=\"go1.16.10\"} 1\n# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.\n# TYPE go_memstats_alloc_bytes gauge\ngo_memstats_alloc_bytes 5.749584e+06\n# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.\n# TYPE go_memstats_alloc_bytes_total counter\ngo_memstats_alloc_bytes_total 5.1863536e+08\n# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.\n# TYPE go_memstats_buck_hash_sys_bytes gauge\ngo_memstats_buck_hash_sys_bytes 1.490576e+06\n# HELP go_memstats_frees_total Total number of frees.\n# TYPE go_memstats_frees_total counter\ngo_memstats_frees_total 1.529438e+06\n# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.\n# TYPE go_memstats_gc_cpu_fraction gauge\ngo_memstats_gc_cpu_fraction 2.9108407286280238e-06\n# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.\n# TYPE go_memstats_gc_sys_bytes gauge\ngo_memstats_gc_sys_bytes 5.616304e+06\n# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.\n# TYPE go_memstats_heap_alloc_bytes gauge\ngo_memstats_heap_alloc_bytes 5.749584e+06\n# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.\n# TYPE go_memstats_heap_idle_bytes gauge\ngo_memstats_heap_idle_bytes 5.8212352e+07\n# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.\n# TYPE go_memstats_heap_inuse_bytes gauge\ngo_memstats_heap_inuse_bytes 8.208384e+06\n# HELP go_memstats_heap_objects Number of allocated objects.\n# TYPE go_memstats_heap_objects gauge\ngo_memstats_heap_objects 29446\n# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.\n# TYPE go_memstats_heap_released_bytes gauge\ngo_memstats_heap_released_bytes 5.5681024e+07\n# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.\n# TYPE go_memstats_heap_sys_bytes gauge\ngo_memstats_heap_sys_bytes 6.6420736e+07\n# HELP 
go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.\n# TYPE go_memstats_last_gc_time_seconds gauge\ngo_memstats_last_gc_time_seconds 1.6466787616502016e+09\n# HELP go_memstats_lookups_total Total number of pointer lookups.\n# TYPE go_memstats_lookups_total counter\ngo_memstats_lookups_total 0\n# HELP go_memstats_mallocs_total Total number of mallocs.\n# TYPE go_memstats_mallocs_total counter\ngo_memstats_mallocs_total 1.558884e+06\n# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.\n# TYPE go_memstats_mcache_inuse_bytes gauge\ngo_memstats_mcache_inuse_bytes 2400\n# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.\n# TYPE go_memstats_mcache_sys_bytes gauge\ngo_memstats_mcache_sys_bytes 16384\n# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.\n# TYPE go_memstats_mspan_inuse_bytes gauge\ngo_memstats_mspan_inuse_bytes 119952\n# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.\n# TYPE go_memstats_mspan_sys_bytes gauge\ngo_memstats_mspan_sys_bytes 147456\n# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.\n# TYPE go_memstats_next_gc_bytes gauge\ngo_memstats_next_gc_bytes 8.87776e+06\n# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.\n# TYPE go_memstats_other_sys_bytes gauge\ngo_memstats_other_sys_bytes 676552\n# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.\n# TYPE go_memstats_stack_inuse_bytes gauge\ngo_memstats_stack_inuse_bytes 688128\n# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.\n# TYPE go_memstats_stack_sys_bytes gauge\ngo_memstats_stack_sys_bytes 688128\n# HELP go_memstats_sys_bytes Number of bytes obtained from system.\n# TYPE go_memstats_sys_bytes gauge\ngo_memstats_sys_bytes 7.5056136e+07\n# HELP go_threads Number of OS threads created.\n# TYPE go_threads gauge\ngo_threads 8\n# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.\n# TYPE process_cpu_seconds_total counter\nprocess_cpu_seconds_total 7.82\n# HELP process_max_fds Maximum number of open file descriptors.\n# TYPE process_max_fds gauge\nprocess_max_fds 1.048576e+06\n# HELP process_open_fds Number of open file descriptors.\n# TYPE process_open_fds gauge\nprocess_open_fds 20\n# HELP process_resident_memory_bytes Resident memory size in bytes.\n# TYPE process_resident_memory_bytes gauge\nprocess_resident_memory_bytes 5.7962496e+07\n# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.\n# TYPE process_start_time_seconds gauge\nprocess_start_time_seconds 1.64665979817e+09\n# HELP process_virtual_memory_bytes Virtual memory size in bytes.\n# TYPE process_virtual_memory_bytes gauge\nprocess_virtual_memory_bytes 7.78473472e+08\n# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.\n# TYPE process_virtual_memory_max_bytes gauge\nprocess_virtual_memory_max_bytes -1\n# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.\n# TYPE promhttp_metric_handler_requests_in_flight gauge\npromhttp_metric_handler_requests_in_flight 1\n# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.\n# TYPE promhttp_metric_handler_requests_total counter\npromhttp_metric_handler_requests_total{code=\"200\"} 
1\npromhttp_metric_handler_requests_total{code=\"500\"} 0\npromhttp_metric_handler_requests_total{code=\"503\"} 0\n"}

The ServiceAccount is using IRSA:

$ k get sa -n kube-system cni-metrics-helper -o yaml | head -6
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam:::role/AmazonEKSVPCCNIMetricsHelperRole-git-eks-demo-ipv4
  creationTimestamp: "2022-03-07T18:37:14Z"

$ aws iam get-role --role-name AmazonEKSVPCCNIMetricsHelperRole-git-eks-demo-ipv4
{
    "Role": {
        "Path": "/",
        "RoleName": "AmazonEKSVPCCNIMetricsHelperRole-git-eks-demo-ipv4",
        "RoleId": "AROAZAC4CGT7ZTEGU53VD",
        "Arn": "arn:aws:iam:::role/AmazonEKSVPCCNIMetricsHelperRole-git-eks-demo-ipv4",
        "CreateDate": "2022-03-07T17:27:29Z",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Sid": "",
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "arn:aws:iam:::oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "oidc.eks.eu-west-1.amazonaws.com/id/:sub": "system:serviceaccount:kube-system:cni-metrics-helper"
                        }
                    }
                }
            ]

The proper policy is attached:

$ aws iam list-attached-role-policies --role-name AmazonEKSVPCCNIMetricsHelperRole-git-eks-demo-ipv4
{
    "AttachedPolicies": [
        {
            "PolicyName": "AmazonEKSVPCCNIMetricsHelperPolicy-git-eks-demo-ipv4",
            "PolicyArn": "arn:aws:iam:::policy/AmazonEKSVPCCNIMetricsHelperPolicy-git-eks-demo-ipv4"
        }
    ]
}

Instances have the following IMDS settings:

"MetadataOptions": {
    "State": "applied",
    "HttpTokens": "required",
    "HttpPutResponseHopLimit": 2,
    "HttpEndpoint": "enabled",
    "HttpProtocolIpv6": "disabled"
},
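The hop limit matters here because pods in their own network namespace reach IMDSv2 through an extra hop, so with HttpTokens required a HttpPutResponseHopLimit of at least 2 is needed. A sketch of how to inspect and set these options (the instance ID is a placeholder):

# Inspect the current IMDS options
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
    --query 'Reservations[].Instances[].MetadataOptions'

# Keep IMDSv2 enforced but allow one extra hop for pods
aws ec2 modify-instance-metadata-options --instance-id i-0123456789abcdef0 \
    --http-tokens required --http-put-response-hop-limit 2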

What you expected to happen: Scrape CNI metrics from the aws-node pods and publish them to CloudWatch.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

cgchinmay commented 2 years ago

Will check and get back to you soon.

cgchinmay commented 2 years ago

If you were using IRSA, then the region field should have been auto-injected. I am not sure why you don't see it in your deployment spec for cni-metrics-helper; I will check your cluster setup. Could you share your cluster ARN with k8s-awscni-triage@amazon.com? Meanwhile:

  1. Could you check cni-metrics-helper logs, it should display at the top what region and cluster id values are being used
  2. Could you manually try to add AWS_REGION as an env var in your cni-metrics helper deployment spec
  - name: AWS_REGION
    value: <your region>
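For reference, a minimal sketch of where that variable sits in the Deployment; the container name is taken from the manifest and the region value is a placeholder:

spec:
  template:
    spec:
      containers:
        - name: cni-metrics-helper   # container name assumed from the manifest
          env:
            - name: AWS_REGION
              value: eu-west-1       # placeholder; use your cluster's region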

Thanks

youwalther65 commented 2 years ago

I followed installation instructions from: https://docs.aws.amazon.com/eks/latest/userguide/cni-metrics-helper.html

This points to the following YAML for the SA, RBAC, and deployment:

$ curl -o cni-metrics-helper.yaml https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.10/config/master/cni-metrics-helper.yaml

This already has some duplicate entries in the deployment manifest:

Source: cni-metrics-helper/templates/deployment.yaml

kind: Deployment
apiVersion: apps/v1
metadata:
  name: cni-metrics-helper
  namespace: kube-system
  labels:
    k8s-app: cni-metrics-helper
spec:
  selector:
    matchLabels:
      k8s-app: cni-metrics-helper
  template:
    metadata:
      labels:
        k8s-app: cni-metrics-helper
    spec:
      containers:

I will substitute AWS_REGION for the second AWS_CLUSTER_ID here and check.
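Roughly, the substitution looks like this; a sketch only, using this cluster's values and the duplicate AWS_CLUSTER_ID entry reported later in the thread:

env:
  - name: AWS_CLUSTER_ID
    value: git-eks-demo-ipv4
  # the second, duplicated AWS_CLUSTER_ID entry becomes:
  - name: AWS_REGION
    value: eu-west-1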

cgchinmay commented 2 years ago

Could you check this README: https://github.com/aws/amazon-vpc-cni-k8s/tree/master/cmd/cni-metrics-helper. Also, share your cluster ARN with k8s-awscni-triage@amazon.com; this will help me inspect your deployment spec and service accounts. Thanks

youwalther65 commented 2 years ago

Same HTTP 503 messages, but I can confirm that the region is now used in the deployment.

$ k logs -n kube-system cni-metrics-helper-75cb84c9f8-r2wgn
{"level":"info","ts":"2022-03-08T07:15:30.793Z","caller":"runtime/proc.go:255","msg":"Starting CNIMetricsHelper. Sending metrics to CloudWatch: true, LogLevel Debug"}
I0308 07:15:31.850529 1 request.go:621] Throttling request took 1.037923334s, request: GET:https://172.20.0.1:443/apis/kustomize.toolkit.fluxcd.io/v1beta1?timeout=32s
{"level":"info","ts":"2022-03-08T07:15:38.815Z","caller":"cni-metrics-helper/main.go:113","msg":"Using REGION=eu-west-1 and CLUSTER_ID=git-eks-demo-ipv4"}
{"level":"info","ts":"2022-03-08T07:16:08.816Z","caller":"runtime/proc.go:255","msg":"Collecting metrics ..."}
{"level":"info","ts":"2022-03-08T07:16:08.916Z","caller":"metrics/cni_metrics.go:195","msg":"Total aws-node pod count:- %!(EXTRA int=4)"}
{"level":"debug","ts":"2022-03-08T07:16:08.922Z","caller":"metrics/metrics.go:382","msg":"cni-metrics text output: # HELP awscni_add_ip_req_count The number of add IP address requests\n# TYPE awscni_add_ip_req_count counter\nawscni_add_ip_req_count 0\n# HELP awscni_assigned_ip_addresses The number of IP addresses assigned to pods\n# TYPE awscni_assigned_ip_addresses gauge\nawscni_assigned_ip_addresses 0\n# HELP awscni_aws_api_latency_ms AWS API call latency in ms\n# TYPE awscni_aws_api_latency_ms summary\nawscni_aws_api_latency_ms_sum{api=\"DescribeNetworkInterfaces\",error=\"false\",status=\"200\"} 278\nawscni_aws_api_latency_ms_count{api=\"DescribeNetworkInterfaces\",error=\"false\",status=\"200\"} 1\nawscni_aws_api_latency_ms_sum{api=\"GetMetadata\",error=\"false\",status=\"200\"} 1789\nawscni_aws_api_latency_ms_count{api=\"GetMetadata\",error=\"false\",status=\"200\"} 10683\nawscni_aws_api_latency_ms_sum{api=\"GetMetadata\",error=\"true\",status=\"404\"} 166\nawscni_aws_api_latency_ms_count{api=\"GetMetadata\",error=\"true\",status=\"404\"} 1068\nawscni_aws_api_latency_ms_sum{api=\"ModifyNetworkInterfaceAttribute\",error=\"false\",status=\"200\"} 380\nawscni_aws_api_latency_ms_count{api=\"ModifyNetworkInterfaceAttribute\",error=\"false\",status=\"200\"} 1\n# HELP awscni_build_info A metric with a constant '1' value labeled by version, revision, and goversion from which amazon-vpc-cni-k8s was built.\n# TYPE awscni_build_info gauge\nawscni_build_info{goversion=\"go1.16.10\",version=\"\"} 1\n# HELP awscni_eni_allocated The number of ENIs allocated\n# TYPE awscni_eni_allocated gauge\nawscni_eni_allocated 1\n# HELP awscni_eni_max The maximum number of ENIs that can be attached to the instance, accounting for unmanaged ENIs\n# TYPE awscni_eni_max gauge\nawscni_eni_max 3\n# HELP awscni_force_removed_enis The number of ENIs force removed while they had assigned pods\n# TYPE awscni_force_removed_enis counter\nawscni_force_removed_enis 0\n# HELP awscni_force_removed_ips The number of IPs force removed while they had assigned pods\n# TYPE awscni_force_removed_ips counter\nawscni_force_removed_ips 0\n# HELP awscni_ip_max The maximum number of IP addresses that can be allocated to the instance\n# TYPE awscni_ip_max gauge\nawscni_ip_max 15\n# HELP awscni_ipamd_action_inprogress The number of ipamd actions in progress\n# TYPE awscni_ipamd_action_inprogress gauge\nawscni_ipamd_action_inprogress{fn=\"nodeIPPoolReconcile\"} 0\nawscni_ipamd_action_inprogress{fn=\"nodeInit\"} 0\n# HELP awscni_reconcile_count The number of times ipamd reconciles on ENIs and IP/Prefix addresses\n# TYPE awscni_reconcile_count 
counter\nawscni_reconcile_count{fn=\"eniDataStorePoolReconcileAdd\"} 5330\n# HELP awscni_total_ip_addresses The total number of IP addresses\n# TYPE awscni_total_ip_addresses gauge\nawscni_total_ip_addresses 5\n# HELP awscni_total_ipv4_prefixes The total number of IPv4 prefixes\n# TYPE awscni_total_ipv4_prefixes gauge\nawscni_total_ipv4_prefixes 0\n# HELP go_gc_duration_seconds A summary of the GC invocation durations.\n# TYPE go_gc_duration_seconds summary\ngo_gc_duration_seconds{quantile=\"0\"} 4.0646e-05\ngo_gc_duration_seconds{quantile=\"0.25\"} 5.2208e-05\ngo_gc_duration_seconds{quantile=\"0.5\"} 7.3721e-05\ngo_gc_duration_seconds{quantile=\"0.75\"} 0.000102977\ngo_gc_duration_seconds{quantile=\"1\"} 0.001945569\ngo_gc_duration_seconds_sum 0.061231686\ngo_gc_duration_seconds_count 544\n# HELP go_goroutines Number of goroutines that currently exist.\n# TYPE go_goroutines gauge\ngo_goroutines 37\n# HELP go_info Information about the Go environment.\n# TYPE go_info gauge\ngo_info{version=\"go1.16.10\"} 1\n# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.\n# TYPE go_memstats_alloc_bytes gauge\ngo_memstats_alloc_bytes 4.73988e+06\n# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.\n# TYPE go_memstats_alloc_bytes_total counter\ngo_memstats_alloc_bytes_total 1.817476792e+09\n# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.\n# TYPE go_memstats_buck_hash_sys_bytes gauge\ngo_memstats_buck_hash_sys_bytes 1.54928e+06\n# HELP go_memstats_frees_total Total number of frees.\n# TYPE go_memstats_frees_total counter\ngo_memstats_frees_total 4.886522e+06\n# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.\n# TYPE go_memstats_gc_cpu_fraction gauge\ngo_memstats_gc_cpu_fraction 5.9174801745196725e-06\n# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.\n# TYPE go_memstats_gc_sys_bytes gauge\ngo_memstats_gc_sys_bytes 5.626544e+06\n# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.\n# TYPE go_memstats_heap_alloc_bytes gauge\ngo_memstats_heap_alloc_bytes 4.73988e+06\n# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.\n# TYPE go_memstats_heap_idle_bytes gauge\ngo_memstats_heap_idle_bytes 5.9006976e+07\n# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.\n# TYPE go_memstats_heap_inuse_bytes gauge\ngo_memstats_heap_inuse_bytes 7.479296e+06\n# HELP go_memstats_heap_objects Number of allocated objects.\n# TYPE go_memstats_heap_objects gauge\ngo_memstats_heap_objects 26010\n# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.\n# TYPE go_memstats_heap_released_bytes gauge\ngo_memstats_heap_released_bytes 5.636096e+07\n# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.\n# TYPE go_memstats_heap_sys_bytes gauge\ngo_memstats_heap_sys_bytes 6.6486272e+07\n# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.\n# TYPE go_memstats_last_gc_time_seconds gauge\ngo_memstats_last_gc_time_seconds 1.6467237672096214e+09\n# HELP go_memstats_lookups_total Total number of pointer lookups.\n# TYPE go_memstats_lookups_total counter\ngo_memstats_lookups_total 0\n# HELP go_memstats_mallocs_total Total number of mallocs.\n# TYPE go_memstats_mallocs_total counter\ngo_memstats_mallocs_total 4.912532e+06\n# HELP go_memstats_mcache_inuse_bytes Number of 
bytes in use by mcache structures.\n# TYPE go_memstats_mcache_inuse_bytes gauge\ngo_memstats_mcache_inuse_bytes 2400\n# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.\n# TYPE go_memstats_mcache_sys_bytes gauge\ngo_memstats_mcache_sys_bytes 16384\n# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.\n# TYPE go_memstats_mspan_inuse_bytes gauge\ngo_memstats_mspan_inuse_bytes 119544\n# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.\n# TYPE go_memstats_mspan_sys_bytes gauge\ngo_memstats_mspan_sys_bytes 147456\n# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.\n# TYPE go_memstats_next_gc_bytes gauge\ngo_memstats_next_gc_bytes 9.173072e+06\n# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.\n# TYPE go_memstats_other_sys_bytes gauge\ngo_memstats_other_sys_bytes 607608\n# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.\n# TYPE go_memstats_stack_inuse_bytes gauge\ngo_memstats_stack_inuse_bytes 622592\n# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.\n# TYPE go_memstats_stack_sys_bytes gauge\ngo_memstats_stack_sys_bytes 622592\n# HELP go_memstats_sys_bytes Number of bytes obtained from system.\n# TYPE go_memstats_sys_bytes gauge\ngo_memstats_sys_bytes 7.5056136e+07\n# HELP go_threads Number of OS threads created.\n# TYPE go_threads gauge\ngo_threads 8\n# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.\n# TYPE process_cpu_seconds_total counter\nprocess_cpu_seconds_total 26.97\n# HELP process_max_fds Maximum number of open file descriptors.\n# TYPE process_max_fds gauge\nprocess_max_fds 1.048576e+06\n# HELP process_open_fds Number of open file descriptors.\n# TYPE process_open_fds gauge\nprocess_open_fds 20\n# HELP process_resident_memory_bytes Resident memory size in bytes.\n# TYPE process_resident_memory_bytes gauge\nprocess_resident_memory_bytes 5.7884672e+07\n# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.\n# TYPE process_start_time_seconds gauge\nprocess_start_time_seconds 1.64665979817e+09\n# HELP process_virtual_memory_bytes Virtual memory size in bytes.\n# TYPE process_virtual_memory_bytes gauge\nprocess_virtual_memory_bytes 7.78473472e+08\n# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.\n# TYPE process_virtual_memory_max_bytes gauge\nprocess_virtual_memory_max_bytes -1\n# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.\n# TYPE promhttp_metric_handler_requests_in_flight gauge\npromhttp_metric_handler_requests_in_flight 1\n# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.\n# TYPE promhttp_metric_handler_requests_total counter\npromhttp_metric_handler_requests_total{code=\"200\"} 116\npromhttp_metric_handler_requests_total{code=\"500\"} 0\npromhttp_metric_handler_requests_total{code=\"503\"} 0\n"} {"level":"debug","ts":"2022-03-08T07:16:08.923Z","caller":"metrics/metrics.go:261","msg":"Reset detected resetDetected: false, noPreviousDataPoint: true, noCurrentDataPoint: false"}

Interesting to note that the AWS_DEFAULT_REGION env var is now not injected:

spec:
  containers:
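Worth noting: when IRSA injection does happen, it is done at pod admission by the EKS pod identity mutating webhook, so the variables appear on the Pod object rather than in the Deployment spec. A hedged way to check the running pod directly, assuming it carries the k8s-app=cni-metrics-helper label from the manifest:

# List the env vars actually present on the running pod
kubectl -n kube-system get pod -l k8s-app=cni-metrics-helper \
    -o jsonpath='{.items[0].spec.containers[0].env}'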

I will check the GitHub docs and send the email in a minute.

youwalther65 commented 2 years ago

Interesting to see why IRSA does not inject both AWS_REGION and AWS_DEFAULT_REGION. I added them manually in the deployment, and now NTH works:

$ k logs -n kube-system aws-node-termination-handler-6f846dcb79-rm6hl
2022/03/07 15:43:17 INF Starting to serve handler /healthz, port 8080
2022/03/07 15:43:17 INF Startup Metadata Retrieved metadata={"accountId":"","availabilityZone":"eu-west-1a","instanceId":"i-0xxx","instanceLifeCycle":"on-demand","instanceType":"t3.large","localHostname":"ip-xxx.eu-west-1.compute.internal","privateIp":"10.0.x.y","publicHostname":"","publicIp":"","region":"eu-west-1"}
2022/03/07 15:43:17 INF aws-node-termination-handler arguments: dry-run: false, node-name: ip-10-0-1-200.eu-west-1.compute.internal, metadata-url: http://169.254.169.254, kubernetes-service-host: 172.20.0.1, kubernetes-service-port: 443, delete-local-data: true, ignore-daemon-sets: true, pod-termination-grace-period: -1, node-termination-grace-period: 120, enable-scheduled-event-draining: false, enable-spot-interruption-draining: false, enable-sqs-termination-draining: true, enable-rebalance-monitoring: false, enable-rebalance-draining: false, metadata-tries: 3, cordon-only: false, taint-node: false, taint-effect: NoSchedule, json-logging: false, log-level: info, webhook-proxy: , webhook-headers: , webhook-url: , webhook-template: , uptime-from-file: , enable-prometheus-server: false, prometheus-server-port: 9092, emit-kubernetes-events: false, kubernetes-events-extra-annotations: , aws-region: eu-west-1, queue-url: https://sqs.eu-west-1.queue.amazonaws.com//git-eks-demo-ipv4-karpenter, check-asg-tag-before-draining: false, managed-asg-tag: aws-node-termination-handler/managed, assume-asg-tag-propagation: false, aws-endpoint: ,

2022/03/07 15:43:17 INF Started watching for interruption events
2022/03/07 15:43:17 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/03/07 15:43:17 INF Started watching for event cancellations
2022/03/07 15:43:17 INF Started monitoring for events event_type=SQS_TERMINATE

Now I see metrics in CloudWatch.
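To double-check from the CLI, something like the following can list what arrived; a sketch that assumes the helper publishes under the CloudWatch namespace Kubernetes with a CLUSTER_ID dimension:

# Namespace and dimension names are assumptions, not confirmed in this thread
aws cloudwatch list-metrics --namespace Kubernetes --region eu-west-1 \
    --dimensions Name=CLUSTER_ID,Value=git-eks-demo-ipv4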

cgchinmay commented 2 years ago

Hi @youwalther65, I followed the steps mentioned here: https://docs.aws.amazon.com/eks/latest/userguide/cni-metrics-helper.html. I tried in the ap-east-1 region and found the cni-metrics-helper pod injected with both the AWS_REGION and AWS_DEFAULT_REGION fields. I created the service account using eksctl (just FYI). I am not sure what went wrong in your case, but you can give it one more try with a fresh install. I also verified the metrics being published.
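A minimal eksctl sketch of that IRSA service-account setup (cluster name, account ID, and policy name are placeholders):

eksctl create iamserviceaccount \
    --name cni-metrics-helper \
    --namespace kube-system \
    --cluster my-cluster \
    --attach-policy-arn arn:aws:iam::111122223333:policy/AmazonEKSVPCCNIMetricsHelperPolicy \
    --approve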

Note: there was a duplicate AWS_CLUSTER_ID field in the manifest file. I am not sure whether that could have affected anything, but I am fixing it.

AWS_CLUSTER_ID:               test
USE_CLOUDWATCH:               true
AWS_DEFAULT_REGION:           ap-east-1
AWS_REGION:                   ap-east-1

Will wait for your response before we can close the issue.

youwalther65 commented 2 years ago

It works now, but I didn't find a real root cause.

github-actions[bot] commented 2 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

hiteshghia commented 2 years ago

Hi, I am facing this same issue right now. I added the AWS_REGION and AWS_DEFAULT_REGION manually but still see this in the cni-metrics-helper pod:


{"level":"info","ts":"2022-09-21T01:45:08.596Z","caller":"metrics/cni_metrics.go:195","msg":"Total aws-node pod count:- %!(EXTRA int=6)"}
{"level":"error","ts":"2022-09-21T01:47:18.033Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-sx47t:61678)"}
{"level":"error","ts":"2022-09-21T01:49:29.105Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-gdtcw:61678)"}
{"level":"error","ts":"2022-09-21T01:51:40.177Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-xbc9b:61678)"}
{"level":"error","ts":"2022-09-21T01:53:51.249Z","caller":"metrics/metrics.go:382","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the request (get pods aws-node-ktpbn:61678)"}```

Any help greatly appreciated.
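One way to exercise the same call path the helper uses is the pods/proxy endpoint on the API server; a sketch, with the pod name taken from the log above:

# Ask the API server to proxy a request to the aws-node metrics port;
# a 503 here points at the proxy path rather than at the helper itself
kubectl get --raw "/api/v1/namespaces/kube-system/pods/aws-node-sx47t:61678/proxy/metrics" | head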

jdn5126 commented 2 years ago

Hi @hiteshghia , are you still facing this issue?

taer commented 3 weeks ago

If this project is still around, I seem to be having the same issue.

{"level":"info","ts":"2024-10-14T20:06:11.547Z","caller":"cni-metrics-helper/main.go:69","msg":"Constructed new logger instance"}                                                          
{"level":"info","ts":"2024-10-14T20:06:11.548Z","caller":"runtime/proc.go:271","msg":"Starting CNIMetricsHelper. Sending metrics to CloudWatch: false, Prometheus: true, LogLevel DEBUG, me
tricUpdateInterval 30"}                                                                                                                                                                    
{"level":"info","ts":"2024-10-14T20:06:41.588Z","caller":"runtime/proc.go:271","msg":"Collecting metrics ..."}                                                                             
{"level":"info","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/cni_metrics.go:211","msg":"Total aws-node pod count: 5"}                                                                 
{"level":"debug","ts":"2024-10-14T20:06:41.689Z","caller":"metrics/metrics.go:439","msg":"Total TargetList pod count: 5"}                                                                  
{"level":"error","ts":"2024-10-14T20:08:51.287Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the 
request (get pods aws-node-n929t:61678)"}                                                                                                                                                  
{"level":"error","ts":"2024-10-14T20:11:02.359Z","caller":"metrics/metrics.go:399","msg":"grabMetricsFromTarget: Failed to grab CNI endpoint: the server is currently unable to handle the 
request (get pods aws-node-xlz6m:61678)"}    

Installed the cni-metrics-helper via the Helm chart, with the intent to scrape via Prometheus.
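As with the kubectl get --raw check suggested earlier in the thread, it can help to test the target endpoint directly to rule out the API-server proxy path; a sketch where the pod name comes from the log above and <pod-ip> is a placeholder:

# Find the pod IP of one failing target
kubectl -n kube-system get pod aws-node-n929t -o wide

# Hit the metrics endpoint from inside the cluster, bypassing the API server
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
    curl -s http://<pod-ip>:61678/metrics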