aws / eks-charts

Amazon EKS Helm chart repository
Apache License 2.0

CloudWatch Agent Fails on EKS when IMDS is Restricted According to Best Practices #517

Open fitchtech opened 3 years ago

fitchtech commented 3 years ago

When deploying the aws-cloudwatch-metrics chart version 0.0.4 with image.tag 1.247347.6b250880 and IRSA mapped to arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy, I get the following error in the DaemonSet logs. IMDS is disabled per best practices, as I'm using RBAC with IRSA. I'm not sure whether this is an actual issue or can be safely ignored.

Logs:

2021/05/04 16:39:28 I! 2021/05/04 16:39:25 E! ec2metadata is not available
2021/05/04 16:39:25 I! attempt to access ECS task metadata to determine whether I'm running in ECS.
2021/05/04 16:39:26 W! retry [0/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/05/04 16:39:27 W! retry [1/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/05/04 16:39:28 W! retry [2/3], unable to get http response from http://169.254.170.2/v2/metadata, error: unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/05/04 16:39:28 I! access ECS task metadata fail with response unable to get response from http://169.254.170.2/v2/metadata, error: Get "http://169.254.170.2/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers), assuming I'm not running in ECS.
I! Detected the instance is OnPrem
2021/05/04 16:39:28 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json ...
/opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json does not exist or cannot read. Skipping it.
2021/05/04 16:39:28 Reading json config file path: /etc/cwagentconfig/..2021_05_04_16_39_17.050069089/cwagentconfig.json ...
2021/05/04 16:39:28 Find symbolic link /etc/cwagentconfig/..data
2021/05/04 16:39:28 Find symbolic link /etc/cwagentconfig/cwagentconfig.json
2021/05/04 16:39:28 Reading json config file path: /etc/cwagentconfig/cwagentconfig.json ...
Valid Json input schema.
Got Home directory: /root
Got Home directory: /root
I! Set home dir Linux: /root
I! SDKRegionWithCredsMap region: us-west-2
No csm configuration found.
No metric configuration found.
Configuration validation first phase succeeded

2021/05/04 16:39:28 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2021-05-04T16:39:28Z I! Starting AmazonCloudWatchAgent 1.247347.6
2021-05-04T16:39:28Z I! Loaded inputs: cadvisor k8sapiserver
2021-05-04T16:39:28Z I! Loaded aggregators:
2021-05-04T16:39:28Z I! Loaded processors: ec2tagger k8sdecorator
2021-05-04T16:39:28Z I! Loaded outputs: cloudwatchlogs
2021-05-04T16:39:28Z I! Tags enabled:
2021-05-04T16:39:28Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-0-6-146.us-west-2.compute.internal", Flush Interval:1s
2021-05-04T16:39:28Z I! [logagent] starting
2021-05-04T16:39:28Z I! [logagent] found plugin cloudwatchlogs is a log backend

fitchtech commented 3 years ago

This is still a problem in the aws-cloudwatch-metrics 0.0.5 Helm chart if you restrict access to the instance metadata service (IMDS) per the EKS best practices documentation.

Setting hostNetwork = true in the CloudWatch agent DaemonSet lets it reach the metadata service, so the agent can start.

However, that shouldn't be necessary if the agent is using the proper credential chain for assuming a role via the Kubernetes service account with annotations for IAM Roles for Service Accounts (IRSA).
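For anyone who needs the agent running in the meantime, the hostNetwork workaround can be sketched as a patch to the DaemonSet pod template. This is only an illustration: the thread confirms hostNetwork and nothing else, the dnsPolicy line is my assumption (it is the usual companion setting so that cluster DNS keeps working on the host network), and whether the chart exposes a value for either field is not stated here.

```yaml
# Sketch of a strategic-merge patch for the CloudWatch agent DaemonSet.
# Only hostNetwork is taken from this thread; treat the rest as assumptions.
spec:
  template:
    spec:
      # Run the pod in the node's network namespace so it can reach IMDS.
      hostNetwork: true
      # Assumption: keeps in-cluster DNS resolution working with hostNetwork.
      dnsPolicy: ClusterFirstWithHostNet
```

Note the trade-off: a pod on the host network can reach the node's instance metadata, which is exactly the access that restricting IMDS was meant to remove for pods.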

bcelenza commented 2 years ago

I ran into this while trying to get the agent running on a pure fargate cluster w/ IRSA.

It looks like the agent is using its own credentials chain, which does not include support for IRSA: https://github.com/aws/amazon-cloudwatch-agent/issues/308

claudio-vellage commented 2 years ago

I'm not sure. For me it doesn't seem to pick up the service account at all. It might be a different issue, but for some reason it always tries to use the account from the node:

pod

   ...
   serviceAccount: aws-cloudwatch-metrics
   serviceAccountName: aws-cloudwatch-metrics
   ...

sa

   metadata:
     annotations:
       eks.amazonaws.com/role-arn: arn:aws:iam::*REDACTED*:role/AmazonEKSCloudWatchMetricsRole

[outputs.cloudwatchlogs] Aws error received when sending logs to /aws/containerinsights/*REDACTED*/performance/*REDACTED*: AccessDeniedException: User: arn:aws:sts::*REDACTED*:assumed-role/eksNodeRole/*REDACTED* is not authorized to perform: logs:PutLogEvents on resource: arn:aws:logs:us-east-1:*REDACTED*:log-gr status code: 400, request id: *REDACTED*

It's somehow trying to use the eksNodeRole (assigned to the node), not the service account's role, and I'm not sure why. I'm using the same approach as for all other applications, where it works flawlessly.
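One way to narrow this down is to check whether the pod was actually mutated for IRSA. The sketch below shows roughly what the EKS pod identity webhook adds to a pod that uses an annotated service account; compare it with the output of kubectl get pod -o yaml for the agent pod. The container name is hypothetical, the role ARN is copied from the annotation above, and the expiration value is an assumed default.

```yaml
# Approximate shape of an IRSA-mutated pod spec (illustrative, not chart output).
spec:
  serviceAccountName: aws-cloudwatch-metrics
  containers:
    - name: aws-cloudwatch-metrics   # hypothetical container name
      env:
        # Injected by the webhook from the service account annotation.
        - name: AWS_ROLE_ARN
          value: "arn:aws:iam::*REDACTED*:role/AmazonEKSCloudWatchMetricsRole"
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      volumeMounts:
        - name: aws-iam-token
          mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
          readOnly: true
  volumes:
    - name: aws-iam-token
      projected:
        sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com
              expirationSeconds: 86400   # assumed default
              path: token
```

If these entries are missing, the pod was probably created before the annotation was added and needs to be recreated. If they are present and the node role is still used, that would point to the agent's credential chain rather than the service account setup.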

itforgeuk commented 2 years ago

Any update with this? Should we downgrade cwagent version?

all4innov commented 2 years ago

any update?

hbouaziz commented 2 years ago

Happy 1-year bugversairy :)

mkirlin commented 1 year ago

Hi! We've run into this issue in our cluster, so I just wanted to bump this again. We're raising a ticket with our AWS rep as well.

all4innov commented 1 year ago

some new updates ?

z9fr commented 1 year ago

any update ?

adrianmkng commented 1 year ago

By default, instance_metadata_tags is disabled on EC2 instances, so you can't query instance tags through the instance metadata from within the instance itself.

To make things more interesting, there's a bug with EKS where you can't actually enable this for EKS nodes (see: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1785). I don't think there is a workaround for this at the moment.