aws / amazon-cloudwatch-agent

CloudWatch Agent enables you to collect and export host-level metrics and logs on instances running Linux or Windows server.

amazon-cloudwatch-observability fails with open /root/.aws/credentials ignoring the IRSA credentials #1101

Open ecerulm opened 5 months ago

ecerulm commented 5 months ago

Describe the bug

When IMDSv2 is enabled on the worker nodes with hop limit 1, IMDS is not accessible from the pods. In general, I don't want pods to access IMDS since they can get credentials for the node IAM role.

When IMDSv2 is not accessible, it seems that the cloudwatch agent (I'm using the amazon-cloudwatch-observability eks addon) tries to use credentials from the non-existent file /root/.aws/credentials instead of using the credentials from IRSA. The pod uses a service account with an IRSA annotation and has the environment variables AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE (injected by IRSA).
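
For reference, an IRSA-enabled service account is just a regular ServiceAccount carrying the role annotation, roughly like this (the names and account ID below are placeholders, not taken from my setup):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
  annotations:
    # The EKS pod identity webhook injects AWS_ROLE_ARN and
    # AWS_WEB_IDENTITY_TOKEN_FILE into pods using this service account.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/cloudwatch-agent-irsa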

But I believe the amazon-cloudwatch-agent is ignoring the IRSA credentials (I suspect it's because IMDSv2 is not available, so it decides it is "onprem").

I see on startup of the pod

D! [EC2] Found active network interface
I! imds retry client will retry 1 times
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! could not get hostname without imds v1 fallback enable thus enable fallback
E! [EC2] Fetch hostname from EC2 metadata fail: EC2MetadataError: failed to make EC2Metadata request
 status code: 401, request id: 
D! should retry true for imds error : RequestError: send request failed

...

I! Detected the instance is OnPremise

Steps to reproduce

- EKS 1.29
- EKS nodes 1.29 Bottlerocket with IMDSv2 (http tokens required, hop limit 1)
- amazon-cloudwatch-observability eks addon v1.4.0-eksbuild.1 (default config)
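
For context, the IMDS setup above corresponds to instance metadata options along these lines (the instance ID is a placeholder; on managed node groups / Bottlerocket this is normally set via the launch template instead):

aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required \
  --http-put-response-hop-limit 1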

What did you expect to see?

I expect to be able to override the "OnPrem" / "EC2" detection from the eks addon configuration. I don't see that as a possibility in the amazon-cloudwatch-observability addon:

 aws eks describe-addon-configuration --addon-name amazon-cloudwatch-observability --addon-version "v1.4.0-eksbuild.1" --query configurationSchema | jq '.|fromjson'

What did you see instead?

I see that it detects "OnPremise", and I believe that in turn forces it to use /root/.aws/credentials when in fact it should be using the credentials from IRSA via the existing environment variables AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE.
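
One quick way to confirm that the IRSA variables really are injected into the agent pod spec (the label selector is an assumption; adjust it to your deployment):

kubectl -n amazon-cloudwatch get pod -l app.kubernetes.io/name=cloudwatch-agent \
  -o jsonpath='{.items[0].spec.containers[0].env[*].name}'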

What version did you use?

I'm using amazon-cloudwatch-observability eks addon version v1.4.0-eksbuild.1; I don't know which version of the amazon-cloudwatch-agent is included with that.

What config did you use?

Environment

- EKS 1.29
- EKS nodes 1.29 Bottlerocket with IMDSv2 (http tokens required, hop limit 1)
- amazon-cloudwatch-observability eks addon v1.4.0-eksbuild.1 (default config)


faizanshah-tp commented 5 months ago

Facing the same while running the cloudwatch agent as a daemonset.

D! [EC2] Found active network interface
I! imds retry client will retry 1 times
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! could not get hostname without imds v1 fallback enable thus enable fallback
E! [EC2] Fetch hostname from EC2 metadata fail: EC2MetadataError: failed to make EC2Metadata request

    status code: 401, request id: 
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! could not get instance document without imds v1 fallback enable thus enable fallback
ecerulm commented 5 months ago

I uninstalled the amazon-cloudwatch-observability eks add-on and installed using the instructions at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-metrics.html and I'm getting the same result.

But I can set RUN_WITH_IRSA="True" (I actually have a service account with an IRSA annotation) in the DaemonSet, and that makes it detect:

I! Detected from ENV RUN_WITH_IRSA is True

The RUN_WITH_IRSA environment variable does not seem to be documented, but it's in the source code and it works; the value needs to be True (not true or 1).
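
A minimal way to try this when the agent runs as a plain DaemonSet (the daemonset name is an assumption; if the operator/add-on manages the agent, set it on the AmazonCloudWatchAgent resource instead, as shown further down in this thread):

kubectl -n amazon-cloudwatch set env daemonset/cloudwatch-agent RUN_WITH_IRSA=True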

ecerulm commented 5 months ago

Although using RUN_WITH_IRSA allows the DaemonSet to run and some metrics are sent to CloudWatch, I can see that it still tries to use the IMDS to get the EC2 metadata (instance id, image id, and instance type). I guess it's not possible to get those in any other way currently.

CloudWatch Container Insights still lacks "Top 10 Nodes by CPU Utilization", etc. I guess all metrics that have NodeName as a dimension are missing.

I guess this means that amazon-cloudwatch-agent really needs IMDS, and maybe that should be documented. You can't have it without it, can you?


2024-03-25T14:14:10Z I! {"caller":"host/ec2metadata.go:78","msg":"Fetch instance id and type from ec2 metadata","kind":"receiver","name":"awscontainerinsightreceiver","data_type":"metrics"}
2024-03-25T14:14:11.425Z    DEBUG   aws@v0.0.0-20231208183748-c00ca1f62c3e/imdsretryer.go:45    imds error :    {"shouldRetry": true, "error": "RequestError: send request failed\ncaused by: Put \"http://169.254.169.254/latest/api/token\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
ecerulm commented 5 months ago

[Restrict the use of host networking and block access to instance metadata service ] sort of recommends blocking IMDS from the pods:

While these privileges are required for the node to operate effectively, it is not usually desirable that the pods running on the node inherit these privileges.

But IMDS access seems like a hard requirement for using the cloudwatch agent for kubernetes container insights with enhanced observability (enhanced_container_insights).

The alternatives are

[Restrict the use of host networking and block access to instance metadata service ]: https://docs.aws.amazon.com/whitepapers/latest/security-practices-multi-tenant-saas-applications-eks/restrict-the-use-of-host-networking-and-block-access-to-instance-metadata-service.html

jefchien commented 5 months ago

Hi @ecerulm, we are aware of the current IMDS requirement and are internally tracking an alternative for when IMDS is unavailable.

AaronFriel commented 5 months ago

But IMDS access seems like a hard requirement for using the cloudwatch agent for kubernetes container insights with enhanced observability (enhanced_container_insights).

This isn't the case - if you edit the operator's agent resource configuration like so, you will see that it is capable of using IRSA:

$ kubectl -n amazon-cloudwatch edit amazoncloudwatchagents.cloudwatch.aws.amazon.com 

Apply this change:

 apiVersion: v1
 items:
 - apiVersion: cloudwatch.aws.amazon.com/v1alpha1
   kind: AmazonCloudWatchAgent
   metadata:
     annotations:
       pulumi.com/patchForce: "true"
     creationTimestamp: "2024-04-01T08:21:38Z"
     generation: 5
     labels:
       app.kubernetes.io/managed-by: amazon-cloudwatch-agent-operator
     name: cloudwatch-agent
     namespace: amazon-cloudwatch
     resourceVersion: "3839446"
     uid: 542fecd4-0368-4ab1-8d8b-e7e5ad47c538
   spec:
     config: '{"agent":{"region":"us-west-2"},"logs":{"metrics_collected":{"app_signals":{"hosted_in":"opal-quokka-6860d02"},"kubernetes":{"cluster_name":"opal-quokka-6860d02","enhanced_container_insights":true}}},"traces":{"traces_collected":{"app_signals":{}}}}'
     env:
+  - name: RUN_WITH_IRSA
+    value: true  
   - name: K8S_NODE_NAME
     valueFrom:
       fieldRef:
         fieldPath: spec.nodeName
ecerulm commented 5 months ago

@AaronFriel

Like I commented at https://github.com/aws/amazon-cloudwatch-agent/issues/1101#issuecomment-2018335759, even with RUN_WITH_IRSA it still goes to IMDS to obtain the instance id, etc.

Although, using RUN_WITH_IRSA allows the DaemonSet to run and some metrics are sent to CloudWatch . I can see that it still tries to use the IMDS to get the EC2 metadata (instance id, image id, and instance type)

The instance id, etc. are needed for the "kubernetes container insights with enhanced observability" (enhanced_container_insights) metrics, and since they can't be obtained, those metrics are not sent.

I don't think there is any way to pass the instance id, etc. by other means today, but @jefchien seems to be indicating that they may be working on some alternative.

sbabalol commented 2 months ago

Any ETA on this fix?

ArthurMelin commented 1 month ago

I don't think there is any way to pass the instance id, etc. by other means today, but @jefchien seems to be indicating that they may be working on some alternative.

Maybe the agent could grab the instance ID from the spec.providerID field on the Node object, which can be fetched from the kube-api?
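
The instance ID is indeed embedded in that field; for example (node name and output are illustrative):

kubectl get node ip-10-0-1-23.us-west-2.compute.internal -o jsonpath='{.spec.providerID}'
# aws:///us-west-2a/i-0123456789abcdef0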

Cornul11 commented 1 month ago

Replying to the earlier suggestion ("This isn't the case - if you edit the operator's agent resource configuration like so, you will see that it is capable of using IRSA", i.e. adding the RUN_WITH_IRSA env var as quoted above):

With this change applied to the agent, the issue still persists.

As mentioned in another comment, I created a custom launch template with an increased number of max hops, which solved the issue. I do understand, however, that this may be a security concern and should be avoided, but as a temporary measure until the addon is fixed, it is acceptable for our use case.
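
For anyone taking the same route, the relevant setting is MetadataOptions.HttpPutResponseHopLimit on the launch template; roughly (the template name and source version are placeholders):

aws ec2 create-launch-template-version \
  --launch-template-name my-eks-node-template \
  --source-version 1 \
  --launch-template-data '{"MetadataOptions":{"HttpTokens":"required","HttpPutResponseHopLimit":2}}'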

mcujba commented 1 month ago

Replying to the RUN_WITH_IRSA suggestion and the launch-template workaround quoted above:

The workaround is valid. The value needs to be written as "True", starting with an uppercase letter.

kwangjong commented 1 month ago

I used this helm chart to deploy the add-on: https://github.com/aws-observability/helm-charts

Modifying amazon-cloudwatch-observability/templates/linux/cloudwatch-agent-daemonset.yaml like below solved the issue.

apiVersion: cloudwatch.aws.amazon.com/v1alpha1
kind: AmazonCloudWatchAgent
metadata:
  name: {{ template "cloudwatch-agent.name" . }}
  namespace: {{ .Release.Namespace }}
spec:
+ hostNetwork: true
  image: {{ template "cloudwatch-agent.image" . }}
  mode: daemonset
  ...
  env:
+ - name: RUN_WITH_IRSA
+   value: "True"
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  ...
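
After editing the template, re-deploying the chart picks up the change, roughly like this (release name and chart path follow the linked repository and are assumptions):

helm upgrade --install amazon-cloudwatch-observability ./charts/amazon-cloudwatch-observability \
  --namespace amazon-cloudwatch --create-namespace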

Although this is not required, I configured Gatekeeper to restrict host network access exclusively to CloudWatch Agent pods for enhanced security.

constraintTemplate.yaml:

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sallowedhostnetworking
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedHostNetworking
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedhostnetworking

        default allow = false

        allow {
          input.review.object.metadata.labels["app.kubernetes.io/name"] == "cloudwatch-agent"
        }

        violation[{"msg": msg}] {
          not allow
          input.review.object.spec.hostNetwork == true
          msg := "Host network is not allowed for this pod"
        }

constraint.yaml:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedHostNetworking
metadata:
  name: allowed-host-networking
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]