aws / amazon-cloudwatch-agent

CloudWatch Agent enables you to collect and export host-level metrics and logs on instances running Linux or Windows server.

[k8s] Pod metrics is gone when using containerd as runtime #188

Open pingleig opened 3 years ago

pingleig commented 3 years ago

This is exported from an internal ticket.

TL;DR

The latest image has been released. If you were using the temp image from this comment https://github.com/aws/amazon-cloudwatch-agent/issues/188#issuecomment-803764697 please update to the latest tag.

You may have been led to this issue by the following error message: W! No pod metric collected, metrics count is still 7 is containerd socket mounted? https://github.com/aws/amazon-cloudwatch-agent/issues/188

Background

We were relying on the pause container being named POD to detect a pod, which is the case for Docker but not for containerd: https://github.com/containerd/cri/issues/922#issuecomment-423729537

Users will not see pod metrics in the Container Insights dashboard, and they will find the following log message, which was introduced in #171:

https://github.com/aws/amazon-cloudwatch-agent/blob/fbdd619269be7a00172e06992a8d40b22be1a6d7/plugins/inputs/cadvisor/container_info_processor.go#L72-L72

The root cause is that we expect containerName == 'POD' to mark a path as a pod:

https://github.com/aws/amazon-cloudwatch-agent/blob/fbdd619269be7a00172e06992a8d40b22be1a6d7/plugins/inputs/cadvisor/container_info_processor.go#L119-L126
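
To make the failure mode concrete, here is a minimal Go sketch of a name-based check like the one described above. The function classifyEntry and the sample names are hypothetical and are not the agent's actual code. The sketch assumes, per the background above, that Docker reports the pause container as POD while containerd does not.

```go
package main

import "fmt"

// classifyEntry is a hypothetical stand-in for a name-based check:
// an entry counts as a pod only when its container name is exactly "POD".
func classifyEntry(containerName string) string {
	if containerName == "POD" {
		return "pod"
	}
	if containerName == "" {
		// No usable name: a check that only looks for "POD" cannot tell
		// that this entry is a pod sandbox.
		return "unknown"
	}
	return "container"
}

func main() {
	// With Docker, the pause container carries the name "POD".
	fmt.Println("docker pause container:", classifyEntry("POD"))
	// With containerd, the sandbox is assumed here to arrive without that name.
	fmt.Println("containerd sandbox:", classifyEntry(""))
	fmt.Println("application container:", classifyEntry("nginx"))
}
```

With a check of this shape, no entry is ever classified as a pod on containerd, which would match the missing pod metrics described in this issue.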

Fix

Release

The fix will be included in the next release; the release date is not determined yet.

pingleig commented 3 years ago

Created a temp image based on #189: public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1 (the latest official release now contains this fix). The DaemonSet YAML needs to be updated to mount /run/containerd/containerd.sock.

NOTE: If you are using Bottlerocket on EKS, the socket on the host is different due to https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b. You need to (and only need to) replace the volumes part to pick the right socket on the host. (The full snippet is at the end of this comment.)

      volumes:
       # ... 
        - name: containerdsock
          hostPath:
            # path: /run/containerd/containerd.sock
            # bottlerocket does not mount containerd sock at normal place
            # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
            path: /run/dockershim.sock

Default containerd path

When the host (and kubelet) uses /run/containerd/containerd.sock:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      containers:
        - name: cloudwatch-agent
          image: public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1
          imagePullPolicy: Always
          #ports:
          #  - containerPort: 8125
          #    hostPort: 8125
          #    protocol: UDP
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
          # Please don't change below envs
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.3.0"
          # Please don't change the mountPath
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            path: /run/containerd/containerd.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 60
      serviceAccountName: cloudwatch-agent

Non default containerd path

NOTE: You only need to change the volumes. When mounting the socket into the CloudWatch agent container, you should still put it at the default path.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      # aws eks update-kubeconfig --name eks-pod-metric-missing --region us-west-2
      containers:
        - name: cloudwatch-agent
          image: public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1
          imagePullPolicy: Always
          #ports:
          #  - containerPort: 8125
          #    hostPort: 8125
          #    protocol: UDP
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
          # Please don't change below envs
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.3.0"
          # Please don't change the mountPath
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            # path: /run/containerd/containerd.sock
            # bottle rocket does not mount containerd sock at normal place
            # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
            path: /run/dockershim.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 60
      serviceAccountName: cloudwatch-agent
pingleig commented 3 years ago

Another known issue is that, because we are using cadvisor, pod-level filesystem usage is ignored:

    "container_filesystem_available",
    "container_filesystem_capacity",
    "container_filesystem_usage",
    "container_filesystem_utilization"

https://github.com/google/cadvisor/blob/291c215c5ddc5216659b5e793a98a0ba9f104afb/container/containerd/handler.go#L163-L167

func (h *containerdContainerHandler) GetSpec() (info.ContainerSpec, error) {
    // TODO: Since we dont collect disk usage stats for containerd, we set hasFilesystem
    // to false. Revisit when we support disk usage stats for containerd
    hasFilesystem := false
    spec, err := common.GetSpec(h.cgroupPaths, h.machineInfoFactory, h.needNet(), hasFilesystem)
    spec.Labels = h.labels
    spec.Envs = h.envs
    spec.Image = h.image

    return spec, err
}
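
As a rough way to confirm this on a given node, the following Go sketch reports which of the filesystem metrics listed above are absent from a set of collected metric names. The helper missingFilesystemMetrics and the non-filesystem names in the example input are hypothetical; how you obtain the set of collected names is left open.

```go
package main

import "fmt"

// filesystemMetrics lists the metric names quoted in the comment above.
var filesystemMetrics = []string{
	"container_filesystem_available",
	"container_filesystem_capacity",
	"container_filesystem_usage",
	"container_filesystem_utilization",
}

// missingFilesystemMetrics returns the filesystem metrics that are absent
// from the collected set. Hypothetical helper for illustration only.
func missingFilesystemMetrics(collected map[string]bool) []string {
	var missing []string
	for _, name := range filesystemMetrics {
		if !collected[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Example input: the two names below are placeholders, not a claim
	// about what the agent emits.
	collected := map[string]bool{
		"example_cpu_metric":    true,
		"example_memory_metric": true,
	}
	for _, name := range missingFilesystemMetrics(collected) {
		fmt.Println("missing:", name)
	}
}
```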
pingleig commented 3 years ago

NOTE: Container filesystem usage is not provided after switching to containerd: https://github.com/google/cadvisor/issues/2785

Created another issue to track the container filesystem metrics: https://github.com/aws/amazon-cloudwatch-agent/issues/192

pingleig commented 3 years ago

Reopening this issue since we are still in the release process, and the official Container Insights public doc and sample manifest are not updated yet.

pingleig commented 3 years ago

Closing since the release is out.

fitchtech commented 3 years ago

This needs to be fixed within the official Helm charts for EKS: https://github.com/aws/eks-charts/blob/master/stable/aws-cloudwatch-metrics/templates/daemonset.yaml

fitchtech commented 3 years ago

@pingleig I have tried applying the fix listed above exactly as is on EKS with the containerd runtime enabled. However, I'm still getting the same error messages:

2021-08-21T00:08:59Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2021-08-21T00:08:59Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2021-08-21T00:09:00Z W! No pod metric collected, metrics count is still 5 is containerd socket mounted? https://github.com/aws/amazon-cloudwatch-agent/issues/188
2021-08-21T00:09:05Z W! [outputs.cloudwatchlogs] Invalid SequenceToken used, will use new token and retry: The given sequenceToken is invalid. The next expected sequenceToken is: 49605661750447750614958043896578931231172344896032866930
2021-08-21T00:09:05Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 105.761168ms before retrying.
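
For anyone scanning agent logs for this symptom, here is a small Go sketch that detects the warning and extracts the reported count. The helper podMetricWarning is hypothetical and not part of the agent; the pattern is taken from the warning text in the log above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// noPodMetricRe matches the warning text shown in the log above and
// captures the reported metrics count.
var noPodMetricRe = regexp.MustCompile(`W! No pod metric collected, metrics count is still (\d+)`)

// podMetricWarning reports whether line contains the warning and, if so,
// the metrics count it mentions. Hypothetical helper for illustration.
func podMetricWarning(line string) (count int, found bool) {
	m := noPodMetricRe.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	line := "2021-08-21T00:09:00Z W! No pod metric collected, metrics count is still 5 is containerd socket mounted? https://github.com/aws/amazon-cloudwatch-agent/issues/188"
	if count, ok := podMetricWarning(line); ok {
		fmt.Printf("pod metrics missing (count=%d); check the containerd socket mount\n", count)
	}
}
```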

Support for containerd runtime on EKS was added in July when EKS 1.21 was released. https://aws.amazon.com/blogs/containers/amazon-eks-1-21-released/

pingleig commented 3 years ago

@fitchtech The containerd socket on the host is at a different path (same as Bottlerocket). This is the PR for the EKS AMI: https://github.com/awslabs/amazon-eks-ami/pull/698/files and this is the config file: https://github.com/awslabs/amazon-eks-ami/blob/8450297eb2ef87fe5cbbce52a86ddcdc8b2e6716/files/containerd-config.toml#L1-L6

[grpc]
address = "/run/dockershim.sock"

You can follow the "Non default containerd path" section in https://github.com/aws/amazon-cloudwatch-agent/issues/188#issuecomment-803764697

          hostPath:
            # path: /run/containerd/containerd.sock
            # bottle rocket does not mount containerd sock at normal place
            # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
            path: /run/dockershim.sock

cc @sethAmazon: since both EKS EC2 and Bottlerocket are using /run/dockershim.sock, we may change this to be the default. However, I was testing with kops at that time, which uses /run/containerd/containerd.sock. I am not sure whether it's possible to have one manifest that works for both in our example manifest, though it should be doable with Helm.

fitchtech commented 3 years ago

@pingleig that worked, thank you. One additional change I had to make was to enable hostNetwork, because the EC2 instances in my EKS 1.21 node group have the Instance Metadata Service (IMDS) restricted per the EKS security best practices. You have to set hostNetwork: true for the agent to be able to start up. Once I did, everything loaded in the Container Insights console.

With hostNetwork: false I get this

2021/08/21 07:23:59 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml 
2021-08-21T07:23:59Z I! Starting AmazonCloudWatchAgent 1.247349.0
2021-08-21T07:23:59Z I! Loaded inputs: k8sapiserver cadvisor
2021-08-21T07:23:59Z I! Loaded aggregators: 
2021-08-21T07:23:59Z I! Loaded processors: ec2tagger k8sdecorator
2021-08-21T07:23:59Z I! Loaded outputs: cloudwatchlogs
2021-08-21T07:23:59Z I! Tags enabled: 
2021-08-21T07:23:59Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-106-12-9.ec2.internal", Flush Interval:1s
2021-08-21T07:23:59Z I! [logagent] starting
2021-08-21T07:23:59Z I! [logagent] found plugin cloudwatchlogs is a log backend

With hostNetwork: true

2021/08/21 07:28:18 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml 
2021-08-21T07:28:18Z I! Starting AmazonCloudWatchAgent 1.247349.0
2021-08-21T07:28:18Z I! Loaded inputs: cadvisor k8sapiserver
2021-08-21T07:28:18Z I! Loaded aggregators: 
2021-08-21T07:28:18Z I! Loaded processors: ec2tagger k8sdecorator
2021-08-21T07:28:18Z I! Loaded outputs: cloudwatchlogs
2021-08-21T07:28:18Z I! Tags enabled: 
2021-08-21T07:28:18Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-106-12-9.ec2.internal", Flush Interval:1s
2021-08-21T07:28:18Z I! [logagent] starting
2021-08-21T07:28:18Z I! [logagent] found plugin cloudwatchlogs is a log backend
2021-08-21T07:28:18Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2021-08-21T07:28:18Z I! k8sapiserver Switch New Leader: ip-10-106-12-14.ec2.internal
2021-08-21T07:28:19Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2021-08-21T07:28:19Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2021-08-21T07:28:26Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 137.608142ms before retrying.
2021-08-21T07:33:34Z I! [processors.ec2tagger] ec2tagger: Refresh is no longer needed, stop refreshTicker.

ec2tagger doesn't like not being able to access the instance metadata service, and the containers will restart. Once I set hostNetwork to true, I started seeing metrics flow into Container Insights. This was even though the DaemonSet is set to a service account that uses IAM Roles for Service Accounts (IRSA) with a policy that gives it ec2:DescribeVolumes and ec2:DescribeTags.
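
One way to check from inside a pod whether IMDS is reachable is to request an IMDSv2 session token. The Go sketch below does that with a short timeout. It assumes the restriction in question is the common IMDSv2 setup (for example, a hop limit that blocks pods outside the host network); the thread does not say exactly how IMDS was restricted. The helper newTokenRequest is hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// imdsBaseURL is the link-local address of the EC2 instance metadata service.
const imdsBaseURL = "http://169.254.169.254"

// newTokenRequest builds the IMDSv2 session-token request.
func newTokenRequest(baseURL string, ttlSeconds int) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPut, baseURL+"/latest/api/token", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", strconv.Itoa(ttlSeconds))
	return req, nil
}

func main() {
	req, err := newTokenRequest(imdsBaseURL, 60)
	if err != nil {
		fmt.Println("could not build request:", err)
		return
	}
	// Short timeout so the probe fails fast when IMDS is unreachable
	// from the pod network.
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("IMDS not reachable:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("IMDS token endpoint status:", resp.Status)
}
```

If this times out with hostNetwork: false but succeeds with hostNetwork: true, that would be consistent with the ec2tagger behavior shown in the two logs above.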

Can an update be made that allows this to work without host network enabled on the daemonset?

fitchtech commented 3 years ago

Also, the IAM policy document attached to the IRSA role needs to allow sts:AssumeRoleWithWebIdentity and sts:AssumeRole, resource-restricted to the IRSA role ARN, or the assume role API call will throw access denied errors.

fitchtech commented 3 years ago

The official EKS Helm charts for CloudWatch Metrics should be updated to handle the containerd socket path, rather than relying on hand-applied manifests, so that Helm templates can conditionally set the hostPath based on the values provided.