Open pingleig opened 3 years ago
Created a temp image public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1 based on #189 (the latest official release now contains this fix). The DaemonSet YAML needs to be updated to mount /run/containerd/containerd.sock.
NOTE: If you are using Bottlerocket on EKS, the socket on the host is different due to https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b. You need to (and only need to) replace the volumes part to pick the right socket on the host. (The full snippet is at the end of this comment.)
volumes:
  # ...
  - name: containerdsock
    hostPath:
      # path: /run/containerd/containerd.sock
      # bottlerocket does not mount containerd sock at normal place
      # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
      path: /run/dockershim.sock
When host (and kubelet) is using /run/containerd/containerd.sock
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      containers:
        - name: cloudwatch-agent
          image: public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1
          imagePullPolicy: Always
          #ports:
          #  - containerPort: 8125
          #    hostPort: 8125
          #    protocol: UDP
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
          # Please don't change below envs
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.3.0"
          # Please don't change the mountPath
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            path: /run/containerd/containerd.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 60
      serviceAccountName: cloudwatch-agent
NOTE: You only need to change the volumes. When mounting the socket into the CloudWatch agent container, you should still put it at the default path.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      # aws eks update-kubeconfig --name eks-pod-metric-missing --region us-west-2
      containers:
        - name: cloudwatch-agent
          image: public.ecr.aws/p5m3p1a7/cwagent-k8s-containerd-pod:0.1
          imagePullPolicy: Always
          #ports:
          #  - containerPort: 8125
          #    hostPort: 8125
          #    protocol: UDP
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
          # Please don't change below envs
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.3.0"
          # Please don't change the mountPath
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            # path: /run/containerd/containerd.sock
            # bottle rocket does not mount containerd sock at normal place
            # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
            path: /run/dockershim.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 60
      serviceAccountName: cloudwatch-agent
Another known issue: because we are using cadvisor, pod-level filesystem usage is ignored.
"container_filesystem_available",
"container_filesystem_capacity",
"container_filesystem_usage",
"container_filesystem_utilization"
func (h *containerdContainerHandler) GetSpec() (info.ContainerSpec, error) {
	// TODO: Since we dont collect disk usage stats for containerd, we set hasFilesystem
	// to false. Revisit when we support disk usage stats for containerd
	hasFilesystem := false
	spec, err := common.GetSpec(h.cgroupPaths, h.machineInfoFactory, h.needNet(), hasFilesystem)
	spec.Labels = h.labels
	spec.Envs = h.envs
	spec.Image = h.image
	return spec, err
}
NOTE: container file system usage is not provided after switching to containerd https://github.com/google/cadvisor/issues/2785
Created another issue to track the container filesystem metrics https://github.com/aws/amazon-cloudwatch-agent/issues/192
Reopening this issue since we are still in the release process, and the official Container Insights public doc and sample manifest are not updated yet.
This needs to be fixed in the official helm charts for EKS: https://github.com/aws/eks-charts/blob/master/stable/aws-cloudwatch-metrics/templates/daemonset.yaml
@pingleig I have tried applying the fix listed above exactly as is on EKS with the containerd runtime enabled. However, I'm still getting the same error messages:
2021-08-21T00:08:59Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2021-08-21T00:08:59Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2021-08-21T00:09:00Z W! No pod metric collected, metrics count is still 5 is containerd socket mounted? https://github.com/aws/amazon-cloudwatch-agent/issues/188
2021-08-21T00:09:05Z W! [outputs.cloudwatchlogs] Invalid SequenceToken used, will use new token and retry: The given sequenceToken is invalid. The next expected sequenceToken is: 49605661750447750614958043896578931231172344896032866930
2021-08-21T00:09:05Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 105.761168ms before retrying.
Support for containerd runtime on EKS was added in July when EKS 1.21 was released. https://aws.amazon.com/blogs/containers/amazon-eks-1-21-released/
@fitchtech. The containerd socket on the host is at a different path (same as Bottlerocket). This is the PR for the EKS AMI https://github.com/awslabs/amazon-eks-ami/pull/698/files and this is the config file https://github.com/awslabs/amazon-eks-ami/blob/8450297eb2ef87fe5cbbce52a86ddcdc8b2e6716/files/containerd-config.toml#L1-L6
[grpc]
address = "/run/dockershim.sock"
You can follow the non-default path example in https://github.com/aws/amazon-cloudwatch-agent/issues/188#issuecomment-803764697
hostPath:
  # path: /run/containerd/containerd.sock
  # bottle rocket does not mount containerd sock at normal place
  # https://github.com/bottlerocket-os/bottlerocket/commit/91810c85b83ff4c3660b496e243ef8b55df0973b
  path: /run/dockershim.sock
cc @sethAmazon since both EKS EC2 and Bottlerocket are using /run/dockershim.sock, we may change this to the default. I was testing using kops at that time, though, which uses /run/containerd/containerd.sock. I am not sure if it's possible to have one manifest that works for both in our example manifest, though it should be doable for helm.
@pingleig that worked, thank you. One additional change I had to make was to enable hostNetwork, because the EC2 instances in my EKS 1.21 node group have the Instance Metadata Service (IMDS) restricted per the EKS security best practices. You have to set hostNetwork: true for the agent to be able to start up. Once I did, everything loaded in the ContainerInsights console.
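For readers applying the same change, a minimal sketch of where hostNetwork sits in the DaemonSet from the earlier comment. Only hostNetwork: true comes from this thread. The dnsPolicy line is an assumption added here: Kubernetes recommends ClusterFirstWithHostNet for host-network pods that still need cluster DNS, but it was not part of the reported fix.

```yaml
# Fragment of the cloudwatch-agent DaemonSet; everything not shown stays unchanged.
spec:
  template:
    spec:
      # Run the pod in the node's network namespace so the agent can reach IMDS
      # when access from pod networking is restricted.
      hostNetwork: true
      # Assumption (not from this thread): keeps in-cluster DNS resolution
      # working for a pod that uses the host network.
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: cloudwatch-agent
          # ... image, env, volumeMounts as in the manifest above
```

Note that hostNetwork gives the pod access to the node's network interfaces, so it weakens the isolation that the IMDS restriction was meant to provide.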
With hostNetwork: false I get this
2021/08/21 07:23:59 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2021-08-21T07:23:59Z I! Starting AmazonCloudWatchAgent 1.247349.0
2021-08-21T07:23:59Z I! Loaded inputs: k8sapiserver cadvisor
2021-08-21T07:23:59Z I! Loaded aggregators:
2021-08-21T07:23:59Z I! Loaded processors: ec2tagger k8sdecorator
2021-08-21T07:23:59Z I! Loaded outputs: cloudwatchlogs
2021-08-21T07:23:59Z I! Tags enabled:
2021-08-21T07:23:59Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-106-12-9.ec2.internal", Flush Interval:1s
2021-08-21T07:23:59Z I! [logagent] starting
2021-08-21T07:23:59Z I! [logagent] found plugin cloudwatchlogs is a log backend
With hostNetwork: true
2021/08/21 07:28:18 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2021-08-21T07:28:18Z I! Starting AmazonCloudWatchAgent 1.247349.0
2021-08-21T07:28:18Z I! Loaded inputs: cadvisor k8sapiserver
2021-08-21T07:28:18Z I! Loaded aggregators:
2021-08-21T07:28:18Z I! Loaded processors: ec2tagger k8sdecorator
2021-08-21T07:28:18Z I! Loaded outputs: cloudwatchlogs
2021-08-21T07:28:18Z I! Tags enabled:
2021-08-21T07:28:18Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-10-106-12-9.ec2.internal", Flush Interval:1s
2021-08-21T07:28:18Z I! [logagent] starting
2021-08-21T07:28:18Z I! [logagent] found plugin cloudwatchlogs is a log backend
2021-08-21T07:28:18Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2021-08-21T07:28:18Z I! k8sapiserver Switch New Leader: ip-10-106-12-14.ec2.internal
2021-08-21T07:28:19Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2021-08-21T07:28:19Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2021-08-21T07:28:26Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 137.608142ms before retrying.
2021-08-21T07:33:34Z I! [processors.ec2tagger] ec2tagger: Refresh is no longer needed, stop refreshTicker.
ec2tagger doesn't like not being able to access the instance metadata service, and the containers will restart. Once I set hostNetwork to true I started seeing metrics flow into ContainerInsights. This was even though the DaemonSet is set to a service account that uses IAM Roles for Service Accounts (IRSA) with a policy that gives it ec2:DescribeVolumes and ec2:DescribeTags.
Can an update be made that allows this to work without host network enabled on the daemonset?
Also, the IAM policy document attached to the IRSA role needs to allow sts:AssumeRoleWithWebIdentity and sts:AssumeRole, with the resource restricted to the IRSA role ARN, or it will throw access denied errors on the assume role API call.
The official EKS helm charts for CloudWatch Metrics should be updated to do this instead of applying manifests so that you can use helm templates to conditionally set those based on values provided.
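A rough sketch of what such a conditional template could look like. The value name containerdSockPath is hypothetical here and is not taken from the published chart. The default function is part of Helm's template functions. The volume name and paths are the ones used in the manifests earlier in this thread.

```yaml
# values.yaml (hypothetical key, shown for illustration only)
# Leave unset for hosts such as kops nodes that use /run/containerd/containerd.sock;
# set to /run/dockershim.sock for the EKS AMI and Bottlerocket.
containerdSockPath: /run/dockershim.sock
---
# templates/daemonset.yaml (fragment)
      volumes:
        - name: containerdsock
          hostPath:
            # Host-side path is configurable; the mountPath inside the
            # container stays at /run/containerd/containerd.sock.
            path: {{ .Values.containerdSockPath | default "/run/containerd/containerd.sock" }}
```

With a value like this, the same chart could serve both socket locations, for example by passing --set containerdSockPath=/run/dockershim.sock at install time.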
This is exported from internal ticket
TL;DR
The latest image is released. If you were using the temp image from this comment https://github.com/aws/amazon-cloudwatch-agent/issues/188#issuecomment-803764697, please update to the latest tag.
If the error message
W! No pod metric collected, metrics count is still 7 is containerd socket mounted? https://github.com/aws/amazon-cloudwatch-agent/issues/188
leads you to this issue, check whether the containerd socket on your host is at /run/dockershim.sock instead of /run/containerd/containerd.sock.
Background
We were relying on the pause container having the container name POD to detect a pod, which is the case for docker but not for containerd (https://github.com/containerd/cri/issues/922#issuecomment-423729537). Users will not see pod metrics in the Container Insights dashboard, and they will find the following log, which was introduced in #171:
https://github.com/aws/amazon-cloudwatch-agent/blob/fbdd619269be7a00172e06992a8d40b22be1a6d7/plugins/inputs/cadvisor/container_info_processor.go#L72-L72
The root cause is that we expect containerName == 'POD' to mark a path as a pod:
https://github.com/aws/amazon-cloudwatch-agent/blob/fbdd619269be7a00172e06992a8d40b22be1a6d7/plugins/inputs/cadvisor/container_info_processor.go#L119-L126
Fix
Release
The fix will be included in the next release; the release date is not determined yet.