aws / amazon-cloudwatch-agent

CloudWatch Agent enables you to collect and export host-level metrics and logs on instances running Linux or Windows Server.
MIT License

[K8E mode] CloudWatch Agent shutting down when configured to collect container insights metric on EC2 K8s setup #1215

Open dhirajk-secsol opened 2 weeks ago

dhirajk-secsol commented 2 weeks ago

Describe the bug

The CloudWatch agent crashes when collecting Container Insights metrics in K8s-on-EC2 (K8E) mode, but works as expected in K8s on-premises (K8OP) mode.

I am running the CloudWatch agent as a DaemonSet to collect Container Insights metrics in a MicroK8s setup on an AWS EC2 instance. When I configure the agent to collect Kubernetes container insights, it starts and then shuts down immediately with the following error:

2024-06-17T00:49:16Z I! CWAGENT_LOG_LEVEL is set to "DEBUG"
2024-06-17T00:49:16Z I! Starting AmazonCloudWatchAgent CWAgent/1.300039.0b612 (go1.22.2; linux; amd64) with log file  with log target lumberjack
2024-06-17T00:49:16Z I! AWS SDK log level not set
2024-06-17T00:49:17Z I! {"caller":"service@v0.98.0/telemetry.go:47","msg":"Skipping telemetry setup.","address":"","level":"None"}
2024-06-17T00:49:17Z D! {"caller":"extension@v0.98.0/extension.go:165","msg":"Alpha component. May change in the future.","kind":"extension","name":"agenthealth/logs"}
2024-06-17T00:49:17Z D! {"caller":"exporter@v0.98.0/exporter.go:273","msg":"Beta component. May change in the future.","kind":"exporter","data_type":"metrics","name":"awsemf/containerinsights"}
2024-06-17T00:49:17Z D! {"caller":"awsutil@v0.98.0/conn.go:98","msg":"Using proxy address: ","kind":"exporter","data_type":"metrics","name":"awsemf/containerinsights","proxyAddr":""}
2024-06-17T00:49:17Z D! {"caller":"awsutil@v0.98.0/conn.go:194","msg":"Fetch region from commandline/config file","kind":"exporter","data_type":"metrics","name":"awsemf/containerinsights","region":"ap-northeast-1"}
2024-06-17T00:49:17Z D! {"caller":"awsutil@v0.98.0/conn.go:366","msg":"Fallback shared config file(s)","kind":"exporter","data_type":"metrics","name":"awsemf/containerinsights","files":["/.aws/credentials"]}
2024-06-17T00:49:17Z D! {"caller":"awsutil@v0.98.0/conn.go:390","msg":"Using credential from session","kind":"exporter","data_type":"metrics","name":"awsemf/containerinsights","access-key":"XXXXXXXXXXXX","provider":"EnvConfigCredentials"}
2024-06-17T00:49:17Z W! {"caller":"awsemfexporter@v0.98.0/emf_exporter.go:99","msg":"the default value for DimensionRollupOption will be changing to NoDimensionRollupin a future release. See https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/23997 for moreinformation","kind":"exporter","data_type":"metrics","name":"awsemf/containerinsights"}
2024-06-17T00:49:17Z D! {"caller":"processor@v0.98.0/processor.go:301","msg":"Beta component. May change in the future.","kind":"processor","name":"batch/containerinsights","pipeline":"metrics/containerinsights"}
2024-06-17T00:49:17Z I! {"caller":"service@v0.98.0/service.go:143","msg":"Starting CWAgent...","Version":"1.300039.0b612","NumCPU":16}
2024-06-17T00:49:17Z I! {"caller":"extensions/extensions.go:34","msg":"Starting extensions..."}
2024-06-17T00:49:17Z I! {"caller":"extensions/extensions.go:37","msg":"Extension is starting...","kind":"extension","name":"agenthealth/logs"}
2024-06-17T00:49:17Z I! {"caller":"extensions/extensions.go:52","msg":"Extension started.","kind":"extension","name":"agenthealth/logs"}
2024-06-17T00:49:17Z D! {"caller":"awsmiddleware@v0.0.0-20240503173519-cc2b921759f4/helper.go:18","msg":"Configured middleware on AWS client","kind":"exporter","data_type":"metrics","name":"awsemf/containerinsights","middleware":"agenthealth/logs"}
2024-06-17T00:49:17Z D! {"caller":"awsutil@v0.98.0/conn.go:98","msg":"Using proxy address: ","kind":"receiver","name":"awscontainerinsightreceiver","data_type":"metrics","proxyAddr":""}
2024-06-17T00:49:17Z D! {"caller":"awsutil@v0.98.0/conn.go:194","msg":"Fetch region from commandline/config file","kind":"receiver","name":"awscontainerinsightreceiver","data_type":"metrics","region":"ap-northeast-1"}
2024-06-17T00:49:17Z D! {"caller":"awsutil@v0.98.0/conn.go:366","msg":"Fallback shared config file(s)","kind":"receiver","name":"awscontainerinsightreceiver","data_type":"metrics","files":["/.aws/credentials"]}
2024-06-17T00:49:17Z D! {"caller":"awsutil@v0.98.0/conn.go:390","msg":"Using credential from session","kind":"receiver","name":"awscontainerinsightreceiver","data_type":"metrics","access-key":"XXXXXXXXXXXX","provider":"EnvConfigCredentials"}
2024-06-17T00:49:17Z I! {"caller":"host/ec2metadata.go:78","msg":"Fetch instance id and type from ec2 metadata","kind":"receiver","name":"awscontainerinsightreceiver","data_type":"metrics"}
2024-06-17T00:49:17Z I! {"caller":"service@v0.98.0/service.go:206","msg":"Starting shutdown..."}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x33b01ae]

goroutine 1 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscontainerinsightreceiver/internal/stores.(*K8sDecorator).Shutdown(0x3cb01a0?)
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscontainerinsightreceiver@v0.98.0/internal/stores/store.go:105 +0xe
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscontainerinsightreceiver.(*awsContainerInsightReceiver).Shutdown(0xc000c0e340, {0xc000d78210?, 0xc000f12bc8?})
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscontainerinsightreceiver@v0.98.0/receiver.go:341 +0x5bb
go.opentelemetry.io/collector/service/internal/graph.(*Graph).ShutdownAll(0xc000319ce0, {0x50ffcd0, 0x78b5600})
        go.opentelemetry.io/collector/service@v0.98.0/internal/graph/graph.go:435 +0x1a8
go.opentelemetry.io/collector/service.(*Service).Shutdown(0xc000a2d320, {0x50ffcd0, 0x78b5600})
        go.opentelemetry.io/collector/service@v0.98.0/service.go:212 +0xcf
go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents(0xc000962690, {0x50ffcd0, 0x78b5600})
        go.opentelemetry.io/collector/otelcol@v0.98.0/collector.go:207 +0x76a
go.opentelemetry.io/collector/otelcol.(*Collector).Run(0xc000962690, {0x50ffcd0, 0x78b5600})
        go.opentelemetry.io/collector/otelcol@v0.98.0/collector.go:249 +0x52
go.opentelemetry.io/collector/otelcol.NewCommand.func1(0xc000656c08, {0x4746cce?, 0x7?, 0x4741772?})
        go.opentelemetry.io/collector/otelcol@v0.98.0/command.go:35 +0xa7
github.com/spf13/cobra.(*Command).execute(0xc000656c08, {0xc000697580, 0x1, 0x1})
        github.com/spf13/cobra@v1.8.0/command.go:983 +0xaca
github.com/spf13/cobra.(*Command).ExecuteC(0xc000656c08)
        github.com/spf13/cobra@v1.8.0/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0xc001086e10?)
        github.com/spf13/cobra@v1.8.0/command.go:1039 +0x13
main.runAgent({0x50ffd40, 0xc000e820f0}, {0x78b5600, 0x0, 0x0}, {0x78b5600, 0x0, 0x0})
        github.com/aws/amazon-cloudwatch-agent/cmd/amazon-cloudwatch-agent/amazon-cloudwatch-agent.go:358 +0x1012
main.reloadLoop(0xc0000f2120, {0x78b5600, 0x0, 0x0}, {0x78b5600, 0x0, 0x0}, {0xc000b0dde0, 0x0, 0x0}, ...)
        github.com/aws/amazon-cloudwatch-agent/cmd/amazon-cloudwatch-agent/amazon-cloudwatch-agent.go:178 +0x347
main.main()
        github.com/aws/amazon-cloudwatch-agent/cmd/amazon-cloudwatch-agent/amazon-cloudwatch-agent.go:605 +0xa5c

The same CloudWatch agent setup (K8E mode) works as expected when configured to collect CPU, disk, and memory metrics, and also when configured to collect Prometheus metrics; it only shuts down when configured to collect Kubernetes container insights (see the sketch below for a reading of the stack trace).
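
From the stack trace above, the panic happens inside (*K8sDecorator).Shutdown while the collector is already shutting down, which looks like Shutdown being called on a decorator that was never fully initialized, presumably because startup failed earlier in EC2 mode. Below is a minimal Go sketch of that failure shape, assuming the decorator is left nil when startup fails; the type, field, and method names are illustrative stand-ins, not the upstream definitions.

package main

import "fmt"

// Illustrative stand-ins for the receiver's internals; podStore and the
// store field are hypothetical names, not the upstream definitions.
type podStore struct{}

func (p *podStore) shutdown() {
    fmt.Println("store stopped")
}

type K8sDecorator struct {
    store *podStore // stays nil if startup fails before initialization
}

func (k *K8sDecorator) Shutdown() {
    // Without this guard, calling Shutdown after a failed start
    // dereferences a nil pointer, matching the SIGSEGV in the trace above.
    if k == nil || k.store == nil {
        return
    }
    k.store.shutdown()
}

func main() {
    var d *K8sDecorator // startup failed; the decorator was never created
    d.Shutdown()        // safe only because of the nil guard above
    fmt.Println("shutdown completed without panic")
}

Whether the actual fix takes this exact shape is an assumption; the sketch only illustrates why a nil decorator turns a failed start into a SIGSEGV during shutdown.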

Details about my setup:

The same setup runs as expected in my local (non-EC2) environment, where the agent runs in on-premises mode and collects container insights as expected.

Steps to reproduce: Run the CloudWatch agent as a DaemonSet on AWS EC2 with MicroK8s/K8s using the following template (see also the note on socket paths after the template).

apiVersion: v1
kind: Namespace
metadata:
  name: amazon-cloudwatch

---
# create configmap for cwagent container insights config
apiVersion: v1
data:
  # cwagent json config
  cwagentconfig.json: |
    {
      "agent": {
        "region": "ap-northeast-1",
        "debug": true
      },
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "June17",
            "enhanced_container_insights": true
          }
        },
        "force_flush_interval": 5
      }
    }
kind: ConfigMap
metadata:
  name: cwagentconfig
  namespace: amazon-cloudwatch

---
# create cwagent service account and role binding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cloudwatch-agent-role
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints","services"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets","deployments","daemonsets","statefulsets"]
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/stats", "configmaps", "events"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cwagent-clusterleader"]
    verbs: ["get","update"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get", "list", "watch"]

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cloudwatch-agent-role-binding
subjects:
  - kind: ServiceAccount
    name: cloudwatch-agent
    namespace: amazon-cloudwatch
roleRef:
  kind: ClusterRole
  name: cloudwatch-agent-role
  apiGroup: rbac.authorization.k8s.io
---

# deploy cwagent as daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      containers:
        - name: cloudwatch-agent
          image: public.ecr.aws/cloudwatch-agent/cloudwatch-agent:1.300039.0b612
          imagePullPolicy: Always
          #ports:
          #  - containerPort: 8125
          #    hostPort: 8125
          #    protocol: UDP
          resources:
            limits:
              cpu:  400m
              memory: 400Mi
            requests:
              cpu: 400m
              memory: 400Mi
          # Please don't change below envs
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.0.1"
            - name: AWS_ACCESS_KEY_ID
              value: "xxxxxx"
            - name: AWS_SECRET_ACCESS_KEY
              value: "xxxxxx"
          # Please don't change the mountPath
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: kubeletsock
              mountPath: /var/lib/kubelet/pod-resources/kubelet.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            #path: /run/containerd/containerd.sock
            path: /var/snap/microk8s/common/run/containerd.sock
        - name: kubeletsock
          hostPath:
            path: /var/lib/kubelet/pod-resources/kubelet.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 180
      serviceAccountName: cloudwatch-agent
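
A note on the socket paths in the template above: the containerd socket hostPath is already overridden to the MicroK8s snap location. If MicroK8s keeps its kubelet state under the same snap prefix (an assumption about the MicroK8s layout that I have not verified on this node), the kubelet pod-resources volume may need a similar override, roughly:

        - name: kubeletsock
          hostPath:
            # Hypothetical path, assuming MicroK8s places the kubelet
            # directory under its snap prefix; verify on the node first.
            path: /var/snap/microk8s/common/var/lib/kubelet/pod-resources/kubelet.sock

This is only a configuration sketch and is unrelated to the shutdown panic itself.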

What did you expect to see? The CloudWatch agent to run, collect container insights metrics, and push them to CloudWatch Logs.

What did you see instead? It crashed, without proper error logs, before it started collecting the metrics.

What version did you use? Version: 1.300036.0b573 and 1.300039.0b612

What config did you use? Config:

    {
      "agent": {
        "region": "ap-northeast-1",
        "debug": true
      },
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "June17",
            "enhanced_container_insights": true
          }
        },
        "force_flush_interval": 5
      }
    }

Environment OS: Ubuntu 22.04.4

Additional context: I am able to run the same setup in my local (non-EC2) environment, where the agent runs in on-premises (K8OP) mode and collects container insights as expected.