aws / amazon-cloudwatch-agent

CloudWatch Agent enables you to collect and export host-level metrics and logs on instances running Linux or Windows server.
MIT License

CW Agent not recognizing enhanced_container_insights=true using EKS/ConfigMap #1030

Closed: kangadrewie closed this issue 6 months ago

kangadrewie commented 6 months ago

Describe the bug

Recently, the eks-charts team added support for passing enhanced_container_insights into the CW Agent ConfigMap (https://github.com/aws/eks-charts/pull/1041).

However, even though my ConfigMap has been updated to turn Enhanced Observability on, the CW agent does not appear to recognise it. Looking at the config translator, none of the `if enhancedContainerInsightsEnabled {` conditions evaluate to true, so no enhanced metrics are added to the awsemf/containerinsights exporter config.

Is anyone able to reproduce this issue on EKS?

Steps to reproduce

Use v0.0.10 of the aws-cloudwatch-metrics chart with Enhanced Observability metrics enabled on EKS (https://github.com/aws/eks-charts/pull/1041).

What did you expect to see?

Enhanced Observability metrics being pushed to CW / Agent's OTEL config updated when enhanced_container_insights=true

What did you see instead?

Default Kubernetes Container Insights config being loaded.

What version did you use?

Version: v1.300032.3b392

What config did you use? Config: (e.g. the agent json config file)

Environment OS: (e.g., "Ubuntu 20.04")

Additional context

ConfigMap

apiVersion: v1
data:
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "<cluster-name>",
            "enhanced_container_insights": "true",
            "metrics_collection_interval": 60
          }
        },
        "force_flush_interval": 5
      }
    }
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: aws-cloudwatch-metrics
    meta.helm.sh/release-namespace: kube-addons
  creationTimestamp: "2024-02-10T18:21:50Z"
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-cloudwatch-metrics
    app.kubernetes.io/version: 1.300032.2b361
    helm.sh/chart: aws-cloudwatch-metrics-0.0.10
  name: aws-cloudwatch-metrics
  namespace: kube-addons
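
One detail in the ConfigMap above is that enhanced_container_insights is set to the JSON string "true" rather than the boolean true. The thread does not confirm the exact root cause, so the following is only a minimal sketch of how a quoted value could be silently treated as disabled, assuming the flag is read with a Go bool type assertion. `flagEnabled` is a hypothetical helper written for illustration, not the agent's actual translator code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flagEnabled is a hypothetical helper, not the agent's real code. It walks
// logs.metrics_collected.kubernetes and returns true only when
// enhanced_container_insights is a JSON boolean with the value true.
func flagEnabled(raw []byte) (bool, error) {
	var cfg map[string]any
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return false, err
	}
	var node any = cfg
	for _, key := range []string{"logs", "metrics_collected", "kubernetes", "enhanced_container_insights"} {
		m, ok := node.(map[string]any)
		if !ok {
			return false, nil // a parent section is missing or not an object
		}
		node = m[key]
	}
	// A JSON string such as "true" decodes to a Go string, so this
	// assertion fails and the flag is reported as disabled.
	enabled, ok := node.(bool)
	return ok && enabled, nil
}

func main() {
	quoted := []byte(`{"logs":{"metrics_collected":{"kubernetes":{"enhanced_container_insights":"true"}}}}`)
	unquoted := []byte(`{"logs":{"metrics_collected":{"kubernetes":{"enhanced_container_insights":true}}}}`)

	a, _ := flagEnabled(quoted)
	b, _ := flagEnabled(unquoted)
	fmt.Println("quoted:", a, "unquoted:", b)
}
```

Under that assumption, the quoted form still parses as valid JSON but never enables the flag, which would be consistent with the "Valid Json input schema" line and the `enhanced_container_insights: false` exporter setting in the logs below.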

CloudWatch agent pod logs


D! [EC2] Found active network interface
I! imds retry client will retry 1 times
I! Detected the instance is EC2
2024/02/11 13:15:15 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json ...
/opt/aws/amazon-cloudwatch-agent/bin/default_linux_config.json does not exist or cannot read. Skipping it.
2024/02/11 13:15:15 Reading json config file path: /etc/cwagentconfig/..2024_02_11_13_15_14.2575907930/cwagentconfig.json ...
2024/02/11 13:15:15 Find symbolic link /etc/cwagentconfig/..data
2024/02/11 13:15:15 Find symbolic link /etc/cwagentconfig/cwagentconfig.json
2024/02/11 13:15:15 Reading json config file path: /etc/cwagentconfig/cwagentconfig.json ...
2024/02/11 13:15:15 I! Valid Json input schema.
I! Trying to detect region from ec2
2024/02/11 13:15:15 I! attempt to access ECS task metadata to determine whether I'm running in ECS.
2024/02/11 13:15:16 W! retry [0/3], unable to get http response from http://<>/v2/metadata, error: unable to get response from http://<>/v2/metadata, error: Get "http://<>/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024/02/11 13:15:17 W! retry [1/3], unable to get http response from http://<>/v2/metadata, error: unable to get response from http://<>/v2/metadata, error: Get "http://<>/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024/02/11 13:15:18 W! retry [2/3], unable to get http response from http://<>/v2/metadata, error: unable to get response from http://<>/v2/metadata, error: Get "http://<>/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024/02/11 13:15:18 I! access ECS task metadata fail with response unable to get response from http://<>/v2/metadata, error: Get "http://<>/v2/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers), assuming I'm not running in ECS.
2024/02/11 13:15:18 Configuration validation first phase succeeded
2024/02/11 13:15:18 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2024/02/11 13:15:18 D! config [agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = "ip-<>.ec2.internal"
  interval = "60s"
  logfile = ""
  logtarget = "lumberjack"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = true
  precision = ""
  quiet = false
  round_interval = false

[outputs]

  [[outputs.cloudwatchlogs]]
    force_flush_interval = "5s"
    log_stream_name = "ip-<>.ec2.internal"
    mode = "EC2"
    region = "us-east-1"
    region_type = "EC2M"
2024/02/11 13:15:18 I! Config has been translated into YAML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.yaml
2024/02/11 13:15:18 D! config connectors: {}
exporters:
    awsemf/containerinsights:
        certificate_file_path: ""
        detailed_metrics: false
        dimension_rollup_option: NoDimensionRollup
        disable_metric_extraction: false
        eks_fargate_container_insights_enabled: false
        endpoint: ""
        enhanced_container_insights: false
        imds_retries: 1
        local_mode: false
        log_group_name: /aws/containerinsights/{ClusterName}/performance
        log_retention: 0
        log_stream_name: '{NodeName}'
        max_retries: 2
        metric_declarations:
            - dimensions:
                - - ClusterName
                  - Namespace
                  - PodName
                - - ClusterName
                - - ClusterName
                  - Namespace
                  - Service
                - - ClusterName
                  - Namespace
              label_matchers: []
              metric_name_selectors:
                - pod_cpu_utilization
                - pod_memory_utilization
                - pod_network_rx_bytes
                - pod_network_tx_bytes
                - pod_cpu_utilization_over_pod_limit
                - pod_memory_utilization_over_pod_limit
            - dimensions:
                - - ClusterName
                  - Namespace
                  - PodName
              label_matchers: []
              metric_name_selectors:
                - pod_number_of_container_restarts
            - dimensions:
                - - ClusterName
                  - Namespace
                  - PodName
                - - ClusterName
              label_matchers: []
              metric_name_selectors:
                - pod_cpu_reserved_capacity
                - pod_memory_reserved_capacity
            - dimensions:
                - - ClusterName
                  - InstanceId
                  - NodeName
                - - ClusterName
              label_matchers: []
              metric_name_selectors:
                - node_cpu_utilization
                - node_memory_utilization
                - node_network_total_bytes
                - node_cpu_reserved_capacity
                - node_memory_reserved_capacity
                - node_number_of_running_pods
                - node_number_of_running_containers
            - dimensions:
                - - ClusterName
              label_matchers: []
              metric_name_selectors:
                - node_cpu_usage_total
                - node_cpu_limit
                - node_memory_working_set
                - node_memory_limit
            - dimensions:
                - - ClusterName
                  - InstanceId
                  - NodeName
                - - ClusterName
              label_matchers: []
              metric_name_selectors:
                - node_filesystem_utilization
            - dimensions:
                - - ClusterName
                  - Namespace
                  - Service
                - - ClusterName
              label_matchers: []
              metric_name_selectors:
                - service_number_of_running_pods
            - dimensions:
                - - ClusterName
                  - Namespace
                - - ClusterName
              label_matchers: []
              metric_name_selectors:
                - namespace_number_of_running_pods
            - dimensions:
                - - ClusterName
              label_matchers: []
              metric_name_selectors:
                - cluster_node_count
                - cluster_failed_node_count
        metric_descriptors: []
        middleware: agenthealth/logs
        namespace: ContainerInsights
        no_verify_ssl: false
        num_workers: 8
        output_destination: cloudwatch
        parse_json_encoded_attr_values:
            - Sources
            - kubernetes
        profile: ""
        proxy_address: ""
        region: us-east-1
        request_timeout_seconds: 30
        resource_arn: ""
        resource_to_telemetry_conversion:
            enabled: true
        retain_initial_value_of_delta_metric: false
        role_arn: ""
        shared_credentials_file: []
        version: "0"
extensions:
    agenthealth/logs:
        is_usage_data_enabled: true
        stats:
            operations:
                - PutLogEvents
processors:
    batch/containerinsights:
        metadata_cardinality_limit: 1000
        metadata_keys: []
        send_batch_max_size: 0
        send_batch_size: 8192
        timeout: 5s
receivers:
    awscontainerinsightreceiver:
        add_container_name_metric_label: false
        add_full_pod_name_metric_label: false
        add_service_as_attribute: true
        certificate_file_path: ""
        cluster_name: <cluster-name>
        collection_interval: 1m0s
        container_orchestrator: eks
        enable_control_plane_metrics: false
        endpoint: ""
        imds_retries: 1
        leader_lock_name: cwagent-clusterleader
        leader_lock_using_config_map_only: true
        local_mode: false
        max_retries: 0
        no_verify_ssl: false
        num_workers: 0
        prefer_full_pod_name: false
        profile: ""
        proxy_address: ""
        region: us-east-1
        request_timeout_seconds: 0
        resource_arn: ""
        role_arn: ""
        shared_credentials_file: []
service:
    extensions:
        - agenthealth/logs
    pipelines:
        metrics/containerinsights:
            exporters:
                - awsemf/containerinsights
            processors:
                - batch/containerinsights
            receivers:
                - awscontainerinsightreceiver
    telemetry:
        logs:
            development: false
            disable_caller: false
            disable_stacktrace: false
            encoding: console
            error_output_paths: []
            initial_fields: {}
            level: info
            output_paths: []
            sampling:
                enabled: true
                initial: 2
                thereafter: 500
                tick: 10s
        metrics:
            address: ""
            level: None
            readers: []
        resource: {}
        traces:
            processors: []
            propagators: []
2024-02-11T13:15:18Z I! Starting AmazonCloudWatchAgent CWAgent/1.300032.3b392 (go1.21.5; linux; amd64) with log file  with log target lumberjack
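
The translated YAML in the logs above is long, and the relevant evidence is two exporter settings: `detailed_metrics: false` and `enhanced_container_insights: false`. As a rough sketch of how such a dump could be checked automatically, the hypothetical helper `findSetting` below scans text line by line for a `key: value` entry. It is a plain text scan written for illustration, not a YAML parser and not part of the agent.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// findSetting is a hypothetical helper: it scans dumped config text line by
// line and returns the value of the first "key: value" line whose key matches.
func findSetting(dump, key string) (string, bool) {
	prefix := key + ":"
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, prefix) {
			return strings.TrimSpace(strings.TrimPrefix(line, prefix)), true
		}
	}
	return "", false
}

func main() {
	// Excerpt copied from the exporter section of the pod logs above.
	dump := `exporters:
    awsemf/containerinsights:
        detailed_metrics: false
        enhanced_container_insights: false
`
	for _, key := range []string{"detailed_metrics", "enhanced_container_insights"} {
		if v, ok := findSetting(dump, key); ok {
			fmt.Printf("%s = %s\n", key, v)
		}
	}
}
```

A scan like this only reports the first matching line, so it suits a quick check of pod logs rather than a config with several exporters.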
kangadrewie commented 6 months ago

This is not a CW Agent issue; it is a config issue in the Helm chart. A PR to fix it is referenced above.

ivan-sukhomlyn commented 6 months ago

@kangadrewie, thanks for taking care of that!