aws-observability / aws-otel-collector

AWS Distro for OpenTelemetry Collector (see ADOT Roadmap at https://github.com/orgs/aws-observability/projects/4)
https://aws-otel.github.io/

log_retention for the awsemf doesn't appear to be working #1768

Closed · lorelei-rupp-imprivata closed this 1 year ago

lorelei-rupp-imprivata commented 1 year ago

Describe the bug: This was fixed by https://github.com/aws-observability/aws-otel-collector/issues/991, but I am not seeing it actually work when trying to implement it.

Steps to reproduce: Rolled out 0.25.0 of the collector and updated my config map to have something like:

apiVersion: v1
data:
  collector.yaml: |
    exporters:
      awsemf:
        dimension_rollup_option: NoDimensionRollup
        log_group_name: /aws/containerinsights/{ClusterName}/performance
        log_retention: 7
        log_stream_name: '{NodeName}'
......

Rolled this out, then manually deleted the log group in CloudWatch. It was recreated, but no log retention was set; it still says Never Expire.
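
(For reference, retention can also be checked from the CLI; a group with no retention policy reports no retentionInDays:)

# Never Expire shows up as a missing/null retentionInDays
aws logs describe-log-groups \
  --log-group-name-prefix /aws/containerinsights \
  --query 'logGroups[].{name:logGroupName,retention:retentionInDays}'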

What did you expect to see? Log retention would be set on the new log group that was created

I can provide more details if necessary. Maybe I am not implementing this properly.

@bryan-aguilar @humivo

We are using the DaemonSet to roll this out with the ADOT operator. I bumped the image up to 'amazon/aws-otel-collector:v0.25.0'.

bryan-aguilar commented 1 year ago

I was not able to replicate this on my local system using this config.

exporters:
  awsxray:
    region: us-west-2
    local_mode: true
    no_verify_ssl: true
  awsemf:
    region: us-west-2
    dimension_rollup_option: NoDimensionRollup
    log_group_name: /aws/containerinsights/testtest/performance
    log_retention: 7
    log_stream_name: 'testNodeStreamName'
  logging:
    loglevel: debug

[screenshot: log group showing the expected retention setting]

I'm wondering if the fact that it's deployed through a daemonset has something to do with it. @lorelei-rupp-imprivata could you try reproducing this with a minimal config and daemonset? If you can reproduce it again, can you share the daemonset manifest? I also wonder if something is getting messed up with the {ClusterName} substitution. I'm not familiar with this code and will loop @humivo in. Could you also try a fixed log group and log stream name for testing, @lorelei-rupp-imprivata?

lorelei-rupp-imprivata commented 1 year ago

Let me try a fixed log_stream_name and a fixed log_group_name.

I don't think our config is that advanced either. I'm happy to provide our config map if you want it.

lorelei-rupp-imprivata commented 1 year ago

I tested with fixed names and it still did not work:

    exporters:
        awsemf:
          namespace: ContainerInsights
          log_group_name: /aws/containerinsights/testtest/performance
          log_retention: 7
          log_stream_name: 'testNodeStreamName'
          resource_to_telemetry_conversion:
            enabled: true
          dimension_rollup_option: NoDimensionRollup
          parse_json_encoded_attr_values: [Sources, kubernetes]
          metric_declarations:
.....

bryan-aguilar commented 1 year ago

Is it possible that another application is writing to/creating this log group before the collector does? I believe the default behavior is that if the log group already exists it won't overwrite the retention settings.
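
To illustrate what I mean, here is a rough sketch of that pattern using the AWS SDK for Go v2 (illustrative only, not the actual exporter source):

// Illustrative sketch of the usual "create, then set retention" pattern.
package logsetup

import (
	"context"
	"errors"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs/types"
)

// ensureLogGroup applies retention only when it creates the group itself;
// a pre-existing group keeps whatever retention it already has.
func ensureLogGroup(ctx context.Context, client *cloudwatchlogs.Client, name string, retentionDays int32) error {
	_, err := client.CreateLogGroup(ctx, &cloudwatchlogs.CreateLogGroupInput{
		LogGroupName: aws.String(name),
	})
	var exists *types.ResourceAlreadyExistsException
	if errors.As(err, &exists) {
		// Group already exists: its retention (e.g. Never Expire) is left alone.
		return nil
	}
	if err != nil {
		return err
	}
	// Newly created group: set retention. Requires logs:PutRetentionPolicy.
	_, err = client.PutRetentionPolicy(ctx, &cloudwatchlogs.PutRetentionPolicyInput{
		LogGroupName:    aws.String(name),
		RetentionInDays: aws.Int32(retentionDays),
	})
	return err
}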

bryan-aguilar commented 1 year ago

Can you share the configmap for the collector?

lorelei-rupp-imprivata commented 1 year ago

@bryan-aguilar we don't create the log group ourselves. I am testing by manually deleting it, and then it is automatically recreated. I even tested changing the name in the configmap to prove it was that configmap recreating it for us.

Here is our configmap:

apiVersion: v1
data:
  collector.yaml: |
    exporters:
      awsemf:
        dimension_rollup_option: NoDimensionRollup
        log_group_name: /aws/containerinsights/testtest/performance
        log_retention: 7
        log_stream_name: testNodeStreamName
        metric_declarations:
        - dimensions:
          - - NodeName
            - InstanceId
            - ClusterName
          metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
        - dimensions:
          - - ClusterName
          metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
          - node_cpu_usage_total
          - node_cpu_limit
          - node_memory_working_set
          - node_memory_limit
        - dimensions:
          - - FullPodName
            - Namespace
            - ClusterName
          - - Service
            - FullPodName
            - Namespace
            - ClusterName
          - - Service
            - Namespace
            - ClusterName
          - - Namespace
            - ClusterName
          - - ClusterName
          metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
        - dimensions:
          - - Service
            - FullPodName
            - Namespace
            - ClusterName
          - - ClusterName
          metric_name_selectors:
          - pod_cpu_reserved_capacity
          - pod_memory_reserved_capacity
        - dimensions:
          - - Service
            - FullPodName
            - Namespace
            - ClusterName
          metric_name_selectors:
          - pod_number_of_container_restarts
        - dimensions:
          - - ClusterName
          metric_name_selectors:
          - cluster_node_count
          - cluster_failed_node_count
        - dimensions:
          - - Service
            - Namespace
            - ClusterName
          - - ClusterName
          metric_name_selectors:
          - service_number_of_running_pods
        - dimensions:
          - - NodeName
            - InstanceId
            - ClusterName
          - - ClusterName
          metric_name_selectors:
          - node_filesystem_utilization
        - dimensions:
          - - Namespace
            - ClusterName
          - - ClusterName
          metric_name_selectors:
          - namespace_number_of_running_pods
        namespace: ContainerInsights
        parse_json_encoded_attr_values:
        - Sources
        - kubernetes
        region: us-east-2
        resource_to_telemetry_conversion:
          enabled: true
      logging:
        loglevel: debug
      prometheusremotewrite:
        auth:
          authenticator: sigv4auth
        endpoint: xxxxx
        external_labels:
          astra_environment: xxxxx
          cluster_name: xxxxxeks
    extensions:
      health_check: {}
      memory_ballast:
        size_mib: 450
      pprof:
        endpoint: ${MY_POD_IP}:1777
      sigv4auth:
        assume_role:
          sts_region: us-east-2
      zpages:
        endpoint: ${MY_POD_IP}:55679
    processors:
      memory_limiter:
        check_interval: 2s
        limit_mib: 900
        spike_limit_mib: 100
      resource:
        attributes:
        - action: insert
          from_attribute: job
          key: TaskId
        - action: insert
          key: receiver
          value: prometheus
      resourcedetection/ec2:
        detectors:
        - env
        override: false
        timeout: 2s
    receivers:
      awscontainerinsightreceiver:
        add_full_pod_name_metric_label: true
        prefer_full_pod_name: true
      otlp:
        protocols:
          grpc:
            endpoint: ${MY_NODE_IP}:4317
      prometheus:
        config:
          global:
            scrape_interval: 1m
            scrape_timeout: 10s
          scrape_configs:
          - job_name: kube-state-metrics
            static_configs:
            - targets:
              - kube-state-metrics.kube-system.svc.cluster.local:8080
          - job_name: cluster-autoscaler
            static_configs:
            - targets:
              - cluster-autoscaler-aws-cluster-autoscaler.kube-system.svc.cluster.local:8085
          - job_name: eso
            kubernetes_sd_configs:
            - role: pod
            metrics_path: /metrics
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            - action: keep
              regex: external-secrets
              source_labels:
              - __meta_kubernetes_pod_container_name
    service:
      extensions:
      - health_check
      - zpages
      - pprof
      - memory_ballast
      - sigv4auth
      pipelines:
        metrics:
          exporters:
          - awsemf
          processors:
          - memory_limiter
          - resourcedetection/ec2
          receivers:
          - awscontainerinsightreceiver
        metrics/2:
          exporters:
          - prometheusremotewrite
          processors:
          - memory_limiter
          - resourcedetection/ec2
          - resource
          receivers:
          - prometheus
          - otlp
kind: ConfigMap

humivo commented 1 year ago

Thank you for the information. I am trying to reproduce the issue on my side right now and will let you know the results

lorelei-rupp-imprivata commented 1 year ago

Thanks! Happy to test out anything too on our end, just let me know

bryan-aguilar commented 1 year ago

Does the IAM role associated with the collector (either through IRSA or the node) have the requisite permissions, i.e. logs:PutRetentionPolicy?

lorelei-rupp-imprivata commented 1 year ago

> Does the IAM role associated with the collector (either through IRSA or the node) have the requisite permissions, i.e. logs:PutRetentionPolicy?

Oh, good thinking - let me go check on that! We definitely operate least-privilege here, so it's highly likely it doesn't have it.

lorelei-rupp-imprivata commented 1 year ago

Yep, this must be it. We set:


  statement {
    sid = "AllowADOTPolicy"
    actions = [
      "aps:RemoteWrite",
      "aps:GetSeries",
      "aps:GetLabels",
      "aps:GetMetricMetadata",
      "ec2:DescribeTags",
      "ec2:DescribeVolumes",
      "logs:PutLogEvents",
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:DescribeLogStreams",
      "logs:DescribeLogGroups"
    ]
    resources = ["*"]
  }
}
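
Presumably the fix is just adding logs:PutRetentionPolicy to that statement, something like:

  statement {
    sid = "AllowADOTPolicy"
    actions = [
      "aps:RemoteWrite",
      "aps:GetSeries",
      "aps:GetLabels",
      "aps:GetMetricMetadata",
      "ec2:DescribeTags",
      "ec2:DescribeVolumes",
      "logs:PutLogEvents",
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:DescribeLogStreams",
      "logs:DescribeLogGroups",
      "logs:PutRetentionPolicy"
    ]
    resources = ["*"]
  }
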
bryan-aguilar commented 1 year ago

That looks highly probable! Could you update the policy, retest, and let us know?

lorelei-rupp-imprivata commented 1 year ago

Yep, that was it, @bryan-aguilar! So maybe we should just update the docs for this? https://aws-otel.github.io/docs/setup/permissions

humivo commented 1 year ago

I can update the docs to add this policy in the permission setup guide.

bryan-aguilar commented 1 year ago

I have a PR in flight already.

bryan-aguilar commented 1 year ago

Doc update PR has been made and I'm going to close this as resolved.