fluent-plugins-nursery / fluent-plugin-cloudwatch-logs

CloudWatch Logs Plugin for Fluentd
MIT License

DescribeLogStreams being called too much? #192

Closed PenelopeFudd closed 4 years ago

PenelopeFudd commented 4 years ago

Problem

Our three K8s clusters are calling DescribeLogStreams so frequently (~15,000 calls per hour) that the AWS console shows 'Rate exceeded' errors when we try to make the same call.

Of those calls, 66% come from fluentd and this plugin, which are writing to just 51 unique log group/stream combinations.

Steps to replicate

    <source>
      @type forward
      port 24321
      bind 0.0.0.0
      @label @containers
    </source>

    <label @containers>
      <filter **>
        @type kubernetes_metadata
        @id filter_kube_metadata
      </filter>

      <filter **>
        @type record_transformer
        @id filter_containers_stream_transformer
        enable_ruby true
        <record>
          log_stream_name ${"#{record.fetch("kubernetes", Hash.new).fetch("pod_name", "unknown_pod")}-#{record.fetch("kubernetes", Hash.new).fetch("container_name", "container")}"}
          log_group_name ${"/aws/eks/#{ENV.fetch('CLUSTER_ENVIRONMENT')}/#{ENV.fetch('CLUSTER_NAME')}/#{record.fetch("kubernetes", Hash.new).fetch("namespace_name", "unknown_container")}"}
        </record>
      </filter>

      <match **>
        @type relabel
        @label @NORMAL
      </match>
    </label>

    <label @NORMAL>
      <match containers.**>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_containers
        region "#{ENV.fetch('REGION')}"
        log_group_name_key log_group_name
        log_stream_name_key log_stream_name
        remove_log_group_name_key true
        remove_log_stream_name_key true
        auto_create_stream true
        message_keys log
        retention_in_days 60
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>
      <match **>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_all
        region "#{ENV.fetch('REGION')}"
        log_group_name_key log_group_name
        log_stream_name_key log_stream_name
        remove_log_group_name_key true
        remove_log_stream_name_key true
        auto_create_stream true
        retention_in_days 60
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>
    </label>

We're using the fluent/fluentd-kubernetes-daemonset:v1.10.4-debian-cloudwatch-1.0 Docker image, found here: https://github.com/fluent/fluentd-kubernetes-daemonset

Expected Behavior or What you need to ask

I'd expect the plugin not to call describe_log_streams nearly this often.

The problem seems to be here: https://github.com/fluent-plugins-nursery/fluent-plugin-cloudwatch-logs/blob/b500459107d9fd1507def77614178383b5cc0d58/lib/fluent/plugin/out_cloudwatch_logs.rb#L378-L388

Which calls here: https://github.com/fluent-plugins-nursery/fluent-plugin-cloudwatch-logs/blob/b500459107d9fd1507def77614178383b5cc0d58/lib/fluent/plugin/out_cloudwatch_logs.rb#L499
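
For context, that path looks up the stream's current upload sequence token with describe_log_streams before every put_log_events call. A minimal sketch of that lookup pattern (not the plugin's exact code; the group and stream names are placeholders):

    require 'aws-sdk-cloudwatchlogs'

    client = Aws::CloudWatchLogs::Client.new(region: ENV.fetch('REGION'))

    group  = '/aws/eks/example-env/example-cluster/example-namespace'  # placeholder
    stream = 'example-pod-example-container'                           # placeholder

    # One extra API call per flush -- this is what drives the
    # DescribeLogStreams volume described above.
    resp = client.describe_log_streams(
      log_group_name: group,
      log_stream_name_prefix: stream
    )
    matched = resp.log_streams.find { |s| s.log_stream_name == stream }
    sequence_token = matched && matched.upload_sequence_token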

AWS Support informed us that when put_log_events fails with an invalid sequence token, the expected sequenceToken is returned in the error message, so describe_log_streams doesn't need to be called at all: https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/CloudWatchLogs/Client.html#put_log_events-instance_method
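
A sketch of that suggested approach, assuming the SDK's modeled InvalidSequenceTokenException exposes the expected token (the helper and names below are illustrative, not the plugin's code):

    require 'aws-sdk-cloudwatchlogs'

    def put_events(client, group, stream, events, token = nil)
      args = { log_group_name: group, log_stream_name: stream, log_events: events }
      args[:sequence_token] = token if token
      client.put_log_events(args)
    rescue Aws::CloudWatchLogs::Errors::InvalidSequenceTokenException => e
      # The service reports the token it expected, so the retry can go straight
      # back to put_log_events without a DescribeLogStreams round trip.
      # If the generated accessor isn't available, the token can also be parsed
      # out of e.message ("... The next expected sequenceToken is: <token>").
      expected = e.expected_sequence_token
      client.put_log_events(args.merge(sequence_token: expected))
    end

Each entry in events is a { timestamp: (Time.now.to_f * 1000).to_i, message: '...' } hash. put_log_events also returns next_sequence_token on success, so the token only needs to be re-discovered after an error.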

Using Fluentd and CloudWatchLogs plugin versions

cosmo0920 commented 4 years ago

Thanks for the detailed information and the suggested improvement. I've registered a PR to fix this: #194.