Improve InvalidSequenceTokenException Being Logged Frequently

Problem

I know there are many other tickets about this, but they either accept the log messages and metrics as they are or are very old.

We are getting a lot of InvalidSequenceTokenException errors in our logs, and our Prometheus metrics are filled with these errors as well. I believe the retries that are in place recover from them, but is there a better way to work around this?

We use flush_thread_count 4 in our buffer configuration and run two replicas of this Fluentd configuration. It seems the only way to fix this is to run a single replica with a single flush thread, but then the setup would not be redundant 😞

Lastly, what is the difference between the plugin's concurrency parameter and the buffer's flush_thread_count parameter? They sound like they do the same thing, but maybe I should be using one over the other?

Steps to replicate


# The input is Kubernetes Pod logs.
# These are aggregated by a fluent-bit instance running on each
# Kubernetes node, and then forwarded to central processing,
# which includes this configuration snippet

# NOTE: I have excluded prometheus and other non-essential pieces of the config

<source>
  @type forward
  port 24284
  bind 0.0.0.0
  tag pod.source
  @label @POD_SOURCE
</source>

<label @POD_SOURCE>
  <filter **>
    @type record_transformer
    enable_ruby true
    <record>
      namespace ${record["kubernetes"]["namespace_name"]}
      pod ${record["kubernetes"]["pod_name"]}
    </record>
  </filter>
  <match **>
    @type rewrite_tag_filter
    <rule>
      key     namespace
      pattern /(.+)/
      tag     $1
    </rule>
    @label @POD_STEP2
  </match>
</label>

<label @POD_STEP2>
  <match **>
    @type rewrite_tag_filter
    <rule>
      key     pod
      pattern /(.+)/
      tag     ${tag}_$1
    </rule>
    @label @POD_OUTPUT
  </match>
</label>

<label @POD_OUTPUT>
  <match **>
    @type copy
    <store>
      @type s3
      s3_bucket foobar
      s3_region us-east-1
      s3_object_key_format "#{ENV['ENVIRONMENT']}/eks-pod-logs/%Y-%m-%d/${tag}/%H_%{index}_%{uuid_flush}.%{file_extension}"
      <format>
        @type json
      </format>
      <buffer tag,time>
        timekey 1h
        @type memory
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 4
        flush_interval 5s
        retry_forever false
        retry_max_interval 30
        chunk_limit_size 8MB
        chunk_full_threshold 0.90
        overflow_action throw_exception
        compress gzip
      </buffer>
    </store>
    <store>
      @type cloudwatch_logs
      log_group_name /infra/logs/eks/pods/stage
      log_stream_name %Y-%m-%d-%H-${tag}
      auto_create_stream true
      region us-east-1
      <buffer tag, time>
        timekey 1m
        @type memory
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 4
        flush_interval 5s
        retry_forever false
        retry_max_interval 30
        chunk_limit_size 8MB
        chunk_full_threshold 0.90
        overflow_action throw_exception
        compress gzip
      </buffer>
    </store>
  </match>
</label>

Expected Behavior or What you need to ask

The s3 output works fine, but the cloudwatch_logs output produces a lot of errors about the sequence token being out of sequence. This might come down to how AWS implements its services, but I feel there must be a better way than retrying repeatedly and filling the logs and metrics with errors. Or maybe that would require more work than it is worth? 🤷 Any suggestions are welcome.
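
One possible mitigation, sketched here as an untested assumption rather than a confirmed fix: InvalidSequenceTokenException is returned when a writer sends events with a sequence token that another writer to the same log stream has already advanced past, so giving each replica its own stream should avoid contention between replicas. The fragment below reuses the cloudwatch_logs store from the configuration above and changes two things. The stream name gets a per-replica suffix via embedded Ruby, which assumes each replica has a unique hostname (normally the Pod name in Kubernetes). flush_thread_count is lowered to 1 so that chunks mapping to the same stream are not flushed in parallel within one process. Both the suffix and the thread count are illustrative choices, not recommendations taken from the plugin documentation.

```
<store>
  @type cloudwatch_logs
  log_group_name /infra/logs/eks/pods/stage
  # Assumption: Socket.gethostname differs per replica (e.g. the Pod name),
  # so each replica writes to its own log stream.
  log_stream_name "%Y-%m-%d-%H-${tag}-#{Socket.gethostname}"
  auto_create_stream true
  region us-east-1
  <buffer tag,time>
    @type memory
    timekey 1m
    flush_mode interval
    flush_interval 5s
    # Assumption: a single flush thread avoids two chunks for the same
    # stream being sent at the same time from this process.
    flush_thread_count 1
  </buffer>
</store>
```

This would keep both replicas running, at the cost of one log stream per replica for every tag and hour, and lower flush throughput per process.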

Using Fluentd and CloudWatchLogs plugin versions

OS version: Docker image fluentd:v1.9.1-1.0
Bare Metal or within Docker or Kubernetes or others? within Kubernetes in AWS EKS
Fluentd v0.12 or v0.14/v1.0
- paste result of fluentd --version or td-agent --version => fluentd 1.9.1
Dependent gem versions
- paste boot log of fluentd or td-agent
- paste result of fluent-gem list, td-agent-gem list or your Gemfile.lock

/ $ fluent-gem list

*** LOCAL GEMS ***

async (1.25.0)
async-http (0.50.0)
async-io (1.27.7)
async-pool (0.3.0)
aws-eventstream (1.1.0)
aws-partitions (1.366.0)
aws-sdk-cloudwatchlogs (1.36.0)
aws-sdk-core (3.105.0)
aws-sdk-kms (1.37.0)
aws-sdk-s3 (1.79.1)
aws-sdk-sqs (1.32.0)
aws-sigv4 (1.2.2)
bigdecimal (1.4.4)
cmath (default: 1.0.0)
concurrent-ruby (1.1.6)
console (1.8.2)
cool.io (1.6.0)
csv (default: 1.0.0)
date (default: 1.0.0)
etc (default: 1.0.0)
ext_monitor (0.1.2)
fcntl (default: 1.0.0)
fileutils (default: 1.0.2)
fluent-config-regexp-type (1.0.0)
fluent-plugin-cloudwatch-logs (0.10.2)
fluent-plugin-prometheus (1.8.3)
fluent-plugin-rewrite-tag-filter (2.3.0)
fluent-plugin-s3 (1.4.0)
fluentd (1.9.1)
http_parser.rb (0.6.0)
ipaddr (default: 1.2.0)
jmespath (1.4.0)
json (2.3.0)
msgpack (1.3.3)
nio4r (2.5.2)
oj (3.8.1)
openssl (default: 2.1.2)
prometheus-client (0.9.0)
protocol-hpack (1.4.2)
protocol-http (0.13.1)
protocol-http1 (0.10.2)
protocol-http2 (0.10.4)
psych (default: 3.0.2)
quantile (0.2.1)
scanf (default: 1.0.0)
serverengine (2.2.1)
sigdump (0.2.4)
stringio (default: 0.0.1)
strptime (0.2.3)
strscan (default: 1.0.0)
timers (4.3.0)
tzinfo (2.0.2)
tzinfo-data (1.2019.3)
webrick (default: 1.4.2)
yajl-ruby (1.4.1)
zlib (default: 1.0.0)
