fluent / fluent-plugin-opensearch

OpenSearch Plugin for Fluentd
Apache License 2.0

If fluent-plugin-opensearch fails to refresh `@_aws_credentials` once, it never refreshes `@_aws_credentials` again #129

Closed aYukiSekiguchi closed 1 month ago

aYukiSekiguchi commented 9 months ago


Steps to replicate

There are no reliable steps to replicate.

When the plugin fails to refresh @_aws_credentials, it logs an error like the following:

2024-02-23 22:16:07 +0000 [error]: #0 Unexpected error raised. Stopping the timer. title=:out_opensearch_expire_credentials error_class=RuntimeError error="No valid AWS credentials found."
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:252:in `aws_credentials'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:353:in `block (2 levels) in configure'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:351:in `synchronize'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluent-plugin-opensearch-1.1.4/lib/fluent/plugin/out_opensearch.rb:351:in `block in configure'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/timer.rb:80:in `on_timer'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run_once'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/cool.io-1.8.0/lib/cool.io/loop.rb:88:in `run'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
  2024-02-23 22:16:07 +0000 [error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'

It then stops refreshing and logs the following:

2024-02-23 22:16:07 +0000 [error]: #0 Timer detached. title=:out_opensearch_expire_credentials

As a result, later buffer flushes fail with the error message "The security token included in the request is expired".

FYI: The following is my config, but I don't think this depends on config.

<match apiserver>
  @type copy
  <store>
    @type s3
    <!-- skip -->
  </store>
  <store>
    @type opensearch
    bulk_message_request_threshold 6m
    request_timeout 90s
    resurrect_after 5s
    reload_connections false
    logstash_format true
    logstash_prefix apiserver
    logstash_dateformat %Y.%m.%d
    suppress_type_name true
    time_key time
    include_tag_key true
    tag_key @tag
    id_key _hash
    remove_keys _hash
    <buffer>
      @type file
      path /var/log/fluent/buffer/os/apiserver
      chunk_limit_size 60m
      flush_mode interval
      flush_interval 10s
      flush_at_shutdown true
    </buffer>
    <endpoint>
      url <URL to AWS OpenSearch Service>
      region ap-northeast-1
    </endpoint>
  </store>
</match>

Expected Behavior or What you need to ask

I'm not sure whether this is a bug, but I want fluent-plugin-opensearch to try refreshing @_aws_credentials again at the next refresh_credentials_interval. I guess AssumeRoleCredentials.new() fails when the network is unstable. When this happens, fluent-plugin-opensearch stops sending logs. I'm not happy with this.

The reason fluent-plugin-opensearch stops refreshing @_aws_credentials is that timer_execute() removes the timer if its block raises an exception. https://github.com/fluent/fluentd/blob/2b4ca5d2927b706c3bdc98ffd0a0b66232bc0b65/lib/fluent/plugin_helper/timer.rb#L84-L85
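
To illustrate the failure mode described above, here is a minimal Ruby sketch. It does not use Fluentd: `ToyTimer` and `FlakyFetcher` are hypothetical stand-ins that only mimic "a timer whose block raised is detached" and "a credential fetch that sometimes fails". It is not the plugin's code and not necessarily how the eventual fix works. It only shows that rescuing the error inside the timer block lets the next tick retry.

```ruby
# Toy stand-in for a timer helper that detaches a timer once its block raises.
# This is NOT Fluentd's implementation; it only mimics the behavior described above.
class ToyTimer
  attr_reader :detached

  def initialize(&block)
    @block = block
    @detached = false
  end

  # Simulates one timer tick.
  def tick
    return if @detached
    @block.call
  rescue StandardError
    @detached = true # after this, the block never runs again
  end
end

# Hypothetical credential fetcher: fails on the 2nd call, succeeds otherwise.
class FlakyFetcher
  def initialize
    @calls = 0
  end

  def fetch
    @calls += 1
    raise "No valid AWS credentials found." if @calls == 2
    "token-#{@calls}"
  end
end

# Runs three ticks and returns [last credentials, whether the timer was detached].
def run_ticks(rescue_inside:)
  fetcher = FlakyFetcher.new
  credentials = nil
  timer = ToyTimer.new do
    if rescue_inside
      begin
        credentials = fetcher.fetch
      rescue StandardError => e
        # Keep the previous credentials and let the next tick retry.
        warn "refresh failed, will retry: #{e.message}"
      end
    else
      credentials = fetcher.fetch
    end
  end
  3.times { timer.tick }
  [credentials, timer.detached]
end

UNGUARDED = run_ticks(rescue_inside: false)
GUARDED = run_ticks(rescue_inside: true)
```

In the unguarded run the timer is detached on the second tick and the credentials stay at the first token. In the guarded run the third tick still fires and fetches a fresh token.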

Using Fluentd and OpenSearch plugin versions

aYukiSekiguchi commented 9 months ago

We have been running 6 instances with this plugin for about 1 month, and 3 of the 6 have hit this bug. So this isn't a rare problem.

davidpsv17 commented 6 months ago

The same thing is happening to me with the same plugin version.

akhil31415 commented 6 months ago

@ashie san, Could you please confirm if there's any update for this issue?

ntopee commented 3 months ago

This is similar to #110; we are experiencing the same issue. In our case, connections to STS for the AWS token occasionally time out in some regions. That raises the error that stops the timer, and the only way to recover is to manually restart the pods.
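
One way to absorb an occasional timeout like this is to retry the fetch a bounded number of times before giving up. The sketch below is only an illustration under assumptions: `with_retries` is a hypothetical helper, the STS call is simulated by a block that raises `Timeout::Error`, and none of this is taken from the plugin or from the patch discussed in this thread.

```ruby
require "timeout"

# Hypothetical helper: runs the block, retrying on StandardError with
# exponential backoff, and re-raises once max_attempts is reached.
def with_retries(max_attempts: 3, base_delay: 0.01)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    raise if attempt >= max_attempts
    sleep(base_delay * (2**(attempt - 1))) # 0.01s, 0.02s, ...
    retry
  end
end

# Simulated credential fetch: times out on the first two attempts.
RESULT = with_retries(max_attempts: 3) do |n|
  raise Timeout::Error, "simulated STS timeout" if n < 3
  "credentials after #{n} attempts"
end
```

Bounding the attempts matters here: if the retries are exhausted, the error still propagates, so a caller would still need to rescue it to keep a periodic timer alive.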

aYukiSekiguchi commented 3 months ago

FYI: My quick and dirty fix https://github.com/aYukiSekiguchi/fluent-plugin-opensearch/commits/dont_stop_refresh_aws_credentials/

You can build and install it as follows:

$ fluent-gem build fluent-plugin-opensearch.gemspec
$ sudo fluent-gem install fluent-plugin-opensearch

cosmo0920 commented 2 months ago

Hi @aYukiSekiguchi, could you send your patch as a PR? It looks like a good workaround to mitigate this issue.

aYukiSekiguchi commented 2 months ago

Sure. I created a PR: https://github.com/fluent/fluent-plugin-opensearch/pull/142

cosmo0920 commented 1 month ago

This should be fixed in #142.