Closed. fabio-viana closed this issue 1 year ago.
Same here.
My Fluentd pod uses a custom IAM Role for Service Accounts (IRSA). This role's maxSessionDuration is currently set to 1h:
apiVersion: ...
kind: IRSA
metadata:
  name: fluentd-os-test
  namespace: fluent-system
  annotations:
    XXXX: managed
spec:
  serviceAccount: fluentd
  path: ${IRSA_ROLE_PATH:=/XXX/}
  # increasing this to sync with the fluent-plugin-opensearch latest update: https://github.com/fluent/fluent-plugin-opensearch/pull/78/files
  # it set the default fluentd session duration to 5 hours
  # our default maxSessionDuration was 1 hour, now it is 5 hours
  maxSessionDuration: 3600 # 1 hour
  inlinePolicy: |
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "...",
          "Action": "...",
          "Resource": "..."
        }
      ]
    }
And the Fluentd ClusterOutput sets refresh_credentials_interval to 48m:
apiVersion: fluentd.fluent.io/v1alpha1
kind: ClusterOutput
metadata:
  name: opensearch
  labels:
    output.fluentd.fluent.io/enabled: "true"
    output.fluentd.fluent.io/tenant: "core"
spec:
  outputs:
    - customPlugin:
        config: |
          <match **>
            @type copy
            <store>
              @type opensearch
              host "${FLUENT_OPENSEARCH_HOST}"
              port 443
              logstash_format true
              logstash_prefix logs-core
              scheme https
              log_os_400_reason true
              @log_level ${FLUENTD_OUTPUT_LOGLEVEL:=error}
              <buffer>
                @type ${FLUENTD_BUFFER_TYPE:=memory}
                path ${FLUENTD_BUFFER_PATH:=/buffers/opensearch/raas-core}
                flush_mode ${FLUENTD_BUFFER_FLUSH_MODE:=interval}
                flush_interval ${FLUENTD_BUFFER_FLUSH_INTERVAL:=60s}
                flush_thread_count ${FLUENTD_BUFFER_FLUSH_THREAD_COUNT:=2}
                flush_at_shutdown ${FLUENTD_BUFFER_FLUSH_AT_SHUTDOWN:=true}
                retry_type ${FLUENTD_BUFFER_RETRY_TYPE:=exponential_backoff}
                retry_max_times ${FLUENTD_BUFFER_RETRY_MAX_TIMES:=10}
                retry_wait ${FLUENTD_BUFFER_RETRY_WAIT:=1s}
                retry_max_interval ${FLUENTD_BUFFER_RETRY_MAX_INTERVAL:=60s}
                chunk_limit_size ${FLUENTD_BUFFER_CHUNK_LIMIT_SIZE:=8M}
                total_limit_size ${FLUENTD_BUFFER_TOTAL_LIMIT_SIZE:=512MB}
                overflow_action ${FLUENTD_BUFFER_OVERFLOW_ACTION:=throw_exception}
                compress ${FLUENTD_BUFFER_COMPRESS:=text}
              </buffer>
              <endpoint>
                url "https://${FLUENT_OPENSEARCH_HOST}"
                region "${FLUENT_OPENSEARCH_REGION}"
                assume_role_arn "#{ENV['AWS_ROLE_ARN']}"
                assume_role_web_identity_token_file "#{ENV['AWS_WEB_IDENTITY_TOKEN_FILE']}"
                refresh_credentials_interval 48m
              </endpoint>
            </store>
          </match>
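For anyone comparing the two settings above, here is a rough Python sketch that converts a duration string such as 48m into seconds and checks whether the refresh would happen before a session of the role's maximum length expires. The helper names (to_seconds, refresh_fits_session) are made up for illustration and are not part of Fluentd or the plugin. The sketch only handles a whole number followed by s, m, h or d.

```python
import re

# Seconds per unit for the suffixes handled by this sketch.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value: str) -> int:
    """Convert a duration such as '48m' or '5h' to whole seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS[unit]


def refresh_fits_session(refresh_interval: str, max_session_seconds: int) -> bool:
    """True when the refresh happens before a maximum-length session expires."""
    return to_seconds(refresh_interval) < max_session_seconds


print(to_seconds("48m"))                  # 2880
print(refresh_fits_session("48m", 3600))  # True
print(refresh_fits_session("5h", 3600))   # False
```

By this check, 48m against a 1h role looks fine on paper, so the interval value by itself does not explain the failure.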
I have also tried setting both the IAM role's maxSessionDuration and refresh_credentials_interval to 5h, the default value.
It worked for a few minutes, then went back to the same problem. It's been more than 24h without indexing logs.
Some fluentd pods are logging this:
2023-06-30 12:54:35 +0000 [error]: #0 Hit limit for retries. dropping all chunks in the buffer queue. retry_times=10 records=35 error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"XXXX\", :port=>443, :scheme=>\"https\"}): [403] {\"message\":\"The security token included in the request is expired\"}"
2023-06-30 12:54:35 +0000 [error]: #0 suppressed same stacktrace
2023-06-30 12:59:40 +0000 [error]: #0 Hit limit for retries. dropping all chunks in the buffer queue. retry_times=10 records=635 error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"XXXXX\", :port=>443, :scheme=>\"https\"}): [403] {\"message\":\"The security token included in the request is expired\"}"
2023-06-30 12:59:40 +0000 [error]: #0 suppressed same stacktrace
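To get a feel for how much data is being lost, a small Python sketch like the following can total the records= field of the "Hit limit for retries" lines. The regular expression is only an assumption based on the two log lines quoted above, and the sample lines are abbreviated copies of them.

```python
import re

# Assumed pattern, derived from the log lines quoted in this comment.
DROP_RE = re.compile(r"Hit limit for retries\..*?retry_times=(\d+) records=(\d+)")


def dropped_records(lines):
    """Sum the records= value over every 'Hit limit for retries' line."""
    total = 0
    for line in lines:
        match = DROP_RE.search(line)
        if match:
            total += int(match.group(2))
    return total


sample = [
    "2023-06-30 12:54:35 +0000 [error]: #0 Hit limit for retries. dropping all "
    "chunks in the buffer queue. retry_times=10 records=35 error_class=...",
    "2023-06-30 12:54:35 +0000 [error]: #0 suppressed same stacktrace",
    "2023-06-30 12:59:40 +0000 [error]: #0 Hit limit for retries. dropping all "
    "chunks in the buffer queue. retry_times=10 records=635 error_class=...",
]

print(dropped_records(sample))  # 670
```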
And some pods are logging this:
level=error msg="Fluentd exited" error="exit status 1"
level=info msg=backoff delay=0s
level=info msg="backoff timer done" actual=28.33µs expected=0s
level=info msg="Fluentd started"
2023-06-30 12:55:12 +0000 [info]: init supervisor logger path=nil rotate_age=nil rotate_size=nil
/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/seahorse/client/plugins/raise_response_errors.rb:17:in `call': The requested DurationSeconds exceeds the MaxSessionDuration set for this role. (Aws::STS::Errors::ValidationError)
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/aws-sdk-core/plugins/checksum_algorithm.rb:111:in `call'
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:16:in `call'
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/aws-sdk-core/plugins/idempotency_token.rb:19:in `call'
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/aws-sdk-core/plugins/param_converter.rb:26:in `call'
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/seahorse/client/plugins/request_callback.rb:71:in `call'
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/aws-sdk-core/plugins/response_paging.rb:12:in `call'
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/seahorse/client/plugins/response_target.rb:24:in `call'
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/seahorse/client/request.rb:72:in `send_request'
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/aws-sdk-sts/client.rb:1575:in `assume_role_with_web_identity'
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/aws-sdk-core/assume_role_web_identity_credentials.rb:76:in `refresh'
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/aws-sdk-core/refreshing_credentials.rb:30:in `initialize'
from /usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.175.0/lib/aws-sdk-core/assume_role_web_identity_credentials.rb:64:in `initialize'
from /usr/lib/ruby/gems/3.1.0/gems/fluent-plugin-opensearch-1.1.1/lib/fluent/plugin/out_opensearch.rb:249:in `new'
from /usr/lib/ruby/gems/3.1.0/gems/fluent-plugin-opensearch-1.1.1/lib/fluent/plugin/out_opensearch.rb:249:in `aws_credentials'
from /usr/lib/ruby/gems/3.1.0/gems/fluent-plugin-opensearch-1.1.1/lib/fluent/plugin/out_opensearch.rb:351:in `configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin.rb:187:in `configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin/multi_output.rb:110:in `block in configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin/multi_output.rb:99:in `each'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin/multi_output.rb:99:in `configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin/out_copy.rb:39:in `configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin.rb:187:in `configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/agent.rb:132:in `add_match'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/agent.rb:74:in `block in configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/agent.rb:64:in `each'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/agent.rb:64:in `configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/label.rb:31:in `configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/root_agent.rb:146:in `block in configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/root_agent.rb:146:in `each'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/root_agent.rb:146:in `configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/engine.rb:105:in `configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/engine.rb:80:in `run_configure'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/supervisor.rb:731:in `run_supervisor'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/lib/fluent/command/fluentd.rb:350:in `<top (required)>'
from <internal:/usr/lib/ruby/3.1.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
from <internal:/usr/lib/ruby/3.1.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
from /usr/lib/ruby/gems/3.1.0/gems/fluentd-1.15.3/bin/fluentd:15:in `<top (required)>'
from /usr/bin/fluentd:25:in `load'
from /usr/bin/fluentd:25:in `<main>'
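As I understand the ValidationError above, STS rejects an AssumeRoleWithWebIdentity call whose DurationSeconds is larger than the role's MaxSessionDuration, and it does not accept anything below 900 seconds. The Python sketch below mirrors that rule locally. It is not the plugin's or the SDK's code, check_duration is a made-up name, and the 5 hour figure is taken from the comment in my IRSA manifest rather than from anything I verified in the plugin.

```python
STS_MIN_DURATION = 900  # 15 minutes, the lowest DurationSeconds STS accepts


def check_duration(requested_seconds: int, max_session_seconds: int) -> str:
    """Classify a requested session length against the role's limit."""
    if requested_seconds < STS_MIN_DURATION:
        return "too short"
    if requested_seconds > max_session_seconds:
        return "exceeds MaxSessionDuration"
    return "ok"


# A 5h request against a role capped at 1h, as in the crash above.
print(check_duration(18000, 3600))   # exceeds MaxSessionDuration
# A request equal to the role's limit is accepted.
print(check_duration(3600, 3600))    # ok
# A 5h request against a role raised to 5h.
print(check_duration(18000, 18000))  # ok
```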
Since the fluent-operator doesn't pin the plugin's version (https://github.com/fluent/fluent-operator/blob/master/cmd/fluent-watcher/fluentd/base/Dockerfile#L43), I can't even roll back the plugin to the previous version that worked. We are locked in to v1.1.1, and the whole logging system is affected.
I'm experiencing the same issue mentioned above, and it's significantly impacting my environments. It's crucial that this bug gets resolved as quickly as possible since it's directly affecting the project quality.
Hi, thanks for your reports. I reverted the behavior of passing the duration seconds in v1.1.3.
Thank you, the problem is fixed with the new release.
Expected Behavior or What you need to ask
When using the refresh_credentials_interval configuration option, the specified value does not take effect in the underlying AWS SDK. As a result, this error is consistently encountered: The requested DurationSeconds exceeds the MaxSessionDuration set for this role (Aws::STS::Errors::ValidationError)
Additional Information
Reverting to the previous version of fluent-plugin-opensearch resolves the issue. The role used has a maximum session duration of 1 hour. Various refresh_credentials_interval values (e.g., "15m", "30m"), including the minimum allowed, were tested without success.
Complete error log message:
Using Fluentd and OpenSearch plugin versions