domokap opened this issue 1 year ago
Here is the hex code for the corrupted version:
Your corrupted message looks like MessagePack binary to me. In particular, here is what it reads like:
5C 6E AA 40 74 69 6D 65 73 74 61 6D 70 BD 32 30 32 32 2D 31
\  n  |  @  t  i  m  e  s  t  a  m  p  |  2  0  2  2  -  1
      +- fixstr (10 char)              +- fixstr (29 char)
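For reference, those header bytes can be checked with the msgpack gem. This is only a minimal sketch: the tail of the timestamp (everything after the "2022-1" visible in the dump) is reconstructed from the corrupted S3 record quoted later in this thread, not taken from the original hex.

```ruby
require "msgpack"   # msgpack-ruby, the serializer Fluentd uses internally

# Fragment starting at the 0xAA byte of the dump above; the timestamp tail
# is an assumption filled in from the corrupted S3 record quoted further down.
hex  = "AA 40 74 69 6D 65 73 74 61 6D 70" \
       " BD 32 30 32 32 2D 31 30 2D 31 35 54 30 30 3A 31 36 3A 30 31" \
       " 2E 30 37 34 2B 30 30 3A 30 30"
data = [hex.delete(" ")].pack("H*")

MessagePack::Unpacker.new.feed_each(data) { |obj| p obj }
# "@timestamp"                      <- 0xAA: fixstr header, 10 chars
# "2022-10-15T00:16:01.074+00:00"   <- 0xBD: fixstr header, 29 chars
```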
This is really strange. I cannot think of any plausible code path that can leak raw msgpack binary in this manner. Can you post your entire configuration file here?
<source>
@type monitor_agent
bind 0.0.0.0
port 24220
</source>
<system>
workers 8
log_level info
</system>
# Beats protocol input (beats agents)
<source>
@type beats
metadata_as_tag
port 5044
bind 0.0.0.0
</source>
# Fluent protocol input
<source>
@type forward
tag fluentd
port 5170
bind 0.0.0.0
</source>
<source>
@type forward
port 5270
bind 0.0.0.0
<parse>
@type json
</parse>
<transport tls>
version TLSv1_2
cert_path /etc/fluentd/telemetry.nbcuott.com.pem
private_key_path /etc/fluentd/prod_telemetry.nbcuott.com.key
private_key_passphrase ""
</transport>
tag fluentd_gcp
</source>
<filter **>
@type record_transformer
enable_ruby true
<record>
cloud.region ${record.dig("cloud", "region") || record['cloud.region'] || 'us-west-2'}
cloud.provider ${record.dig("cloud", "provider") || record['cloud.provider'] || 'aws'}
service.name ${record.dig("service", "name") || record['service.name'] || record.dig("kubernetes", "labels", "app") || record.dig('kubernetes.labels.app') || '_unnamed_service'}
sky.top_tenant ${record.dig("sky", "top_tenant") || record['sky.top_tenant'] || 'softfail'}
agent.type ${record.dig("agent", "type") || record['agent.type'] || 'unknown'}
</record>
</filter>
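For context on the enable_ruby expressions above: each `<record>` entry is just a Ruby fallback chain over the incoming event, so a record missing every candidate key still gets a default. Here is a standalone sketch of the service.name line, with a made-up record (mytv-service is only an example value):

```ruby
# Standalone sketch (made-up record) of the service.name fallback chain above;
# record_transformer evaluates this kind of expression once per event.
record = { "kubernetes" => { "labels" => { "app" => "mytv-service" } } }

service_name =
  record.dig("service", "name") ||                # nested "service" map
  record["service.name"] ||                       # flat dotted key
  record.dig("kubernetes", "labels", "app") ||    # Kubernetes app label
  "_unnamed_service"                              # default when nothing matches

puts service_name   # => "mytv-service"
```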
# Output FluentD debug messages/logs to stdout (fluentd.log)
<label @FLUENT_LOG>
<match fluent.*>
@type stdout
@id fluentd_logs_stdout
</match>
</label>
<match {fluentd,filebeat,functionbeat,metricbeat,fluentd_gcp,heartbeat}>
@type filter_list
key_to_filter sky.top_tenant
pattern_file_paths /etc/fluentd/tenants_list.txt
filter_empty true
action whitelist
<retag>
add_prefix tenant_in_mono
</retag>
<retag_filtered>
add_prefix dedicated
</retag_filtered>
</match>
<match {dedicated.fluentd,dedicated.fluentd_gcp,dedicated.filebeat,dedicated.functionbeat,dedicated.metricbeat,dedicated.heartbeat}>
@type rewrite_tag_filter
@log_level info
<rule>
key sky.top_tenant
pattern sas-apm
tag sas-apm.${tag}
</rule>
<rule>
key sky.top_tenant
pattern darp
tag darp.${tag}
</rule>
<rule>
key sky.top_tenant
pattern stratos
tag stratos.${tag}
</rule>
<rule>
key sky.top_tenant
pattern njp
tag njp.${tag}
</rule>
<rule>
key sky.top_tenant
pattern blackbird
tag blackbird.${tag}
</rule>
<rule>
key sky.top_tenant
pattern lakitu
tag lakitu.${tag}
</rule>
<rule>
key sky.top_tenant
pattern loki
tag loki.${tag}
</rule>
<rule>
key sky.top_tenant
pattern kratos
tag kratos.${tag}
</rule>
<rule>
key sky.top_tenant
pattern olisipo
tag olisipo.${tag}
</rule>
<rule>
key sky.top_tenant
pattern mytv
tag mytv.${tag}
</rule>
<rule>
key sky.top_tenant
pattern /.+/
tag softfail_cluster.${tag}
</rule>
</match>
<match {*.dedicated.fluentd,*.dedicated.fluentd_gcp,tenant_in_mono.fluentd,tenant_in_mono.fluentd_gcp}>
@type rewrite_tag_filter
@log_level info
<rule>
key agent.type
pattern metricbeat
tag metricdocument.${tag}
</rule>
<rule>
key agent.type
pattern /.+/
tag logdocument.${tag}
</rule>
</match>
<match {logdocument.tenant_in_mono.fluentd,tenant_in_mono.filebeat,tenant_in_mono.functionbeat,logdocument.tenant_in_mono.fluentd_gcp}>
@type copy
@log_level info
<store>
@type elasticsearch
reconnect_on_error true
reload_on_failure true
reload_connections false
max_retry_putting_template 1
request_timeout 60s
fail_on_putting_template_retry_exceed false
slow_flush_log_threshold 100.0
@id out_es_logs-tenant_in_mono
@log_level info
log_es_400_reason true
id_key _hash
remove_keys _hash
hosts ********
user "elastic"
password "********"
ca_file "/etc/fluentd/ca-peacock.crt"
ssl_version TLSv1_2
ssl_verify false
index_name logs-${sky.top_tenant}-fluentd
time_key time
include_timestamp true
include_tag_key true
flatten_hashes false
flatten_hashes_separator _
# Rollover index config
rollover_index true
application_name default
index_date_pattern "now/d"
deflector_alias logs-${sky.top_tenant}-fluentd
# Index template
template_name logs-${sky.top_tenant}-fluentd
template_file /etc/fluentd/logs-template.json
customize_template {"<<TAG>>":"${sky.top_tenant}"}
template_overwrite true
<buffer tag,sky.top_tenant>
retry_wait 20s
retry_exponential_backoff_base 2
retry_type exponential_backoff
retry_max_interval 300s
disable_chunk_backup true
@type file
path /fluentd/es-out-logs-tenant_in_mono
flush_thread_count 8
flush_interval 5s
flush_at_shutdown true
overflow_action block
chunk_limit_size 16M
# total_limit_size is set to 70% of the data disk so that 1 single output can't use more than this
total_limit_size 69G
retry_forever false
</buffer>
</store>
<store ignore_error>
@type s3
@log_level info
s3_bucket "nbcu-topslogs-prod-us-west-2"
s3_region "us-west-2"
acl bucket-owner-full-control
store_as "json"
<format>
@type json
</format>
use_server_side_encryption "AES256"
path ${sky.top_tenant}/${service.name}/%Y/%m/%d/
# if you want to use ${tag} or %Y/%m/%d/ like syntax in path / s3_object_key_format,
# need to specify tag for ${tag} and time for %Y/%m/%d in <buffer> argument.
<buffer service.name,sky.top_tenant,time>
retry_wait 20s
retry_exponential_backoff_base 2
retry_type exponential_backoff
retry_max_interval 300s
disable_chunk_backup true
@type file
path /fluentd/s3-out-tenant_in_mono
timekey 30 # 30 seconds partition
timekey_wait 1m
timekey_use_utc true # use utc
flush_thread_count 8
chunk_limit_size 256M
# total_limit_size is set to 70% of the data disk so that 1 single output can't use more than this
total_limit_size 69G
</buffer>
</store>
</match>
The above match for tenant_in_mono is repeated near-identically for every sky.top_tenant pattern, so I've cut those out. I've also masked out the hosts and passwords for obvious reasons.
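As a side note on the S3 store's path template and buffer chunk keys shown above, here is roughly what they resolve to per flushed chunk (a sketch with assumed record values, not real bucket contents):

```ruby
require "time"

# Sketch with assumed values: the chunk keys in <buffer service.name,
# sky.top_tenant,time> let fluent-plugin-s3 fill the ${...} and %Y/%m/%d
# placeholders per chunk, so objects land under prefixes like this one.
top_tenant   = "mytv"            # assumed ${sky.top_tenant}
service_name = "mytv-service"    # assumed ${service.name}
chunk_time   = Time.parse("2022-10-15 00:16:00 UTC")

prefix = "#{top_tenant}/#{service_name}/#{chunk_time.strftime('%Y/%m/%d/')}"
puts prefix   # => "mytv/mytv-service/2022/10/15/"
```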
@domokap I checked your configuration. I think it is basically correct.
Considering that Elasticsearch receives the same entry fine (despite the corruption in S3), I suspect there is some racy bug in the S3 output plugin.
In S3:
\n�@timestamp�2022-10-15T00:16:01.074+00:00�@version\u0001�message�XMaking
In Elasticsearch:
"@timestamp\":\"2022-10-14T23:16:01.074+00:00\",\"@version\":1,\"message\":\"Making
I'll find some time to try to reproduce this issue. WFM.
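For anyone trying to pick the affected objects out of the daily 250-300k, one quick check is to scan the downloaded objects for byte sequences that are not valid UTF-8 (a hypothetical helper script, not something shipped with Fluentd; the s3-dump/ directory name is an assumption):

```ruby
# Hypothetical helper, not part of Fluentd or fluent-plugin-s3: scan locally
# downloaded S3 objects for lines that are not valid UTF-8, which is how the
# corrupted records described in this issue stand out.
Dir.glob("s3-dump/**/*.json") do |path|
  File.open(path, "rb") do |f|
    f.each_line.with_index(1) do |line, lineno|
      next if line.dup.force_encoding(Encoding::UTF_8).valid_encoding?
      puts "#{path}:#{lineno}: invalid UTF-8 bytes"
    end
  end
end
```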
@fujimotos thank you. I had ruled out that it is coming from upstream but I will endeavour to double check again with our logging teams.
Thanks again.
Hi @fujimotos I found some more weird related behaviour with the S3 plugin when it is creating the bucket paths. In the config we have path ${sky.top_tenant}/${service.name}/%Y/%m/%d/. Here is a summary of the service.name folders under one sky.top_tenant folder:
Note however that the only two valid service.name values, and therefore folder names, should be mytv-service and mytv-service-recs. So the other ones must be a result of corruption(?).
Several of us suspect that it may be due to a race condition(?).
As some extra context, in this environment we have 42 EC2 nodes running Fluentd.
Hope this extra information helps.
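One way to quantify those unexpected folders is to list the "service.name" prefixes under a tenant and diff them against the expected set. This is only a sketch using the aws-sdk-s3 gem; the bucket, region, and tenant prefix are taken from the config above, but treat the exact values as assumptions (pagination is also ignored for brevity):

```ruby
require "aws-sdk-s3"   # aws-sdk-s3 gem; values below are assumptions

# List the "service.name" folders under one tenant prefix and flag any that
# are not in the expected set, to quantify the corrupted folder names.
s3   = Aws::S3::Client.new(region: "us-west-2")
resp = s3.list_objects_v2(
  bucket:    "nbcu-topslogs-prod-us-west-2",
  prefix:    "mytv/",        # assumed sky.top_tenant folder
  delimiter: "/"
)

expected   = %w[mytv/mytv-service/ mytv/mytv-service-recs/]
unexpected = resp.common_prefixes.map(&:prefix) - expected
puts unexpected   # folder names that should not exist
```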
This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 7 days
Describe the bug
We receive roughly 2 billion logs through our Fluentd every 24h, which are routed to Elasticsearch and also to S3 buckets. Within the S3 plugin we have a timekey of 30s, which results in roughly 250-300k JSON objects created every 24h in our S3 buckets. Between 100 and 200 of these contain some sort of corruption within the object in S3 - usually non-UTF-8 characters where they shouldn't be, missing characters, or jumbled strings.
To Reproduce
We have been unable to manually reproduce this on demand using fluent-cat to ingest a log which ended up with corruption in S3. We are happy to provide more examples if required.
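For repeated injection attempts, a scriptable variant of the same test is possible with the fluent-logger gem (this is an alternative to fluent-cat, not what was used here; host, port, and tag are assumptions matching the plain forward input on 5170 in the configuration above):

```ruby
require "fluent-logger"   # fluent-logger gem speaks the same forward protocol
                          # that fluent-cat uses

# Host, port and tag are assumptions matching the forward <source> on 5170.
log = Fluent::Logger::FluentLogger.new(nil, host: "127.0.0.1", port: 5170)
log.post("fluentd", "message" => "corruption repro test",
                    "sky"     => { "top_tenant" => "mytv" })
```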
Expected behavior
The logs within the JSON objects in S3 contain no corruption or non-UTF-8 characters.
Your Environment
Your Configuration
Your Error Log
Additional context
Here is the hex code for the corrupted version: