alexmbird opened this issue 3 years ago · Status: Open
@PettitWesley, still no appearance of the '...sequence token in response..' errors after another 24 hours of operation.
FYI: I do still see occasional 'Failed to send log events' errors from each of the fluentbit instances:
[2021/09/15 17:04:35] [ warn] [http_client] malformed HTTP response from logs.us-east-1.amazonaws.com:443 on connection #165
[2021/09/15 17:04:35] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/09/15 17:04:35] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/09/15 17:04:35] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send events
@byrneo Awesome! I will put up my patch to upstream then... we've also recently found some other upstream networking bugs that I suspect might be involved in this issue as well... so while this issue has been open for a very long time, I am finally feeling hopeful that we will soon have a resolution. It'll take a little bit of time to get these fixes merged and released upstream and incorporated into our distro.
Apologies to everyone that the resolution here has taken so long.
@PettitWesley I've updated to the 2.19.1 image but I still see errors and gaps in the logs of my EKS applications:
fluent-bit-9vs2j fluent-bit [2021/09/23 15:50:51] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
fluent-bit-9vs2j fluent-bit [2021/09/23 15:50:51] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
fluent-bit-9vs2j fluent-bit [2021/09/23 15:50:51] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send log events
fluent-bit-9vs2j fluent-bit [2021/09/23 15:50:51] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send events
Here is my Fluent Bit config (an illustrative network-tuning sketch follows after it):
[SERVICE]
Flush 5
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server ${HTTP_SERVER}
HTTP_Listen 0.0.0.0
HTTP_Port ${HTTP_PORT}
storage.path /var/fluent-bit/state/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 5M
@INCLUDE application-log.conf
@INCLUDE dataplane-log.conf
@INCLUDE host-log.conf
application-log.conf: |
[INPUT]
Name tail
Tag application.*
Exclude_Path /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
Path /var/log/containers/*.log
Docker_Mode On
Docker_Mode_Flush 5
Docker_Mode_Parser container_firstline
Parser docker
DB /var/fluent-bit/state/flb_container.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Rotate_Wait 30
storage.type filesystem
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name kubernetes
Match application.*
Kube_URL https://kubernetes.default.svc:443
Kube_Tag_Prefix application.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
Labels On
Annotations Off
[OUTPUT]
Name cloudwatch
Match application.*
region ${AWS_REGION}
default_log_group_name FluentBit-Application
log_group_name /aws/containerinsights/${CLUSTER_NAME}/$(kubernetes['namespace_name'])/$(kubernetes['labels']['app'])
log_stream_prefix ${HOST_NAME}-
auto_create_group true
extra_user_agent container-insights
dataplane-log.conf: |
[INPUT]
Name systemd
Tag dataplane.systemd.*
Systemd_Filter _SYSTEMD_UNIT=docker.service
DB /var/fluent-bit/state/systemd.db
Path /var/log/journal
Read_From_Tail ${READ_FROM_TAIL}
[INPUT]
Name tail
Tag dataplane.tail.*
Path /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
Docker_Mode On
Docker_Mode_Flush 5
Docker_Mode_Parser container_firstline
Parser docker
DB /var/fluent-bit/state/flb_dataplane_tail.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Rotate_Wait 30
storage.type filesystem
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name modify
Match dataplane.systemd.*
Rename _HOSTNAME hostname
Rename _SYSTEMD_UNIT systemd_unit
Rename MESSAGE message
Remove_regex ^((?!hostname|systemd_unit|message).)*$
[FILTER]
Name aws
Match dataplane.*
imds_version v1
[OUTPUT]
Name cloudwatch_logs
Match dataplane.*
region ${AWS_REGION}
log_group_name /aws/containerinsights/${CLUSTER_NAME}-fluentbit/dataplane
log_stream_prefix ${HOST_NAME}-
auto_create_group true
extra_user_agent container-insights
host-log.conf: |
[INPUT]
Name tail
Tag host.dmesg
Path /var/log/dmesg
Parser syslog
DB /var/fluent-bit/state/flb_dmesg.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag host.messages
Path /var/log/messages
Parser syslog
DB /var/fluent-bit/state/flb_messages.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag host.secure
Path /var/log/secure
Parser syslog
DB /var/fluent-bit/state/flb_secure.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name aws
Match host.*
imds_version v1
[OUTPUT]
Name cloudwatch_logs
Match host.*
region ${AWS_REGION}
log_group_name ${CLUSTER_NAME}-fluentbit/host
log_stream_prefix ${HOST_NAME}.
auto_create_group true
extra_user_agent container-insights
parsers.conf: |
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%LZ
[PARSER]
Name syslog
Format regex
Regex ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
Time_Key time
Time_Format %b %d %H:%M:%S
[PARSER]
Name container_firstline
Format regex
Regex (?<tag>[^.]+)?\.?(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%LZ
[PARSER]
Name cwagent_firstline
Format regex
Regex (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%LZ
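For anyone experimenting with the connection errors reported above ('malformed HTTP response', 'x-amzn-RequestId header not found'), here is a minimal, illustrative sketch of the network-level knobs that can be tuned on a cloudwatch_logs output. The net.* properties are standard Fluent Bit output options, but the values (and the log group name) below are assumptions for illustration, not a confirmed fix for this issue:

[OUTPUT]
    Name cloudwatch_logs
    Match application.*
    region ${AWS_REGION}
    # illustrative log group; substitute whatever your output actually uses
    log_group_name /aws/containerinsights/${CLUSTER_NAME}/application
    log_stream_prefix ${HOST_NAME}-
    auto_create_group true
    # net.* values below are guesses for experimentation, not recommendations:
    # keep TCP keepalive on but recycle idle connections sooner, and allow a
    # longer connect timeout, so stale connections are less likely to be reused
    net.keepalive on
    net.keepalive_idle_timeout 10
    net.connect_timeout 30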
@byrneo @Funkerman1992 and everyone: an update on this.
So basically we have two fixes in progress:
All of these fixes will take some time to make their way upstream. Right now, everyone can use our branches and pre-release/test builds if they want.
For the core networking fixes, the code is here: https://github.com/krispraws/fluent-bit/commits/v1_7_5_openssl_fix
Image is here: 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5
Pull it with:
ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5
For the sequence token stop-gap, the code is here: https://github.com/PettitWesley/fluent-bit/tree/v1_7_5_openssl_fix_sequence_token
Image is here: 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5-sequence-token-stop-gap
Pull it with:
ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5-sequence-token-stop-gap
Hope this helps; let me know what you see.
@PettitWesley We have hit this issue on a client project. The workaround of Flush 5 and a 1 MB chunk size (sketched below) worked, but it would not withstand the prod peak load, so we are eagerly waiting for your fix. Could you please let us know when this fix will be pushed to the upstream Docker image (aws-for-fluent-bit)? Or did you already include it in aws-for-fluent-bit:2.20.0?
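For reference, a minimal sketch of what that workaround is usually taken to mean; the exact option names rpalanisamy used aren't shown in the thread, so this assumes the [SERVICE] Flush interval plus the tail input's buffer sizes:

[SERVICE]
    # flush buffered records every 5 seconds
    Flush 5

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    # cap the tail read buffer so input chunks (and each resulting
    # PutLogEvents call) stay small; 1MB is the value mentioned above
    Buffer_Chunk_Size 1MB
    Buffer_Max_Size 1MB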
@rpalanisamy The networking fixes have been included in 2.20.0. Please try out that version.
The sequence token stop gap wasn't included in that release; it may land in a future one soon. I'm hoping that solving the networking issues will solve this, and that the stop-gap fix won't be needed.
@PettitWesley I'm still using your stop-gap fix 144718711470.dkr.ecr.us-west-2.amazonaws.com/core-network-fixes:1.7.5-sequence-token-stop-gap
Are the fixes now available in any official releases? Would you now recommend upgrading to v2.23.3?
@byrneo Yeah, this was contributed upstream, and it's much safer to use the newest release than my old image.
Hi folks,
We've been running the 2.12.0 release to ship our logs to CloudWatch with the new cloudwatch_logs plugin. We've been waiting for the fix to renewing STS tokens, so this is our first outing with it. After running reliably for several hours, several of our pods have crashed with:
After that it exits with an error status and Kubernetes replaces the pod.
Curiously, several replicas of fluentbit failed with the same error at once. This makes me wonder if the CloudWatch API was briefly unavailable. But if so, I'd expect the behaviour to be that it retries rather than taking down the whole fluentbit replica.
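On the retry point: Fluent Bit outputs do have a per-output retry setting, so here is a minimal sketch of what tuning it could look like. The log group, stream prefix, region, and retry count are assumptions for illustration, not values from this thread:

[OUTPUT]
    Name cloudwatch_logs
    Match *
    region us-east-1
    # hypothetical log group for illustration
    log_group_name fluent-bit-example
    log_stream_prefix example-
    auto_create_group true
    # retry a failed flush up to 5 times before the chunk is dropped;
    # setting Retry_Limit to False removes the limit entirely
    Retry_Limit 5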