danielejuan-metr opened this issue 1 year ago
Hey, could you go over the debugging guide to see if it helps?
For info on metrics emitted by Fluent Bit: https://docs.fluentbit.io/manual/administration/monitoring#metric-descriptions
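For example, with the built-in HTTP server and storage metrics enabled you can see how many chunks are buffered in memory versus on the filesystem. A minimal sketch (the port and paths below are the documented defaults):

[SERVICE]
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_Port       2020
    storage.metrics On

# then, from the node or via kubectl port-forward:
curl -s http://127.0.0.1:2020/api/v1/metrics
curl -s http://127.0.0.1:2020/api/v1/storage

The /api/v1/storage response reports total chunks plus how many are "up" (in memory) and "down" (on disk), which is useful for the buffering questions discussed below.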
Additional info - AWS EKS version is 1.22 so still using dockershim.
With Docker, k8s uses this log driver: https://docs.docker.com/config/containers/logging/json-file/
It has max-file and max-size settings. It is possible to lose logs because:
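For context, those limits live in the Docker daemon configuration (or the equivalent kubelet/runtime settings); an example daemon.json with illustrative values only:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Once a container log reaches max-size it is rotated, and files beyond max-file are deleted, so anything Fluent Bit has not yet read from a deleted file is gone.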
Hi @PettitWesley, thanks for the response. We have considered the scenarios in your list. Hoping you can help us identify whether we have incorrect expectations or assumptions.
For scenario 1, we use filesystem buffering for our tail input. Assuming that no new logs were written, we expect Fluent Bit's internal buffers to be empty if the pipeline is sending logs out properly. However, we see a high storage size for the tail buffer directory and a high number of down chunks in the metrics. We cannot identify why the buffer is not being consumed and the chunks sent to S3.
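For reference, a minimal sketch of what filesystem buffering on a tail input looks like (the paths and limits below are placeholders, not our exact values):

[SERVICE]
    storage.path              /var/fluent-bit/state/flb-storage/
    storage.sync              normal
    storage.backlog.mem_limit 5M
    storage.metrics           on

[INPUT]
    Name          tail
    Tag           application.*
    Path          /var/log/containers/*.log
    storage.type  filesystem
    Mem_Buf_Limit 5MB

[OUTPUT]
    Name                     s3
    Match                    application.*
    storage.total_limit_size 500M

With this layout, chunks that cannot be flushed accumulate under storage.path as "down" chunks, which is what the high down-chunk count and directory size we saw would correspond to.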
For scenario 2, we would expect at least some logs to be uploaded to S3 if Fluent Bit could not keep up with the incoming log volume. However, during the issue no logs at all were uploaded to S3.
Please see our log loss investigation runbook: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#log-loss-investigation-runbook
Hi @PettitWesley, with the same environment and configuration as above, we are now testing with a high log load.
At 8 MB/s of application log throughput, we are encountering the following Fluent Bit errors and missing log entries in S3:
[2023/08/29 16:53:36] [error] [tls] error: error:00000005:lib(0):func(0):DH lib
[2023/08/29 16:53:36] [error] [src/flb_http_client.c:1199 errno=25] Inappropriate ioctl for device
[2023/08/29 16:53:36] [error] [plugins/in_tail/tail_file.c:1432 errno=2] No such file or directory
[2023/08/29 16:53:36] [error] [plugins/in_tail/tail_fs_inotify.c:147 errno=2] No such file or directory
[2023/08/29 16:53:36] [error] [input:tail:tail.0] inode=16798413 cannot register file /var/log/containers/container-5fcd9d46b5-vgsk6_container-9999550bf6611e4082208c7eaee6b4d8f1784316b6861e2afe06d224b330341f.log
During this test we are also seeing connection errors to Splunk.
Can this be a symptom of scenario 2?
We have seen similar logs in this ticket: https://github.com/fluent/fluent-bit/issues/3039. Could you verify whether the latest stable release of aws-for-fluent-bit includes the patch from that ticket?
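For anyone cross-checking, one way to see which upstream Fluent Bit version a given aws-for-fluent-bit image bundles (assuming the image keeps the upstream binary at /fluent-bit/bin/fluent-bit) is:

docker pull public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
docker run --rm --entrypoint /fluent-bit/bin/fluent-bit \
    public.ecr.aws/aws-observability/aws-for-fluent-bit:stable --version

The release notes for each aws-for-fluent-bit version also list the bundled Fluent Bit version.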
We are unable to find documentation in https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md about the "cannot register file" error.
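In case it is relevant to others: the in_tail "cannot register file" and "No such file or directory" errors generally appear when a container log file is rotated or deleted before the plugin manages to open it. A tail input sketch showing the rotation-related knobs (the values are only assumptions for illustration, not maintainer recommendations):

[INPUT]
    Name              tail
    Tag               application.*
    Path              /var/log/containers/*.log
    Refresh_Interval  5
    Rotate_Wait       30
    Buffer_Chunk_Size 512k
    Buffer_Max_Size   5M
    Skip_Long_Lines   On
    storage.type      filesystem

Rotate_Wait controls how long a rotated file is still monitored, and Refresh_Interval controls how often the path pattern is re-scanned for new files.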
I'm not sure to be honest, sorry. The first two errors are very common and not to worry about necessarily: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#common-network-errors
We're seeing something similar. Randomly, our fluentbit container gets "stuck". In addition to not routing logs to the destination (s3), logs for the fluentbit container itself also stop being emitted. It's like the process dies, or hangs on something.
No noticeable spikes in resources. The container didn't OOM.
Logs:
2024-01-03 23:12:49 UTC | TRACE | INFO | (pkg/trace/info/stats.go:91 in LogAndResetStats) | No data received
[2024/01/03 23:12:55] [debug] [output:s3:s3.1] Running upload timer callback (cb_s3_upload)..
Any pointers?
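Not a full answer, but a couple of checks that might narrow it down, assuming the built-in HTTP server is enabled on the default port 2020: see whether the process still responds at all, and whether the output metrics and storage chunk counts keep changing after it gets "stuck".

kubectl -n <namespace> port-forward pod/<fluent-bit-pod> 2020:2020 &
curl -s http://127.0.0.1:2020/api/v1/uptime
curl -s http://127.0.0.1:2020/api/v1/metrics
curl -s http://127.0.0.1:2020/api/v1/storage

If uptime still answers but the metrics stop moving, the pipeline is wedged rather than the process being dead; if nothing answers, the process itself is likely hung.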
Describe the question/issue
May we request assistance with an incident we encountered in our development environment? After running Fluent Bit in EKS for some time, it suddenly stopped sending files to S3 for tags that match application.*, as described in the config below. There were no updates or activities on the Fluent Bit application. We were expecting 'Successfully uploaded object' in the Fluent Bit logs for the application logs, but we only saw successful uploads for the dataplane and host logs.
We were unable to find similar issues here on GitHub; hoping you can advise on steps to possibly reproduce the issue.
Configuration
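A sketch of the S3 output for the application.* tag, with placeholder values rather than our literal settings:

[OUTPUT]
    Name            s3
    Match           application.*
    bucket          <bucket-name>
    region          <aws-region>
    total_file_size 50M
    upload_timeout  10m
    store_dir       /var/fluent-bit/s3
    s3_key_format   /application/$TAG/%Y/%m/%d/%H_%M_%S_$UUID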
Fluent Bit Log Output
During the issue we were expecting to see "Successfully uploaded object" in the Fluent Bit logs, but we did not. Our log level was at info, so we decided to update it to debug to investigate. After restarting, the issue did not occur anymore and Fluent Bit started sending the backlog. We did not see any error logs from Fluent Bit during the issue.
Logs after restarting:
Metrics during the issue
We also noticed that the storage utilization of fluentbit is high:
Fluent Bit Version Info
Cluster Details
Application Details
Very low traffic since the issue was encountered in a dev environment.
Steps to reproduce issue
We are unable to reproduce the issue
EDIT:
Additional info - AWS EKS version is 1.22 so still using dockershim.