alexmbird opened this issue 3 years ago
Hey @alexmbird a few questions:
Hey there,
Can you share your full config
Sure. I'd rather not post it publicly so have you an email address I can use?
When you say other replicas failed with the same error, is the full output the same?
The outputs are similar but not quite identical. Here are ~~two~~ three more examples:
$ kubectl -n cluster-fluentbit logs --previous fluent-bit-csdp4
[2021/03/09 21:46:00] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2021/03/09 21:46:00] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 21:46:00] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/09 21:46:00] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send events
[2021/03/09 21:46:01] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Could not find sequence token in response: response body is empty
[lib/chunkio/src/cio_file.c:786 errno=9] Bad file descriptor
[2021/03/09 21:46:01] [error] [storage] [cio_file] error setting new file size on write
[2021/03/09 21:46:01] [error] [input chunk] error writing data from tail.5 instance
[2021/03/09 21:46:01] [ warn] [engine] failed to flush chunk '1-1615326351.539869750.flb', retry in 7 seconds: task_id=1, input=tail.5 > output=cloudwatch_logs.5 (out_id=5)
$ kubectl -n cluster-fluentbit describe po/fluent-bit-csdp4
...
Last State: Terminated
Reason: Error
Exit Code: 135
This one explicitly mentions a ServiceUnavailableException
...
$ kubectl -n cluster-fluentbit logs --previous fluent-bit-8cpsn
[2021/03/10 12:09:04] [error] [aws_client] connection initialization error
[2021/03/10 12:09:04] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Failed to send log events
[2021/03/10 12:09:04] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Failed to send log events
[2021/03/10 12:09:04] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Failed to send events
[2021/03/10 12:09:05] [ warn] [engine] failed to flush chunk '1-1615378132.172180559.flb', retry in 6 seconds: task_id=1, input=tail.6 > output=cloudwatch_logs.4 (out_id=4)
[2021/03/10 12:14:07] [ info] [input:tail:tail.7] inode=229642033 handle rotation(): /var/log/containers/<redacted>.log => /var/lib/docker/containers/<redacted>.log.1
[2021/03/10 12:14:07] [ info] [input:tail:tail.7] inotify_fs_remove(): inode=229642033 watch_fd=25
[2021/03/10 12:14:07] [ info] [input:tail:tail.7] inotify_fs_add(): inode=229642033 watch_fd=28 name=/var/lib/docker/containers/<redacted>-json.log.1
[2021/03/10 12:14:07] [ info] [input:tail:tail.7] inotify_fs_add(): inode=229642035 watch_fd=29 name=/var/log/containers/<redacted>.log
[2021/03/10 12:14:50] [ info] [input:tail:tail.7] inotify_fs_remove(): inode=229642033 watch_fd=28
[2021/03/10 13:17:53] [error] [output:cloudwatch_logs:cloudwatch_logs.5] PutLogEvents API responded with error='ServiceUnavailableException', message='The server failed to fulfill the request. Please try again.'
[2021/03/10 13:17:53] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/10 13:17:53] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/10 13:17:53] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send events
[2021/03/10 13:17:53] [ warn] [engine] failed to flush chunk '1-1615382259.778215743.flb', retry in 11 seconds: task_id=4, input=tail.7 > output=cloudwatch_logs.5 (out_id=5)
$ kubectl -n cluster-fluentbit describe po/fluent-bit-8cpsn
...
Last State: Terminated
Reason: Error
Exit Code: 135
And this one is broadly similar...
$ kubectl -n cluster-fluentbit logs --previous fluent-bit-njw9l
[2021/03/10 22:02:51] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=299900464 watch_fd=35
[2021/03/10 22:07:33] [ info] [input:tail:tail.5] inode=292557536 handle rotation(): /var/log/containers/<redacted>.log => /var/lib/docker/containers/<redacted>-json.log.1
[2021/03/10 22:07:33] [ info] [input:tail:tail.5] inotify_fs_add(): inode=292557537 watch_fd=37 name=/var/log/containers/<redacted>.log
[2021/03/10 22:08:22] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=292557536 watch_fd=34
[2021/03/10 22:10:10] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Could not find sequence token in response: response body is empty
[2021/03/10 22:10:11] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Could not find sequence token in response: response body is empty
$ kubectl -n cluster-fluentbit describe po/fluent-bit-njw9l
...
Last State: Terminated
Reason: Error
Exit Code: 135
There's no more error output? It just crashes after that? No stacktrace and no signal information?
Affirmative - the output ends abruptly
Hi Folks
I'm having this error too. Do you know if there is another version that works? I'm running the latest. Is there any update on this issue?
@alexmbird Sorry that I missed your update.
Sure. I'd rather not post it publicly so have you an email address I can use?
Thing is... I'd also rather not share my work email publicly. Can you DM me on Twitter? https://twitter.com/TheWesleyPettit I'm also on the Fluent Slack group.
@alexmbird @juanmanuelito Is this issue consistent and repeatable or is it occasional and random? Are you continually getting sequence token errors with this version? Have you tried other versions?
This issue is weird.. we didn't make any code changes recently that affect sequence tokens. I also can't repro it myself.
I have a third report of this exact same issue via an internal system... in that case the customer found that once they tested it again the issue did not return. And they have not experienced it again. The occurrence was around the same time as when @alexmbird saw it.
[2021/03/09 22:22:55] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
This is weird. And it's not clear that it's anything wrong with the plugin code. I am wondering if the CloudWatch Logs API was undergoing some maintenance at this time and there was a brief accidental outage in sequence token processing. That would explain this other report that I have where it was an issue one day, and then the next everything was fine and it never occurred again.
Hi @PettitWesley. I can confirm it's still happening to us - I'd say a couple of dozen crashes a day across our cluster, which rather spams our alerts channel!
I've tried to DM you on Twitter but I get this:
I am @alexhewson2 if you'd like to try.
@alexmbird Ok, never mind on Twitter, I don't use it enough and don't know how it actually works. I checked and realized I did create an email address for AWS Fluent Bit, which I am pretty sure is open to receive external email: aws-for-fluent-bit@amazon.com. Try that.
I can confirm it's still happening to us
Just to be clear- the sequence token error messages are recurring? Or just that fluent bit is crashing?
I've sent a copy of the config to aws-for-fluent-bit@amazon.com.
Yes, I'm still getting errors about sequence tokens, and shortly afterwards Fluent Bit crashes. Here's the output from the most recent one, which happened two hours ago.
$ kubectl -n cluster-fluentbit logs --previous fluent-bit-rvmk6
...
[2021/03/26 07:07:47] [ warn] [output:cloudwatch_logs:cloudwatch_logs.2] [size=367370] Truncating event which is larger than max size allowed by CloudWatch
[2021/03/26 07:07:51] [error] [src/flb_http_client.c:1163 errno=32] Broken pipe
[2021/03/26 07:07:51] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/26 07:07:51] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send log events
[2021/03/26 07:07:51] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send events
[2021/03/26 07:07:58] [ warn] [output:cloudwatch_logs:cloudwatch_logs.2] [size=323578] Truncating event which is larger than max size allowed by CloudWatch
[2021/03/26 07:07:58] [ warn] [output:cloudwatch_logs:cloudwatch_logs.2] [size=302934] Truncating event which is larger than max size allowed by CloudWatch
[2021/03/26 07:08:01] [ warn] [output:cloudwatch_logs:cloudwatch_logs.5] [size=323578] Truncating event which is larger than max size allowed by CloudWatch
[2021/03/26 07:08:01] [ warn] [output:cloudwatch_logs:cloudwatch_logs.5] [size=302934] Truncating event which is larger than max size allowed by CloudWatch
[lib/chunkio/src/cio_file.c:961 errno=22] Invalid argument
[2021/03/26 07:08:01] [error] [storage] [cio file sync] error adjusting size at: tail.6/1-1616742448.432613720.flb
[2021/03/26 07:08:01] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Could not find sequence token in response: response body is empty
[2021/03/26 07:08:02] [ warn] [output:cloudwatch_logs:cloudwatch_logs.5] [size=267670] Truncating event which is larger than max size allowed by CloudWatch
[2021/03/26 07:08:02] [ info] [input:tail:tail.5] inotify_fs_remove(): inode=108007185 watch_fd=172
[2021/03/26 07:08:02] [engine] caught signal (SIGSEGV)
I've examined the output from this pod in more detail and it looks like the "Could not find sequence token" error happened a few times during its lifetime, not just immediately before it crashed. It might therefore be that we're looking at two separate issues.
@alexmbird The sequence token thing is very strange and I am doubtful it's the cause of the crash. I'm also struggling to figure out how Fluent Bit could cause this problem; it seems more likely it's the CW API. I am wondering if the crash is actually caused by this: https://github.com/fluent/fluent-bit/issues/2933
I have two ideas to try:
1. Run this debug image, which is built with Valgrind and should print a stack trace / leak report if it crashes: 144718711470.dkr.ecr.us-west-2.amazonaws.com/valgrind-2_12_0:latest
2. Switch back to the regular image but disable filesystem buffering (storage.type memory), to see whether the crash is tied to the chunkio file errors.
[2021/03/31 16:38:18] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/31 16:38:51] [ warn] [net] getaddrinfo(host='logs.us-west-2.amazonaws.com'): Name or service not known
[2021/03/31 16:38:51] [error] [aws_client] connection initialization error
[2021/03/31 16:38:56] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/03/31 16:38:56] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/03/31 16:38:56] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send events
[2021/03/31 16:38:56] [ warn] [engine] failed to flush chunk '1-1617208732.125323519.flb', retry in 10 seconds: task_id=7, input=tail.4 > output=cloudwatch_logs.1 (out_id=1)
[2021/03/31 16:39:01] [ info] [input:tail:tail.0] inotify_fs_add(): inode=38951155 watch_fd=20 name=/var/log/containers/cert-manager-cainjector-8c8c66cf5-kjhnf_cert-manager_cert-manager-cc057e295d360d5eee8052729576b97e5743e0f20c1d92d85a684547f6b99109.log
[2021/03/31 16:39:01] [ info] [input:tail:tail.0] inotify_fs_add(): inode=19266564 watch_fd=21 name=/var/log/containers/coredns-6548845887-8wsr8_kube-system_coredns-de13282d1e52792a0c259964f66ba4fd13b06d1d11dea4f9ae3e5aa826cb6118.log
[2021/03/31 16:39:01] [ info] [input:tail:tail.0] inotify_fs_add(): inode=13805768 watch_fd=22 name=/var/log/containers/monitoring-kube-prometheus-operator-56c844c5b8-r8779_monitoring_kube-prometheus-stack-aff30feca6eac1b631e339d1469f710b254eae6cfe3ce3846121bd5f9a0cb681.log
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.cert-manager-cainjector-8c8c66cf5-kjhnf_cert-manager_cert-manager-cc057e295d360d5eee8052729576b97e5743e0f20c1d92d85a684547f6b99109.log in log group /aws/containerinsights/ssprod-or-toolp-eks-cluster-prod-ita-tools/application
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Created log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.cert-manager-cainjector-8c8c66cf5-kjhnf_cert-manager_cert-manager-cc057e295d360d5eee8052729576b97e5743e0f20c1d92d85a684547f6b99109.log
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.coredns-6548845887-8wsr8_kube-system_coredns-de13282d1e52792a0c259964f66ba4fd13b06d1d11dea4f9ae3e5aa826cb6118.log in log group /aws/containerinsights/ssprod-or-toolp-eks-cluster-prod-ita-tools/application
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Created log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.coredns-6548845887-8wsr8_kube-system_coredns-de13282d1e52792a0c259964f66ba4fd13b06d1d11dea4f9ae3e5aa826cb6118.log
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.monitoring-kube-prometheus-operator-56c844c5b8-r8779_monitoring_kube-prometheus-stack-aff30feca6eac1b631e339d1469f710b254eae6cfe3ce3846121bd5f9a0cb681.log in log group /aws/containerinsights/ssprod-or-toolp-eks-cluster-prod-ita-tools/application
[2021/03/31 16:39:06] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Created log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.monitoring-kube-prometheus-operator-56c844c5b8-r8779_monitoring_kube-prometheus-stack-aff30feca6eac1b631e339d1469f710b254eae6cfe3ce3846121bd5f9a0cb681.log
[2021/03/31 16:39:12] [ info] [input:tail:tail.0] inotify_fs_add(): inode=23318123 watch_fd=23 name=/var/log/containers/argocd-application-controller-2_argocd_application-controller-3b3867761391ee260ebdbbf279acaf130c19e44ef43388ef1c4f91cc8cea523c.log
[2021/03/31 16:39:16] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.argocd-application-controller-2_argocd_application-controller-3b3867761391ee260ebdbbf279acaf130c19e44ef43388ef1c4f91cc8cea523c.log in log group /aws/containerinsights/ssprod-or-toolp-eks-cluster-prod-ita-tools/application
[2021/03/31 16:39:16] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Created log stream ip-10-46-72-144.us-west-2.compute.internal-application.var.log.containers.argocd-application-controller-2_argocd_application-controller-3b3867761391ee260ebdbbf279acaf130c19e44ef43388ef1c4f91cc8cea523c.log
[2021/03/31 16:39:32] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Could not find sequence token in response: response body is empty
[2021/03/31 17:44:37] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/31 17:55:13] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/31 18:21:22] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
[2021/03/31 18:46:33] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Could not find sequence token in response: response body is empty
Looks like it's unable to recover from the connection error.
@abhinavkhanna-sf If your company has a Support relationship with AWS, please consider opening a ticket for the sequence token issue.
I do not currently have any reason to believe it is a bug in Fluent Bit or a connection issue. If Fluent Bit were dropping HTTP responses, we would expect to see the same problem in other plugins as well, since all core plugins use the same HTTP stack.
Also, one option worth considering is trying out the old cloudwatch plugin to see if you still get the same error. That's the old Go plugin; it's not as efficient, but if you have time to test I am curious whether the same issue occurs with it. If it does, then we know it's nothing in the code (the Go plugin uses an entirely different HTTP client) and that it is truly an issue in the CW API.
Just change the name of your output:
[OUTPUT]
Name cloudwatch
...
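For example, a fuller sketch of what the swapped output could look like - the Match pattern, group name, and stream prefix below are placeholders to adapt, and I believe the old Go plugin accepts these same core options:
[OUTPUT]
    Name              cloudwatch
    Match             application.*
    region            ${AWS_REGION}
    log_group_name    /aws/containerinsights/${CLUSTER_NAME}/application
    log_stream_prefix ${HOST_NAME}-
    auto_create_group true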
@PettitWesley I'm trying to run your Valgrind build on our dev environment. Early results look like this:
kubectl -n cluster-fluentbit get po
NAME READY STATUS RESTARTS AGE
fluent-bit-9bfvq 0/1 CrashLoopBackOff 10 23m
fluent-bit-c9ssg 0/1 CrashLoopBackOff 10 23m
fluent-bit-fd4xc 0/1 CrashLoopBackOff 10 23m
fluent-bit-frz8s 1/1 Running 0 23m
fluent-bit-jpx54 0/1 Running 11 23m
fluent-bit-k2tn4 0/1 CrashLoopBackOff 9 23m
fluent-bit-lmp4g 0/1 OOMKilled 11 23m
fluent-bit-pvpqc 0/1 Running 11 23m
fluent-bit-t6x4h 1/1 Running 0 23m
fluent-bit-wzxrw 1/1 Running 0 23m
fluent-bit-zfn8b 0/1 CrashLoopBackOff 10 23m
fluent-bit-znb6x 1/1 Running 0 23m
I'll get back to you with more once I've made sense of these CrashLoopBackOffs.
I've emailed you a log from one of the pods. Curiously it isn't printing a stacktrace before exiting, though there's plenty of output from Valgrind at the start to prove it's working.
Following the second suggestion, I've switched back to the regular image and disabled disk buffering entirely with storage.type memory. But that's not a satisfactory long-term solution for us: our applications log about 1TB of data a month, and given the brief CloudWatch outages we're already seeing, it's vital that the clusters can buffer everything without dropping data.
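Roughly, the change is just the per-input storage.type key; a minimal sketch with illustrative values rather than our exact config:
[INPUT]
    Name          tail
    Tag           application.*
    Path          /var/log/containers/*.log
    DB            /var/fluent-bit/state/flb_container.db
    Mem_Buf_Limit 50MB
    # previously: storage.type filesystem
    storage.type  memory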
FluentBit has run without file buffering for two days without a single crash. I think we've isolated the problem.
Interestingly...
$ kubectl -n cluster-fluentbit logs -l app.kubernetes.io/instance=fluentbit | grep -ci "Could not find sequence token"
0
This is across 30 pods that have been running for two days. I take it to mean either whatever was wrong with CloudWatch's API has been fixed, or that the error was somehow caused by the use of file buffering.
Given our logging volumes, disabling file buffering isn't a good long-term solution. I'm already seeing errors like:
[2021/04/03 08:51:31] [error] [input chunk] no enough space in filesystem to buffer chunk 1-1617439891.605614905.flb in plugin cloudwatch_logs.4
...which means log messages really are being thrown away. Not what we need from our production logging system! A quick grep reveals that error has occurred 70 times in the last few days. I'm not sure what part of FluentBit is responsible for handling these file buffers, so is it something you need to look into or should I open an issue in the main repo?
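From what I can tell, this buffer is governed by the storage settings rather than by the disk actually filling up. A rough sketch of the relevant knobs (illustrative values, not our exact config); my understanding is that the "no enough space" error fires once an output has storage.total_limit_size worth of chunks queued:
[SERVICE]
    storage.path              /var/fluent-bit/state/flb-storage/
    storage.sync              normal
    storage.backlog.mem_limit 5M

[OUTPUT]
    Name                     cloudwatch_logs
    Match                    application.*
    region                   ${AWS_REGION}
    log_group_name           /aws/containerinsights/${CLUSTER_NAME}/application
    # chunks beyond this per-output cap are rejected with the error above
    storage.total_limit_size 500M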
@alexmbird Thank you for the update.
I'm not sure what part of FluentBit is responsible for handling these file buffers, so is it something you need to look into or should I open an issue in the main repo?
Yeah this is not a part of the code that we (AWS) have experience in. Opening an issue upstream is best.
Same problem here.
Out of 10 fluentbit pods, 8 were stuck with this error. Restarting the pods magically fixed everything, but only for a while.
Hello guys, I recently encountered this issue and haven't found a solution.
If you have any information, please consider posting it at https://github.com/fluent/fluent-bit/issues/3477 as well, thanks!
Does anyone see "Sequence token was invalid, will retry" in their debug log output?
So this sequence token error message happens when the plugin receives a 200 response from the CWL API, but the response body is empty and the next sequence token is not found. This is treated as a success case, since the logs were successfully sent and the API returned 200. The next sequence token would be used in the next request to PutLogEvents. If it uses an older one, then the API should return InvalidSequenceTokenException and it should emit the above debug message.
I have managed to reproduce the "Could not find sequence token in response: response body is empty" error message, but I don't see the above message for InvalidSequenceTokenException on the next PutLogEvents call. Which is super weird and makes no sense to me.
So I have made some progress in understanding the root cause of this issue. No promises, but we might have a fix by next week.
A stop gap solution is up: https://github.com/fluent/fluent-bit/pull/3605
My fix for this has been released in Fluent Bit 1.7.9: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.16.0
After deploying version 2.16.0 I'm seeing a lot of these errors in the log. Is this something I should be worried about, or is this expected behavior? As far as I can tell, log events are successfully forwarded to the configured destinations.
[2021/06/19 10:22:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/06/19 10:22:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2021/06/19 10:22:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
[2021/06/19 10:22:21] [ warn] [engine] failed to flush chunk '1-1624098140.919796085.flb', retry in 7 seconds: task_id=0, input=forward.0 > output=cloudwatch_logs.0 (out_id=0)
[2021/06/19 10:30:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/06/19 10:30:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2021/06/19 10:30:21] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
[2021/06/19 10:30:21] [ warn] [engine] failed to flush chunk '1-1624098621.197417164.flb', retry in 10 seconds: task_id=0, input=forward.0 > output=cloudwatch_logs.0 (out_id=0)
Thanks
@psyhomb This is expected. The results of my investigation showed that the old sequence token errors should be failures, since they represented cases where we did not get a full response from the CW API, which means we can't trust that the logs were actually received.
So now I make it fail unless we get a full response with a request ID.
I'm not certain why this happens in the first place- why we get a 200 response but it seems to be incomplete. I noticed in my own testing that this 200 response where we can't get the sequence token is never followed by an invalid sequence token exception in the next call. Which almost certainly indicates that CW didn't actually uptake the last batch of logs.
There is some longer term investigation needed here on the deeper root cause, which is something like one of the following options:
- CW API somehow returns an incomplete response sometimes
- Fluent Bit is somehow incorrectly processing the response and thinks its valid when it isn't.
However, for the short term, the fix that was just released should ensure there is no log loss.
If anyone sees duplicated logs and this new error message, please let me know, that would be interesting. (This did not occur in my testing).
Also, with these bug reports, can folks please quantify what "a lot" of errors means? In my own testing, I never saw either of these errors more than roughly once per day per task. And that was running a bunch of tasks, all shipping ~1000 TPS logs to CW.
Also, with these bug reports, can folks please quantify what "a lot" of errors means? In my own testing, I never saw either of these errors more than roughly once per day per task. And that was running a bunch of tasks, all shipping ~1000 TPS logs to CW.
Sure thing, I have noticed ~88 of these errors in one hour for a single task. But after that everything went back to normal; I don't see any more errors so far.
@psyhomb This is expected. The results of my investigation showed that the old sequence token errors should be failures, since they represented cases where we did not get a full response from the CW API, which means we can't trust that the logs were actually received.
So now I make it fail unless we get a full response with a request ID.
I'm not certain why this happens in the first place- why we get a 200 response but it seems to be incomplete. I noticed in my own testing that this 200 response where we can't get the sequence token is never followed by an invalid sequence token exception in the next call. Which almost certainly indicates that CW didn't actually uptake the last batch of logs.
There is some longer term investigation needed here on the deeper root cause, which is something like one of the following options:
1. CW API somehow returns an incomplete response sometimes
2. Fluent Bit is somehow incorrectly processing the response and thinks its valid when it isn't.
However, for the short term, the fix that was just released should ensure there is no log loss.
If anyone sees duplicated logs and this new error message, please let me know, that would be interesting. (This did not occur in my testing).
Thanks for detailed explanation. 👍🏼
@psyhomb This is expected. The results of my investigation showed that the old sequence token errors should be failures, since they represented cases where we did not get a full response from the CW API, which means we can't trust that the logs were actually received.
So now I make it fail unless we get a full response with a request ID.
I'm not certain why this happens in the first place- why we get a 200 response but it seems to be incomplete. I noticed in my own testing that this 200 response where we can't get the sequence token is never followed by an invalid sequence token exception in the next call. Which almost certainly indicates that CW didn't actually uptake the last batch of logs.
There is some longer term investigation needed here on the deeper root cause, which is something like one of the following options:
- CW API somehow returns an incomplete response sometimes
- Fluent Bit is somehow incorrectly processing the response and thinks its valid when it isn't.
However, for the short term, the fix that was just released should ensure there is no log loss.
If anyone sees duplicated logs and this new error message, please let me know, that would be interesting. (This did not occur in my testing).
In Fargate, we tried to adopt EMF to CloudWatch, but we encountered the same issue; these errors are popping up 10-15 times per hour. We've used the optimized configuration which sets Flush to 1s [1]. Could that be the cause?
In our other service, which receives more requests, these errors appear 40-50 times per hour.
[1] https://github.com/aws/aws-for-fluent-bit/blob/mainline/configs/minimize-log-loss.conf
Be careful when using low flush intervals because you can easily hit CW Logs limits that way. https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html
5 requests per second per log stream. Additional requests are throttled. This quota can't be changed.
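If you need to back off, the flush interval lives in the SERVICE section; a minimal sketch (5 seconds is the default here, while the minimize-log-loss config lowers it to 1, which makes that per-stream quota much easier to hit under load):
[SERVICE]
    # default is 5s; minimize-log-loss.conf sets this to 1s
    Flush 5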
@ShawnZong Can you share the exact log messages?
@PettitWesley Hi, here are the log messages
@ShawnZong thanks. If you can enable debug logging and show me those logs, that might be interesting since I added this debug line: https://github.com/fluent/fluent-bit/blob/master/plugins/out_cloudwatch_logs/cloudwatch_api.c#L1363
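(For reference, a minimal sketch of enabling debug logging via the service section - this should surface the "Sequence token was invalid, will retry" line mentioned above:)
[SERVICE]
    Log_Level debug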
I apologize that this issue continues... I have been entirely unable to figure out a better solution to it than the current stopgap (which prevents log loss but does lead to errors). AFAICT, this issue does seem to be correlated with getting close to CW Log Stream ingestion limits. So anything which helps you stay farther away from those limits might help. That might mean increasing the flush interval. Depends on your exact use case and throughput and many details.
@PettitWesley Hi, the debug log shows that the response body is empty
@PettitWesley I've investigated for a while, and I think it does look like we're getting close to the CW Log Stream ingestion limits. I've tested Fluent Bit with and without this optimized configuration [1]. It turns out that when I use the optimized configuration, which sets the Flush timeout to 1s and Grace to 30s, it always causes the x-amzn-RequestId header not found error. However, if I just use the default settings, which have a Flush timeout of 5s, the header error disappears [2]. Btw, I think the error in [2] is normal, because I was load testing and sending too many log events at the same time could cause network issues?
I've created an AWS support ticket, and Yuriy M. from the CloudWatch team said she would have a look at this issue, though she couldn't explain the cause for now. Nevertheless, she reminded me to check the CloudWatch quotas:
The following limits apply to the PutLogEvents API call [3][4]:
1. 5 requests per second per log stream. Additional requests are throttled. This quota cannot be changed
2. The maximum batch size of a PutLogEvents request is 1MB
3. Total of 800 transactions per second per account per Region. (Except for the following Regions where the quota is a total of 1500 transactions per second per account per Region: US East (N. Virginia), US West (Oregon), and Europe (Ireland))
In Fluent Bit, you can avoid throttling issues by sending log events to a unique log stream per task (see the sketch after the references). Furthermore, it would be best to experiment with increasing the Flush setting in the Service section to be greater than 5 seconds (the default); this might increase the number of events sent per flush, which could decrease the total number of calls to CloudWatch [5].
References:
[1] https://github.com/aws/aws-for-fluent-bit/blob/mainline/configs/minimize-log-loss.conf
[2]
[3] https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html
[4] https://docs.aws.amazon.com/general/latest/gr/api-retries.html
[5] https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/configuration-file
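To illustrate the unique-stream suggestion (sketch only; the group name and the ${HOSTNAME} variable are placeholders, not my real config): with cloudwatch_logs the stream name becomes the log_stream_prefix followed by the tag, and since the 5 requests/second quota is per stream, giving each task its own prefix/tag spreads the load across streams:
[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-west-2
    log_group_name    my-service-logs
    # stream name = <prefix><tag>; each unique stream gets its own 5 req/s quota
    log_stream_prefix ${HOSTNAME}-
    auto_create_group true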
@ShawnZong Thank you! And I agree, my own testing strongly suggested that this is caused by throttling. One other piece of evidence for this is that I am not aware of an equivalent issue with any of the other Fluent Bit core plugins; there are other plugins that need to check the response body. If this were a bug in Fluent Bit's networking, I'd expect to see other plugins having equivalent issues, since all core plugins share the same networking code.
It is still strange though that Fluent Bit either actually receives or thinks it has received a partial http response with a 200 code but no response body.
However, if I just use the default settings, which have a Flush timeout of 5s, the header error disappears [2]. Btw, I think the error in [2] is normal, because I was load testing and sending too many log events at the same time could cause network issues?
Yes this is my understanding as well. I have been told by the CW API team that one of their throttling mechanisms is simply to block connections from the same IP. This then leads to Fluent Bit outputting a connection error message.
FWIW, I also changed from flush 1 to flush 5 recently, and I did notice far fewer (if not zero) warnings like "Recieved code 200 but response was invalid..." compared to before.
Even with a high amount of logs I only saw "[error] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents API responded with error='ThrottlingException', message='Rate exceeded for logStreamName
Hey folks, I had another idea on this one, if anyone is willing please try out the following image and let me know what you see in the logs:
144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:latest
This image has the repo policy here and can be pulled from any AWS account.
I think this patch might reduce the frequency with which you see these errors.
I think this patch might reduce the frequency with which you see these errors.
Hi @PettitWesley - we are facing this issue in a couple of our EKS clusters also. I've just updated to your image in one cluster and will keep an eye out. Will keep you posted...cheers.
Hmmm, seeing pod restarts due to OOM with this one (I'm using the same configs per the AWS/EKS docs), just with the patched image:
State: Running
Started: Thu, 09 Sep 2021 12:01:13 +0100
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Thu, 09 Sep 2021 11:17:06 +0100
Finished: Thu, 09 Sep 2021 12:01:12 +0100
Ready: True
Restart Count: 4
Limits:
memory: 200Mi
Requests:
cpu: 500m
memory: 100Mi
To be honest, we have had enough of all the different issues with Fluent Bit; we switched to Filebeat OSS and so far everything is working flawlessly.
@byrneo this might be the reason? https://docs.fluentbit.io/manual/administration/buffering-and-storage
hey, thanks @ShawnZong - indeed, I am using the AWS configs referenced from here
They do appear to have set: Mem_Buf_Limit 5MB
Just mentioning it in the context of having tried @PettitWesley's patched image ( I don't recall hitting OOM before)
@byrneo Can you share your config? That mem_buf_limit is a bit low for most application logging use cases... I think in the container insights config map the inputs for system logs only have this low limit...
Let me also double check that image for memory leaks and re-build you a new image from an older stable version with my patch...
@byrneo Can you share your config? That mem_buf_limit is a bit low for most application logging use cases... I think in the container insights config map the inputs for system logs only have this low limit...
Sure @PettitWesley - just note, I didn't modify any of the limits in the configMap (I just used the ones linked to in the Container Insights docs)
<snip>
application-log.conf: |
[INPUT]
Name tail
Tag application.*
Exclude_Path /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
Path /var/log/containers/*.log
Docker_Mode On
Docker_Mode_Flush 5
Docker_Mode_Parser container_firstline
Parser docker
DB /var/fluent-bit/state/flb_container.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Rotate_Wait 30
storage.type filesystem
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag application.*
Path /var/log/containers/fluent-bit*
Parser docker
DB /var/fluent-bit/state/flb_log.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag application.*
Path /var/log/containers/cloudwatch-agent*
Docker_Mode On
Docker_Mode_Flush 5
Docker_Mode_Parser cwagent_firstline
Parser docker
DB /var/fluent-bit/state/flb_cwagent.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name kubernetes
Match application.*
Kube_URL https://kubernetes.default.svc:443
Kube_Tag_Prefix application.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
Labels Off
Annotations Off
[OUTPUT]
Name cloudwatch_logs
Match application.*
region ${AWS_REGION}
log_group_name /aws/containerinsights/${CLUSTER_NAME}/application
log_stream_prefix ${HOST_NAME}-
auto_create_group true
extra_user_agent container-insights
dataplane-log.conf: |
[INPUT]
Name systemd
Tag dataplane.systemd.*
Systemd_Filter _SYSTEMD_UNIT=docker.service
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
DB /var/fluent-bit/state/systemd.db
Path /var/log/journal
Read_From_Tail ${READ_FROM_TAIL}
[INPUT]
Name tail
Tag dataplane.tail.*
Path /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
Docker_Mode On
Docker_Mode_Flush 5
Docker_Mode_Parser container_firstline
Parser docker
DB /var/fluent-bit/state/flb_dataplane_tail.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Rotate_Wait 30
storage.type filesystem
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name modify
Match dataplane.systemd.*
Rename _HOSTNAME hostname
Rename _SYSTEMD_UNIT systemd_unit
Rename MESSAGE message
Remove_regex ^((?!hostname|systemd_unit|message).)*$
[FILTER]
Name aws
Match dataplane.*
imds_version v1
[OUTPUT]
Name cloudwatch_logs
Match dataplane.*
region ${AWS_REGION}
log_group_name /aws/containerinsights/${CLUSTER_NAME}/dataplane
log_stream_prefix ${HOST_NAME}-
auto_create_group true
extra_user_agent container-insights
host-log.conf: |
[INPUT]
Name tail
Tag host.dmesg
Path /var/log/dmesg
Parser syslog
DB /var/fluent-bit/state/flb_dmesg.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag host.messages
Path /var/log/messages
Parser syslog
DB /var/fluent-bit/state/flb_messages.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag host.secure
Path /var/log/secure
Parser syslog
DB /var/fluent-bit/state/flb_secure.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
<snip>
I can post the entire config if needed
Let me also double check that image for memory leaks and re-build you a new image from an older stable version with my patch...
Cool - thanks!
@byrneo Sorry, the latest tag I pushed to that ECR repo was built off the latest commits in the 1.8 branch, which may not be completely stable.
I've built the same patch on top of 3 different versions which my team is using for performance testing right now, so we are fairly confident these are stable and do not leak memory. (I should also note that high memory usage or spikes in memory != a memory leak; you might just have too many logs and not enough allocation for the Fluent Bit pod.) Let me know if you (or anyone else) wants me to build this patch on top of any other versions:
144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:1.8.2
144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:1.7.9
144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:1.7.5
Thanks @PettitWesley - would it be possible to patch on top of fluent-bit 1.8.6, which seems to be used by https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.19.1?
@byrneo Please try:
144718711470.dkr.ecr.us-west-2.amazonaws.com/invalid-request-possible-fix:1.8.6
hi @PettitWesley - I gave that image a try today. No OOM terminations on any of the instances (this is a 3-node cluster), but I did have one pod restart after a few hours of operation:
State: Running
Started: Mon, 13 Sep 2021 13:34:18 +0100
Last State: Terminated
Reason: Error
Exit Code: 139
Started: Mon, 13 Sep 2021 10:39:00 +0100
Finished: Mon, 13 Sep 2021 13:34:16 +0100
as well as these (from the same instance):
[2021/09/13 14:29:29] [ warn] [http_client] malformed HTTP response from logs.us-east-1.amazonaws.com:443 on connection #174
[2021/09/13 14:29:29] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2021/09/13 14:29:29] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
[2021/09/13 14:29:29] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events
[2021/09/13 14:29:29] [ warn] [engine] failed to flush chunk '1-1631543369.515077516.flb', retry in 9 seconds: task_id=3, input=tail.0 > output=cloudwatch_logs.0 (out_id=0)
[2021/09/13 14:35:00] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/09/13 14:52:04] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
Of the other two instances, no warnings seen from one, and just this single warning from the other:
[2021/09/13 11:44:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=126888450 watch_fd=40
[2021/09/13 11:48:14] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/09/13 11:49:42] [ info] [input:tail:tail.0] inotify_fs_add(): inode=197138454 watch_fd=42 name=/var/log/containers/<name_redacted>.log
@byrneo So the fix seems to have possibly worked? The HTTP buffer warnings are a separate thing which we submitted a fix for already. Not sure about the single crash; we have seen some stability problems with the recent Fluent Bit releases. But let's focus on the main issue here. Please let me know if it resurfaces. If it doesn't, then I will submit my patch upstream, and hopefully this ticket can be resolved.
@PettitWesley - ok, after another 12 hours, still no errors from 2/3 instances. On the instance that threw the error yesterday, just one more occurrence overnight:
[2021/09/14 05:35:36] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send log events
[2021/09/14 05:35:36] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send log events
[2021/09/14 05:35:36] [error] [output:cloudwatch_logs:cloudwatch_logs.2] Failed to send events
[2021/09/14 05:35:36] [ warn] [engine] failed to flush chunk '1-1631597730.696905125.flb', retry in 6 seconds: task_id=2, input=tail.6 > output=cloudwatch_logs.2 (out_id=2)
Hi folks,
We've been running the 2.12.0 release to ship our logs to CloudWatch with the new cloudwatch_logs plugin. We've been waiting for the fix for renewing STS tokens, so this is our first outing with it. After running reliably for several hours, several of our pods have crashed with:
After that it exits with an error status and Kubernetes replaces the pod.
Curiously, several replicas of fluentbit failed with the same error at once. This makes me wonder if the CloudWatch API was briefly unavailable. But if so, I'd expect the behaviour to be that it retries rather than taking down the whole fluentbit replica.